How to resolve Snakemake's exponential slowdown when processing a large number of files
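I am currently writing a pipeline that generates positive RNA sequences, shuffles them, and then analyses both the positive sequences and the shuffled (negative) sequences. For example, I want to generate 100 positive sequences and shuffle each of them 1000 times with three different algorithms. To do this, I use two wildcards (sample_index and pred_index) that run from 0 to 99 and from 0 to 999, respectively. As a final step, all files are analysed with three further tools.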
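Now to my problem: building the DAG takes literally hours, and the actual execution of the pipeline is even slower. When it starts, it runs a batch of 32 jobs (because I allocated 32 cores to snakemake), and then it takes 10 to 15 minutes before the next batch of jobs is launched (I guess because of some file checks). A complete run of the pipeline would take about 2 months.
Below is a simplified example of my Snakefile. Is there any way I can optimise it so that snakemake and its overhead are no longer the bottleneck?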
ITER_POS = 100
ITER_PRED = 1000
SAMPLE_INDEX = range(0, ITER_POS)
PRED_INDEX = range(0, ITER_PRED)
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]

rule all:
    input:
        # Expand for negative sample analysis
        expand("predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt",
               pred_tool = PRED_TOOLS,
               shuffle_tool = SHUFFLE_TOOLS,
               sample_index = SAMPLE_INDEX,
               pred_index = PRED_INDEX),
        # Expand for positive sample analysis
        # (pred_tool must be supplied here too, otherwise expand() cannot fill the pattern)
        expand("predictions_{pred_tool}/pos_sample_{sample_index}.txt",
               pred_tool = PRED_TOOLS,
               sample_index = SAMPLE_INDEX)

# GENERATION
rule generatePosSample:
    output: "samples/pos_sample_{sample_index}.clu"
    shell: "sequence_generation.py > {output}"

# SHUFFLING
rule shufflePosSamples1:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell: "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples2:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_2_{sample_index}_{pred_index}.clu"
    shell: "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples3:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_3_{sample_index}_{pred_index}.clu"
    shell: "sequence_shuffling.py {input} > {output}"

# ANALYSIS
rule analysePosSamplesA:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell: "sequence_analysis_A.py {input} > {output}"

rule analysePosSamplesB:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "predictions_B/pos_sample_{sample_index}.txt"
    shell: "sequence_analysis_B.py {input} > {output}"

rule analysePosSamplesC:
    input: "samples/pos_sample_{sample_index}.clu"
    output: "predictions_C/pos_sample_{sample_index}.txt"
    shell: "sequence_analysis_C.py {input} > {output}"

rule analyseNegSamplesA:
    input: "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_A/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell: "sequence_analysis_A.py {input} > {output}"

rule analyseNegSamplesB:
    input: "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_B/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell: "sequence_analysis_B.py {input} > {output}"

rule analyseNegSamplesC:
    input: "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_C/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell: "sequence_analysis_C.py {input} > {output}"
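
For a sense of scale, the number of jobs implied by the constants in the Snakefile can be tallied with a few lines of plain Python. This is only a rough sketch, not Snakemake output. It assumes one job per output file, assumes the positive samples are analysed by all three prediction tools, and leaves out the single job for the all target. The helper name count_jobs is hypothetical.

```python
def count_jobs(iter_pos=100, iter_pred=1000, n_shuffle_tools=3, n_pred_tools=3):
    """Tally jobs per stage, assuming one job per output file."""
    # One generated positive sample per sample_index.
    generate = iter_pos
    # Each positive sample is shuffled iter_pred times by each shuffle tool.
    shuffle = n_shuffle_tools * iter_pos * iter_pred
    # Each positive sample is analysed once per prediction tool.
    analyse_pos = n_pred_tools * iter_pos
    # Each shuffled (negative) sample is analysed once per prediction tool.
    analyse_neg = n_pred_tools * shuffle
    counts = {
        "generate": generate,
        "shuffle": shuffle,
        "analyse_pos": analyse_pos,
        "analyse_neg": analyse_neg,
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for stage, n in count_jobs().items():
        print(f"{stage:>12}: {n:,}")
```

With the defaults above this comes to 300,000 shuffling jobs and 900,000 negative-sample analysis jobs, about 1.2 million jobs in total, which gives an idea of how large the DAG is that snakemake has to build and track.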