Snakemake slows down exponentially when processing a large number of files


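I am currently writing a pipeline that generates positive RNA sequences, shuffles them, and then analyses both the positive sequences and the shuffled (negative) sequences. For example, I want to generate 100 positive sequences and shuffle each of them 1000 times with each of three different algorithms. To do this I use two wildcards, pos_index (named sample_index in the simplified Snakefile below) and pred_index, which range from 0 to 100 and from 0 to 1000 respectively. In the final step, all files are analysed with three further tools.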
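To make the scale concrete, the counts described above can be tallied in a few lines of plain Python. This is only a rough sketch: it assumes one job per output file, and the helper name `count_jobs` and the step labels are made up for illustration, not part of Snakemake or of my pipeline.

```python
# Counts taken from the description: 100 positive samples, 1000 shuffles
# per sample, 3 shuffling tools and 3 prediction tools.
ITER_POS = 100
ITER_PRED = 1000
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]


def count_jobs(n_pos, n_pred, n_shuffle, n_tools):
    """Return per-step job counts, assuming one job per output file.

    Hypothetical helper for illustration only.
    """
    counts = {
        "generate": n_pos,
        "shuffle": n_pos * n_pred * n_shuffle,
        "analyse_pos": n_pos * n_tools,
        "analyse_neg": n_pos * n_pred * n_shuffle * n_tools,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_jobs(ITER_POS, ITER_PRED, len(SHUFFLE_TOOLS), len(PRED_TOOLS))
for step, n in counts.items():
    print(f"{step}: {n:,}")
```

Under these assumptions the workflow contains on the order of 1.2 million jobs, almost all of them in the shuffling and negative-sample analysis steps, so any per-job overhead is multiplied accordingly.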

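Now to my problem: building the DAG takes several hours, and the actual execution of the pipeline is even slower. Once it starts, it runs a batch of 32 jobs (because I allotted 32 cores to Snakemake) and then takes 10 to 15 minutes before the next batch starts (I guess because of some file checks). A complete run of the pipeline would take roughly 2 months.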

Below is a simplified example of my Snakefile. Is there any way to optimize it so that Snakemake and its overhead are no longer the bottleneck?

ITER_POS = 100
ITER_PRED = 1000

SAMPLE_INDEX = range(0,ITER_POS)
PRED_INDEX = range(0,ITER_PRED)

SHUFFLE_TOOLS = ["1","2","3"]
PRED_TOOLS = ["A","B","C"]

rule all:
    input:
        # Expand for negative sample analysis
        expand("predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt", pred_tool=PRED_TOOLS, shuffle_tool=SHUFFLE_TOOLS, sample_index=SAMPLE_INDEX, pred_index=PRED_INDEX),
        # Expand for positive sample analysis
        expand("predictions_{pred_tool}/pos_sample_{sample_index}.txt", pred_tool=PRED_TOOLS, sample_index=SAMPLE_INDEX)


# GENERATION
rule generatePosSample:
    output: "samples/pos_sample_{sample_index}.clu"
    shell:  "sequence_generation.py > {output}"


# SHUFFLING
rule shufflePosSamples1:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples2:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_2_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples3:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_3_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"


# ANALYSIS
rule analysePosSamplesA:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analysePosSamplesB:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_B/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analysePosSamplesC:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_C/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

rule analyseNegSamplesA:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_A/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analyseNegSamplesB:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_B/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analyseNegSamplesC:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_C/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"
