Modifying a Snakefile to run multiple iterations of a workflow

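I have a Snakemake workflow consisting of a Snakefile and a config file. In the Snakefile I specify a job, identified by a number; the numbers are non-sequential (e.g. 210, 215). For each job I can specify, the config file has a corresponding entry holding information about that particular job (parameters such as the year, the number of subjobs, the file prefix and so on, all stored as strings). In the rules, I build the inputs and outputs with expressions like config[job]["year"] so that each job gets the correct strings.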

A simplified example of my workflow, which hopefully shows how I use the information from the config file:

# SNAKEFILE
job=210
rule all:
    input:
        expand(config["outputdir"]+"/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root",sample=config[job]["samples"])
...other rules...
rule filter_2:
    input:
        config["outputdir"]+"/filter-1-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    output:
        config["outputdir"]+"/filter-2-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    log:
        # hypothetical path; the original example uses {log} without declaring a log file
        "logs/filter-2/{sj}/{sample}.log"
    shell:
        "(bash scripts/filter-2.sh {input} {output}) 2> {log}"
...other rules...

# CONFIG.YAML
outputdir: "/home/ghl/outputs"
210:
    prefix: "Real"
    year: "2016"
    origindir: "/home/ghl/files/210"
    subjobs: 2653
    originID: "_abc123"
    samples: ["type1_v1","type1_v2","type2_v1","type2_v2"]

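This works fine when I have a small number of jobs, but I now have around 80 to run, and some take several hours even when submitted to the batch system I have access to. Running each one by hand (run, wait, change the "job" variable, run again) takes a very long time. What I would like is to run multiple jobs (e.g. 210 and 215) from a single run of this Snakefile.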

In Python, I would wrap all of this in a loop, for example:

for job in [1,3,...,210,215]:
    <run single job workflow>
print("Done!")

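I am trying to do the same in my Snakefile. I have tried putting job=jobs in the input of "rule all", as I do for the samples, with jobs=[210,215] defined, and I have tried changing the input to a function that returns the appropriate file names from the list of jobs. Both attempts run into problems related to the fact that "job" is no longer a Python variable in the script but a wildcard, and it is not clear to me how to feed a wildcard into something like config[job]["year"]:
config[{job}]["year"] and config["{job}"] do not work (specifically, they give a NameError or a KeyError).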

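Is there a way to achieve this (ideally without a complete rewrite)? Modifying the Snakefile along the lines I described (or somehow running this workflow from a separate Snakefile?) would be ideal. I imagine it could be done just by replacing every instance of config[job] with something suitable and changing the input of "rule all" to include the list of job numbers...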

Thanks in advance!

Solution

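In case anyone else is wondering how I solved this: it required some rewriting and fairly extensive use of lambda functions. In addition, all files are now prefixed with their job number (I have a bash script, run outside of snakemake, that removes them all). I am sure much of this goes beyond what is strictly needed, but it works well enough for me.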

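I specified a list of jobs in the config: jobs: [j210,j215] (the j prefix is needed because snakemake raises a KeyError if the entries are interpreted as integers rather than strings, for reasons I don't fully understand).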
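
One likely explanation for the KeyError, offered as a sketch and not as a diagnosis of the original setup: a YAML loader reads a bare key such as 210 as an integer, while wildcard values in Snakemake are always strings, so config[wildcards.job] looks up "210" in a dictionary whose key is the integer 210. The dictionaries below are hand-written stand-ins for the parsed config, and lookup_year is a hypothetical helper used only for this illustration.

```python
# Stand-in for what a YAML loader produces for a bare `210:` key (an int key).
config_int_keys = {210: {"year": "2016"}}
# Stand-in for a config whose job keys are strings, e.g. `j210:`.
config_str_keys = {"j210": {"year": "2016"}}


def lookup_year(config, job_wildcard):
    """Return the year for a job, or None if the key is missing."""
    entry = config.get(job_wildcard)
    return None if entry is None else entry["year"]


# A wildcard value is a string, so the integer key is never found.
print(lookup_year(config_int_keys, "210"))       # None
# With string keys the same kind of lookup succeeds.
print(lookup_year(config_str_keys, "j210"))      # 2016
# Converting the wildcard back to an int would also find the original key.
print(lookup_year(config_int_keys, int("210")))  # 2016
```

Under this reading, quoting the keys in the YAML file (for example "210":) might work as an alternative to the j prefix, since quoted keys are loaded as strings.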

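I added an extra make_final rule that depends only on the job, and modified "rule all" accordingly (I also used a lot of wildcard constraints; how many you need may vary). This makes job a wildcard, so a lambda function in input: or params: can reach the job's config entry as config[wildcards.job].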

rule all:
    input:
        expand("completed/{job}.log", job=config["jobs"])

rule make_final:
    # This input is just my final file from the chain of rules. The syntax is awkward because
    # it requires a list expansion: each source job produces 4 output files.
    input:
        lambda wildcards: [(config["outputdir"]+"/{job}_"+config[wildcards.job]["prefix"]+"_test_"+config[wildcards.job]["year"]+config[wildcards.job]["originID"]+"_"+foobar+".root") for foobar in config[wildcards.job]["samples"]]
    output:
        "completed/{job}.log"
    shell:
        "touch {output}"

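The lambda in make_final can also be written as a named function, which may be easier to read and to test outside Snakemake. This is a minimal sketch under stated assumptions: the config dictionary is a hand-written stand-in for the loaded config file, final_files is a hypothetical name, and SimpleNamespace stands in for the wildcards object Snakemake would pass. Unlike the lambda above, it substitutes wildcards.job into the path explicitly instead of leaving "{job}" in the string.

```python
from types import SimpleNamespace

# Stand-in for the loaded config file (values copied from the example entry).
config = {
    "outputdir": "/home/ghl/outputs",
    "j210": {
        "prefix": "Real",
        "year": "2016",
        "originID": "_abc123",
        "samples": ["type1_v1", "type1_v2", "type2_v1", "type2_v2"],
    },
}


def final_files(wildcards):
    """Return the final output files of one job, one file per sample."""
    job = wildcards.job
    info = config[job]
    stem = f"{config['outputdir']}/{job}_{info['prefix']}_test_{info['year']}{info['originID']}"
    return [f"{stem}_{sample}.root" for sample in info["samples"]]


# Stand-in for the wildcards object of the target "completed/j210.log".
files = final_files(SimpleNamespace(job="j210"))
print(files[0])  # /home/ghl/outputs/j210_Real_test_2016_abc123_type1_v1.root
```

In the Snakefile, such a function would be passed by name (input: final_files) in place of the lambda.
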
The earlier rules were modified along these lines, for example:

rule filter_2_mc:
    input:
        # this new approach allows neater/more natural phrasing here, rather than
        # using lots of config[job]["blah"] statements
        config["outputdir"]+"/filter-1-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
    output:
        config["outputdir"]+"/filter-2-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
    shell:
        "bash scripts/filter-2-new.sh {input} {output}"

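The wildcard constraints mentioned earlier matter for a pattern like this one, where several wildcards sit directly next to each other. Snakemake wildcards match the regex .+ unless constrained, so a file name can be split in unintended ways. The sketch below imitates that matching with Python's re module. pattern_to_regex is a hypothetical helper written for this illustration (it is not Snakemake's own code), and the constraint regexes are assumptions about what the job, prefix, year, originID and sample strings look like.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Turn a '{name}'-style pattern into a regex with one named group per wildcard."""
    constraints = constraints or {}
    parts = []
    pos = 0
    for match in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:match.start()]))
        name = match.group(1)
        # Unconstrained wildcards fall back to '.+', mirroring Snakemake's default.
        parts.append(f"(?P<{name}>{constraints.get(name, '.+')})")
        pos = match.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


pattern = "{job}_{prefix}_test_{year}{originID}_{sample}.root"
filename = "j210_Real_test_2016_abc123_type1_v1.root"

# Without constraints the greedy '.+' groups split the name in the wrong places.
loose = pattern_to_regex(pattern).fullmatch(filename).groupdict()
print(loose["year"], loose["originID"], loose["sample"])  # 2016_abc123_type 1 v1

# Assumed constraints, chosen to fit the example values from the config.
constraints = {
    "job": r"j\d+",
    "prefix": r"[A-Za-z]+",
    "year": r"\d{4}",
    "originID": r"_[A-Za-z0-9]+",
    "sample": r"[A-Za-z0-9]+_v\d+",
}
strict = pattern_to_regex(pattern, constraints).fullmatch(filename).groupdict()
print(strict["year"], strict["originID"], strict["sample"])  # 2016 _abc123 type1_v1
```

In a Snakefile, the equivalent restrictions would go into a wildcard_constraints section, either globally or per rule.
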
Some rules need lambda functions for their input: or params: if anything from config[wildcards.job] has to be specified.
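
When many rules need values from config[wildcards.job], a small factory function can replace the repeated lambdas. The following is a sketch only: job_param is a hypothetical helper, the config dictionary is a hand-written stand-in for the loaded config file, and SimpleNamespace stands in for the wildcards object that Snakemake passes to such functions.

```python
from types import SimpleNamespace

# Stand-in for the loaded config file (values copied from the example entry).
config = {"j210": {"subjobs": 2653, "origindir": "/home/ghl/files/210"}}


def job_param(key):
    """Build a function that looks up `key` in the config entry of the current job."""
    def lookup(wildcards):
        return config[wildcards.job][key]
    return lookup


# Each call returns a callable that takes the wildcards object as its argument.
get_subjobs = job_param("subjobs")
get_origindir = job_param("origindir")

wc = SimpleNamespace(job="j210")
print(get_subjobs(wc))    # 2653
print(get_origindir(wc))  # /home/ghl/files/210
```

Inside a rule this would read, for example, params: subjobs=job_param("subjobs"), which is equivalent to writing lambda wildcards: config[wildcards.job]["subjobs"].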

(Apologies, too, if answering my own question and marking it as the accepted answer is not allowed.)
