如何解决使用蛇形通配符重命名文件
我在解决snakemake 中遇到的问题时遇到了麻烦。我的样本目前被命名为“1-2-Brain_187_006_S77_L002_R1_001.fastq.gz”。我想最终将它们重命名为更短的名称,如“1_2_Brain_S77_L002_R1”,然后为我的规则使用扩展名“_trim.fastq.gz”。我正在使用 bbduk 进行修剪。对于我的输入,我想调用我的字典列表 allSamples。然后我想访问每个字典中的值。具体来说,“shortName1”和“shortName2”值。我的问题是我的问题,它将整个列表显示为单次运行的输入。我不确定如何使它注册每个元素都是它自己的运行。当我实际上有 114 个文件名时,我将显示 3 个文件名作为示例。因此,我希望我的试运行有 114 个修剪作业的计数。
config.json
{
"allSamples" : ['1_2_Brain_S77_L002','10_4_Kidney_S82_L002','11_4_BB_S105_L002' ......],"1_2_Brain_S77_L002":{
"sampleName1": "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz","sampleName2": "1-2-Brain_187_006_S77_L002_R2_001.fastq.gz","shortName1": "1_2_Brain_S77_L002_R1","shortName2": "1_2_Brain_S77_L002_R2","stemName": "1_2_Brain_S77_L002"
},....
}
我将位于 rawReads/ 中的文件存储在 trimmedReads/ 中。
蛇形文件
configfile: "refs/config.json"
# variables
sampleDict = config["allSamples"]
sampleNames1 = [config[i]["sampleName1"] for i in sampleDict]
sampleNames2 = [config[i]["sampleName2"] for i in sampleDict]
shortNames1 = [config[i]["shortName1"] for i in sampleDict]
shortNames2 = [config[i]["shortName2"] for i in sampleDict]
rule all:
input:
expand("trimmedReads/{trim1}_trim.fastq.gz",trim1 = shortNames1),expand("trimmedReads/{trim2}_trim.fastq.gz",trim2 = shortNames2)
rule trim:
input:
R1 = expand("rawReads/{sample1}",sample1 = sampleNames1),R2 = expand("rawReads/{sample2}",sample2 = sampleNames2)
output:
trim1 = expand("trimmedReads/{trim1}_trim.fastq.gz",trim2 = expand("trimmedReads/{trim2}_trim.fastq.gz",trim2 = shortNames2)
shell:
"""
bbduk.sh in1={input.R1} in2={input.R2} out1={output.trim1} out2={output.trim2} ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
"""
当我进行试运行时,我得到了这个。
Building DAG of jobs...
Job counts:
count jobs
1 all
1 trim
2
[Mon May 24 22:42:36 2021]
rule trim:
input: rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz,rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz,rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz,...
output: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz,trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz,trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz,...
jobid: 1
bbduk.sh in1=rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz ... out1=trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz ... ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
[Mon May 24 22:42:36 2021]
localrule all:
input: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz,...
jobid: 0
Job counts:
count jobs
1 all
1 trim
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
解决方法
rule test:
的通配符是一个空字典。此规则中未指定通配符值 wildcards.sample
。每个通配符都应在 output:
部分中指定,并且此部分对于此规则为空。实际上,除非您将 rule test:
明确指定为目标,否则您的 ["rawReads/1_2_Brain_S77_L002","rawReads/17_6_Brain_S83_L002"]
绝对没有任何效果:如果没有指定任何输出,Snakemake 只会忽略这条没有产生任何结果的无用规则。
我猜文件 rule all:
input: ["path_to_target/foo_SampleName1_bar","path_to_target/foo_SampleName2_bar"]
# List the files you expect to get as a target
rule copy:
input:
"path_to_source/blablabla_{sample}_bazz"
output:
"path_to_target/foo_{sample}_bar"
shell:
"echo {input}; cp {input} {output}"
已经存在,所以 Snakemake 发现目标存在于磁盘上,并且什么都不做,产生“无输出”。
我不明白您所说的“最终将它们重命名为较短的名称”是什么意思,但这里有一个如何复制文件的方法。将此视为“如何使用通配符访问我的示例名称”的模式:
rule copy:
工作原理:
- Snakemake 发现它需要生成一些文件(如果这些文件是“path_to_target/foo_SampleName1_bar”、“path_to_target/foo_SampleName2_bar”)。
- Snakemake 发现
{sample}
声明输出(如果将"SampleName1"
替换为值"path_to_target/foo_SampleName1_bar"
)与文件名"path_to_source/blablabla_SampleName1_bazz"
匹配 - 如果文件
"path_to_target/foo_SampleName1_bar"
存在,则 Snakemake 满意,并且知道如何生成文件{sample}
- 对
"SampleName2"
值rule copy:
重复步骤 2、3。 - 现在它知道
I=int R=range listInput = lambda : [I(x) for x in input().strip(" ")] for x in R(I(input())):
应运行两次:每个文件一次。 - 所有依赖都解决了,Snakemake 可以启动管道了。
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。