How to avoid the "missing input files" error with Snakemake's expand function


When I run the following Snakemake code, I get a MissingInputException:

import re
import os

glob_vars = glob_wildcards(os.path.join(os.getcwd(),"inputs","{fileName}.{ext}"))

rule end:
    input:
        expand(os.path.join(os.getcwd(),"{fileName}_rename.fas"),fileName=glob_vars.fileName)

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        expand("inputs/{{fileName}}.{ext}",ext=glob_vars.ext)
    output:
        os.path.join(os.getcwd(),"{fileName}_rename.fas")
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = re.sub(r"\W","_",line.strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s","",line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output),"w") as f:
            f.write("\n".join(list_))

My inputs folder contains the following files:

G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa

The error message is:

Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa

I assume the error is caused by the expand function: only guppy_protein.faa exists, but expand also generates guppy_protein.fasta and guppy_protein.fa. Is there a way around this?

Solution

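By default, expand generates every combination of the values it is given, so this is expected behavior. Instead of expand, the rule needs an input function that looks up the correct extension for a given fileName. I haven't tested this: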

glob_vars = glob_wildcards(os.path.join(os.getcwd(),"inputs","{fileName}.{ext}"))

# create a dict to lookup extensions given fileNames
glob_vars_dict = {fname: ex for fname,ex in zip(glob_vars.fileName,glob_vars.ext)}

def rename_input(wildcards):
    ext = glob_vars_dict[wildcards.fileName]
    return f"inputs/{wildcards.fileName}.{ext}"

rule rename:
    input: rename_input
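
To see why the original rule asked for files that do not exist, and what the lookup changes, here is a plain-Python sketch that runs without Snakemake. It only mimics the behavior described above: the file list is copied from the question, splitting each name at its last dot is an assumption about how glob_wildcards fills fileName and ext, and SimpleNamespace merely stands in for Snakemake's wildcards object.

```python
from types import SimpleNamespace

# Files from the question's inputs folder.
files = [
    "G.bullatarudis.fasta", "goldfish_protein.faa", "guppy_protein.faa",
    "gyrodactylus_salaris.fasta", "protopolystoma_xenopodis.fa",
    "salmon_protein.faa", "schistosoma_mansoni.fa",
]

# Assumption: name and extension are separated at the last dot,
# giving two parallel lists like glob_vars.fileName and glob_vars.ext.
file_names, exts = zip(*(f.rsplit(".", 1) for f in files))

# Combining one fileName with every known extension, as the expand call
# in the original rule does, requests one path per distinct extension.
requested = sorted({f"inputs/guppy_protein.{ext}" for ext in exts})
existing = {f"inputs/{f}" for f in files}
missing = [path for path in requested if path not in existing]

# Pairing the parallel lists keeps each name with its own extension.
ext_by_name = dict(zip(file_names, exts))

def rename_input(wildcards):
    """Return the single input path for the requested fileName."""
    return f"inputs/{wildcards.fileName}.{ext_by_name[wildcards.fileName]}"

resolved = rename_input(SimpleNamespace(fileName="guppy_protein"))
```

Here missing holds the same two paths reported in the error message, while resolved is the one file that actually exists. Note that a dict keyed by name keeps only one extension per name, so this approach presumes each name occurs once in the folder.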

Some unsolicited style comments:

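You don't have to prefix the glob_wildcards pattern with os.getcwd(); glob_wildcards("inputs/{fileName}.{ext}") should work, because by default Snakemake resolves paths relative to the working directory.
Try to stick with snake_case rather than camelCase for variable names in Python.
In this case, fileName isn't a good description of what you are capturing. Maybe species_name or species would be clearer.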

Thanks to Troy Comi, I modified my code and it worked:

import re
import os
import itertools

speciess,exts = glob_wildcards(os.path.join(os.getcwd(),"inputs_test","{species}.{ext}"))

rule end:
    input:
        expand("inputs_test/{species}_rename.fas",species=speciess)

def required_files(wildcards):
    list_combination = itertools.product([wildcards.species],list(set(exts)))
    exist_file = ""
    for file in list_combination:
        if os.path.exists(f"inputs_test/{'.'.join(file)}"):
            exist_file = f"inputs_test/{'.'.join(file)}"
    return exist_file

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        required_files
    output:
        "inputs_test/{species}_rename.fas"
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = ">" + re.sub(r"\W","_",line.replace(">","").strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s","",line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output),"w") as f:
            f.write("\n".join(list_))
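
One thing to watch in the input function above: it returns an empty string when no file matches and silently keeps the last match when several do. A possible variant, sketched here in plain Python so it can be tried outside Snakemake, raises an error instead. The helper name find_input and its directory argument are hypothetical and not part of the original code.

```python
import os
import tempfile

def find_input(species, exts, directory):
    """Return the one existing file <species>.<ext> in directory.

    Raises FileNotFoundError if no candidate, or more than one, exists.
    """
    candidates = [
        os.path.join(directory, f"{species}.{ext}") for ext in sorted(set(exts))
    ]
    existing = [path for path in candidates if os.path.exists(path)]
    if len(existing) != 1:
        raise FileNotFoundError(
            f"expected exactly one input for {species!r}, found {existing}"
        )
    return existing[0]

# Usage with a throwaway directory holding a single empty file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "guppy_protein.faa"), "w").close()
    found = os.path.basename(
        find_input("guppy_protein", ["fa", "faa", "fasta"], tmp)
    )
```

Inside the Snakefile, required_files could then simply return find_input(wildcards.species, exts, "inputs_test").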
