
How to use Singularity and conda wrappers in Snakemake

TLDR: I run into the following error:

"The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash"

I have created a Snakemake workflow and converted it from shell: commands to rule-based package management via Snakemake wrappers:
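For context, the conversion looks roughly like this - a minimal sketch, where the rule names and file paths are illustrative rather than my exact rules:

# Before: tool invoked directly, relying on whatever is installed on the host
rule qc_shell:
    input: "sample_R1.fastq.gz"
    output: "qc/sample_R1_fastqc.html"
    shell: "fastqc {input} --outdir qc/"

# After: the wrapper ships its own conda environment definition,
# so the tool version is pinned per rule
rule qc_wrapper:
    input: "sample_R1.fastq.gz"
    output:
        html="qc/sample_R1_fastqc.html",
        zip="qc/sample_R1_fastqc.zip"
    wrapper: "0.66.0/bio/fastqc"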

However, I ran into problems running this on HPC, and one of the HPC support staff strongly recommended against using conda on any HPC system, because:

"If the builder [of the wrapper] is not super careful, there will be dynamic libraries in the conda environment that depend on host libraries (there are always a couple, because most of the time the builder [...]). I would advise relying on Singularity to build the pipeline; that will make the system more robust." - Anon
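(As an aside, one way to see what they mean is to inspect a conda-installed binary for libraries resolved from the host. This is only an illustrative check, using the conda env hash from one of the logs below; the exact path and binary are assumptions:)

# Illustrative: list shared-library dependencies of a conda-installed binary.
# Anything resolving outside the env prefix (e.g. /lib64/libc.so.6)
# is coming from the host system, not the conda environment.
ldd ~/.snakemake/conda/774ea575/bin/perl | grep -v '.snakemake/conda'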

I did some reading over the weekend, and according to this document, it's possible to combine containers with conda-based package management, by defining a global conda docker container plus per-rule yaml files.

Note: In contrast to the example in the link above (Figure 5.4), which uses a pre-defined yaml and a shell: command, here I use conda wrappers, which download these yaml files into the Singularity container (if I have this right), so I thought the functionality should be the same - see the Note at the end...
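For reference, the pattern from that documentation looks roughly like this - a sketch, assuming a hypothetical per-rule environment file envs/fastqc.yaml; the wrapper-based variant further below replaces the conda:/shell: pair:

singularity: "docker://continuumio/miniconda3:4.5.11"

rule qc:
    input: "sample_R1.fastq.gz"
    output: "qc/sample_R1_fastqc.html"
    conda: "envs/fastqc.yaml"   # env is built by the conda inside the container
    shell: "fastqc {input} --outdir qc/"

with envs/fastqc.yaml containing something like:

channels:
  - bioconda
  - conda-forge
dependencies:
  - fastqc=0.11.9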

Snakefile, config.yaml and samples.txt

Snakefile

# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR","LOG_DIR","BENCHMARK_DIR","QC_DIR","TRIM_DIR","ALIGN_DIR","MARKDUP_DIR","CALLING_DIR","ANNOT_DIR"]
dir_names = ["refs","logs","benchmarks","qc","trimming","alignment","mark_duplicates","variant_calling","annotation"]
dirs_dict = dict(zip(dir_list,dir_names))

import os
import pandas as pd
# getting the samples information (names,path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt",sep='\t',index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names,sample_locations))
# get number of samples
len_samples = len(sample_names)


# Singularity with conda wrappers

singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------

rule all:
    input:
        "resources/vep/plugins","resources/vep/cache"

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",build="Wbcel235",release="100"
    resources:
        mem=1000,time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

config.yaml

# Files
REF_GENOME: "c_elegans.PRJNA13758.WS265.genomic.fa"
GENOME_ANNOTATION: "c_elegans.PRJNA13758.WS265.annotations.gff3"

# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "vep"

samples.txt

sample  location
MTG324  /home/moldach/wrappers/SUBSET/MTG324_SUBSET

Submission:

snakemake --profile slurm --use-singularity --use-conda --jobs 2
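(The slurm profile referenced by --profile slurm is not shown in this post; for completeness, a minimal ~/.config/snakemake/slurm/config.yaml along these lines would be enough to make the submission above work - illustrative only, my actual profile may differ:)

# minimal snakemake profile: each key is the long form of a command-line flag
jobs: 2
use-singularity: true
use-conda: true
cluster: "sbatch --mem={resources.mem} --time={resources.time}"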

Log:

Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   get_vep_cache
        1

[Mon Sep 21 15:35:50 2020]
rule get_vep_cache:
    output: resources/vep/cache
    log: logs/vep/cache.log
    jobid: 0
    resources: mem=1000,time=30

Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/774ea575
[Mon Sep 21 15:36:38 2020]
Finished job 0.
1 of 1 steps (100%) done

Note: leaving --use-conda out of the workflow submission causes get_vep_cache to fail with: /bin/bash: vep_install: command not found
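This makes sense: vep_install is provided by the wrapper's conda environment, not by the base miniconda image. You can confirm what the bare container provides with a quick check (illustrative only):

# conda is present in the global container...
singularity exec docker://continuumio/miniconda3:4.5.11 which conda
# ...but VEP's vep_install is not; it only exists inside the
# wrapper-created conda env that --use-conda activates
singularity exec docker://continuumio/miniconda3:4.5.11 which vep_install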

Log for download_vep_plugins:
Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   download_vep_plugins
        1

[Mon Sep 21 15:35:50 2020]
rule download_vep_plugins:
    output: resources/vep/plugins
    jobid: 0
    resources: mem=1000,time=30

Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/VEP/.snakemake/conda/9f602d9a
[Mon Sep 21 15:35:56 2020]
Finished job 0.
1 of 1 steps (100%) done

The problem arises when adding a third rule, fastqc:

Updated Snakefile:

# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR","LOG_DIR","BENCHMARK_DIR","QC_DIR","TRIM_DIR","ALIGN_DIR","MARKDUP_DIR","CALLING_DIR","ANNOT_DIR"]
# ... (same preamble and global singularity container as above) ...

rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}',QC_DIR=dirs_dict["QC_DIR"],QC_TOOL=config["QC_TOOL"],sample=sample_names,pair=['R1','R2'],ext=['html','zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",build="Wbcel235",release="100"
    resources:
        mem=1000,time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
  return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip")
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
    resources:
        mem=1000,time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
         "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip")
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
    resources:
        mem=1000,time=30
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

The error reported in nohup.out:

Building DAG of jobs...
Pulling singularity image https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0.
CreateCondaEnvironmentException:
The 'conda' command is not available inside your singularity container image. Snakemake mounts your conda installation into singularity. Sometimes, this can fail because of shell restrictions. It has been tested to work with docker://ubuntu, but it e.g. fails with docker://bash
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 247, in create
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 381, in __new__
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 394, in __init__
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/deployment/conda.py", line 417, in _check

Using shell: instead of wrapper:

I changed the wrappers back to shell commands:
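The exact shell rule was lost here, but it was along these lines - a reconstruction, not the verbatim rule, with the fastqc invocation inferred from the log output below:

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip")
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"
    shell:
        "fastqc {input.r1} --outdir {params.dir} &> {log}"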

Here is the error I encountered when submitting:

Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   qc_before_trim_r2
        1

[Mon Sep 21 16:32:54 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.

Activating singularity image /home/moldach/wrappers/SUBSET/VEP/.snakemake/singularity/6740cb07e67eae01644839c9767bdca5.simg
WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset), LC_ALL = (unset), LANG = "en_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read
Waiting at most 60 seconds for missing files.
MissingOutputException in line 84 of /home/moldach/wrappers/SUBSET/VEP/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 60 seconds:
qc/fastQC/before_trim/MTG324_R2_fastqc.html
qc/fastQC/before_trim/MTG324_R2_fastqc.zip
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

The error "Skipping '/home/moldach/wrappers/SUBSET/MTG324_SUBSET/MTG324_R2.fastq.gz' which didn't exist, or couldn't be read" is misleading, because the file does in fact exist...

Update 2

Following Manavalan Gajapathy's advice, I eliminated defining singularity at two different levels (global + per-rule).

Now I use a single container at the global level only, together with wrappers via --use-conda, which creates the conda environments inside the container:

# Directories------------------------------------------------------------------
configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["REF_DIR","LOG_DIR","BENCHMARK_DIR","QC_DIR","TRIM_DIR","ALIGN_DIR","MARKDUP_DIR","CALLING_DIR","ANNOT_DIR"]
dir_names = ["refs","logs","benchmarks","qc","trimming","alignment","mark_duplicates","variant_calling","annotation"]
dirs_dict = dict(zip(dir_list,dir_names))

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt",sep='\t',index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names,sample_locations))
# get number of samples
len_samples = len(sample_names)


# Singularity with conda wrappers

singularity: "docker://continuumio/miniconda3:4.5.11"

# Rules -----------------------------------------------------------------------

rule all:
    input:
        "resources/vep/plugins",
        "resources/vep/cache",
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}',QC_DIR=dirs_dict["QC_DIR"],QC_TOOL=config["QC_TOOL"],sample=sample_names,pair=['R1','R2'],ext=['html','zip'])

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    resources:
        mem=1000,time=30
    wrapper:
        "0.66.0/bio/vep/plugins"

rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",build="Wbcel235",release="100"
    resources:
        mem=1000,time=30
    log:
        "logs/vep/cache.log"
    cache: True  # save space and time with between workflow caching (see docs)
    wrapper:
        "0.66.0/bio/vep/cache"

def getHome(sample):
  return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

rule qc_before_trim_r1:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[0]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R1_fastqc.zip")
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R1.log")
    resources:
        mem=1000,time=30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

rule qc_before_trim_r2:
    input:
        r1=lambda wildcards: getHome(wildcards.sample)[1]
    output:
        html=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.html"),
        zip=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim","{sample}_R2_fastqc.zip")
    params:
        dir=os.path.join(dirs_dict["QC_DIR"],config["QC_TOOL"],"before_trim")
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["QC_TOOL"],"{sample}_R2.log")
    resources:
        mem=1000,time=30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    wrapper:
        "0.66.0/bio/fastqc"

I submitted it the same way as before. However, I still run into an error:

Workflow defines that rule get_vep_cache is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
        1   qc_before_trim_r2
        1

[Tue Sep 22 12:44:03 2020]
Job 0: --- Quality check of raw data with FastQC before trimming.

Activating singularity image /home/moldach/wrappers/SUBSET/OMG/.snakemake/singularity/d7617773b315c3abcb29e0484085ed06.simg
Activating conda environment: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288
Skipping '/work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz' which didn't exist, or couldn't be read
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
        at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):
  File "/home/moldach/wrappers/SUBSET/OMG/.snakemake/scripts/tmpiwwprg5m.wrapper.py", line 35, in <module>
    shell(
  File "/mnt/snakemake/snakemake/shell.py", line 205, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  fastqc qc/fastQC/before_trim --quiet -t 1 --outdir /tmp/tmps93snag8 /work/mtgraovac_lab/MATTS_SCRATCH/rep1_R2.fastq.gz ' 2> logs/fastQC/rep1_R$
[Tue Sep 22 12:44:16 2020]
Error in rule qc_before_trim_r2:
    jobid: 0
    output: qc/fastQC/before_trim/rep1_R2_fastqc.html, qc/fastQC/before_trim/rep1_R2_fastqc.zip
    log: logs/fastQC/rep1_R2.log (check log file(s) for error message)
    conda-env: /home/moldach/wrappers/SUBSET/OMG/.snakemake/conda/c591f288

RuleException:
CalledProcessError in line 97 of /home/moldach/wrappers/SUBSET/OMG/Snakefile:
Command ' singularity exec --home /home/moldach/wrappers/SUBSET/OMG  --bind /home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages:/mnt/snakemake /home/moldach/wrappers/SUBSET/OMG/.snakemake$
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/home/moldach/wrappers/SUBSET/OMG/Snakefile", line 97, in __rule_qc_before_trim_r2
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Reproducibility

To reproduce this, you can download this small dataset:

git clone https://github.com/CRG-CNAG/CalliNGS-NF.git
cp CalliNGS-NF/data/reads/rep1_*.fq.gz .
mv rep1_1.fq.gz rep1_R1.fastq.gz 
mv rep1_2.fq.gz rep1_R2.fastq.gz 
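For this dataset, samples.txt needs to point at wherever the renamed files live; for the run shown in Update 2 above, that was:

sample  location
rep1    /work/mtgraovac_lab/MATTS_SCRATCH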

Update 3: bind mounts

According to the link shared on mounting:

"By default Singularity bind mounts /home/$USER, /tmp, and $PWD into your container at runtime."

Therefore, for simplicity (and also because I was getting errors using --singularity-args), I moved the required files into /home/$USER and tried running from there.
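(For reference, the bind-mount route I could not get working would look something like this - singularity's --bind passed through snakemake's --singularity-args; shown only as the approach I attempted:)

snakemake --profile slurm --use-singularity --use-conda --jobs 4 \
    --singularity-args "--bind /work/mtgraovac_lab/MATTS_SCRATCH"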

(snakemake) [~]$ pwd
/home/moldach


(snakemake) [~]$ ll
total 3656
drwx------ 26 moldach moldach    4096 Aug 27 17:36 anaconda3
drwx------  2 moldach moldach    4096 Sep 22 10:11 bin
-rw-------  1 moldach moldach     265 Sep 22 14:29 config.yaml
-rw-------  1 moldach moldach 1817903 Sep 22 14:29 rep1_R1.fastq.gz
-rw-------  1 moldach moldach 1870497 Sep 22 14:29 rep1_R2.fastq.gz
-rw-------  1 moldach moldach      55 Sep 22 14:29 samples.txt
-rw-------  1 moldach moldach    3420 Sep 22 14:29 Snakefile

and ran it with bash -c "nohup snakemake --profile slurm --use-singularity --use-conda --jobs 4 &".

However, I still get this odd error:

Activating conda environment: /home/moldach/.snakemake/conda/fdae4f0d
Skipping ' 2> logs/fastQC/rep1_R2.log' which didn't exist, or couldn't be read
Failed to process qc/fastQC/before_trim
java.io.FileNotFoundException: qc/fastQC/before_trim (Is a directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at uk.ac.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:73)
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:106)
        at uk.ac.babraham.FastQC.Sequence.SequenceFactory.getSequenceFile(SequenceFactory.java:62)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:159)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:121)
        at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:316)
Traceback (most recent call last):

Why does it think it's being given a directory?

Note: If you submit with --use-conda only, e.g. bash -c "nohup snakemake --profile slurm --use-conda --jobs 4 &", the fastqc rules run without error. However, --use-conda alone is not 100% reproducible - case in point, it did not work on another HPC I tested it on.

The full log in nohup.out when using --printshellcmds can be found at this gist

Solution

TLDR:

The fastqc singularity container used in your qc rules very likely does not have conda available in it, and this does not satisfy what snakemake's --use-conda expects.

Explanation:

You have singularity containers defined at two different levels: 1. the global level, which will be used for all rules unless overridden at the rule level; 2. the per-rule level, which will be used at the rule level.

# global singularity container to use
singularity: "docker://continuumio/miniconda3:4.5.11"

# singularity container defined at rule level
rule qc_before_trim_r1:
    ....
    ....
    singularity:
        "https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0"

When --use-singularity and --use-conda are used together, jobs will be run in a conda environment inside the singularity container. Therefore the conda command must be available inside the singularity container for this to be possible. While your global-level container apparently satisfies this requirement, I am quite sure (though I haven't tested it) this is not the case for your fastqc container.
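You can verify this quickly without running the workflow - an illustrative check, assuming the Galaxy depot images are, as far as I know, minimal single-tool builds:

# works - the miniconda image ships conda
singularity exec docker://continuumio/miniconda3:4.5.11 conda --version

# expected to fail - the fastqc image only ships fastqc and its runtime
singularity pull fastqc.sif https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0
singularity exec fastqc.sif conda --version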

When the --use-conda flag is supplied, snakemake will create the conda environment either locally or inside the container, depending on whether the --use-singularity flag is supplied. Since you are using snakemake-wrappers for your qc rules, and these come with pre-defined conda env recipes, the easiest solution is to simply use the globally-defined miniconda container for all rules; that is, there is no need to use the fastqc-specific container for the qc rules.

If you really do want to use the fastqc container, you should not use the --use-conda flag, but this of course means that all necessary tools must be available in the containers defined globally or per-rule.
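In that case the submission would simply drop the flag, e.g.:

snakemake --profile slurm --use-singularity --jobs 2

and each rule would call its tool directly via shell: (as in the shell-based fastqc sketch earlier), with the tool shipped in the respective global or per-rule container.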
