How to use wildcards correctly in input and output

Question

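I recently decided to start using Snakemake. I can't find anything that fits my needs on Stack Overflow or in the Snakemake documentation. I feel like I'm misunderstanding something and could use an explanation.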

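I'm trying to build a simple Snakemake workflow that takes a fastq file and a sequencing summary file (containing information about the reads) as input, and filters the reads in the fastq file into several files (low.fastq and high.fastq).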

My input data and the Snakefile I'm trying to run are laid out as follows:

.
├── data
│   ├── sequencing-summary-example.txt
│   └── tiny-example.fastq
├── Snakefile
└── split_fastq

This is what I've tried so far:

*imports*
rule targets:
    input:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/low.fastq",
        "split_fastq/high.fastq"
    run:
        * do the thing *

I expected to end up with a directory "split_fastq" containing the "low" and "high" fastq files. Instead I get this error:

Building DAG of jobs...
WildcardError in line 10 of /work/sbsuser/test/roxane/alignement-ont/Snakefile:
Wildcards in input files cannot be determined from output files:
'reads'

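Although this seems to be a very common error, I'm not sure whether I've misunderstood how wildcards work or whether something else is wrong. Am I using "input" and "output" correctly?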

Answer 1

Col*_*lin 6

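The problem is that you have wildcards in the input but not in the output. Wildcards are needed in the output. Think of it this way: by putting wildcards in the input, you are creating a rule that you intend to run separately on many different fastq files. But the output files of that rule would be exactly the same files for each different fastq file. They would overwrite each other! You want to incorporate the wildcards into your output file names so that you get a unique file for each possible input, e.g.: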

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/{seqsum}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

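Now, with tiny-example.fastq as your input, you will get tiny-example.low.fastq and tiny-example.high.fastq as output. If you add a second fastq file, you will get separate low and high output files for that one. But this rule still won't work, because the "seqsum" wildcard is not part of the output either. What you probably want here is for the summary file, sequencing-summary-example.txt, to incorporate the name of the fastq file, e.g. rename it to sequence-summary-tiny-example.txt. Now you can write your rule like this: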

rule split_fastq:
    input:
        "data/{reads}.fastq",
        "data/sequence-summary-{reads}.txt"
    output:
        "split_fastq/{reads}.low.fastq",
        "split_fastq/{reads}.high.fastq"
    run:
        * do the thing *

Now if you add another pair, other-example.fastq and sequence-summary-other-example.txt, your Snakemake pipeline should be able to create other-example.low.fastq and other-example.high.fastq.
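One thing the rule above leaves open is what the targets rule should ask for, since split_fastq/low.fastq no longer matches any output of split_fastq. Here is a minimal sketch in plain Python of how the list of targets could be derived from the files present in data/. Inside a Snakefile, Snakemake's own glob_wildcards and expand helpers do a similar job. The function name split_targets and its arguments are hypothetical, and the summary file naming is assumed to follow the rule above.

```python
from pathlib import Path


def split_targets(data_dir="data", out_dir="split_fastq"):
    """List the low/high outputs to request, one pair per fastq file in data_dir."""
    targets = []
    for fastq in sorted(Path(data_dir).glob("*.fastq")):
        # The value the {reads} wildcard would take for this file.
        reads = fastq.name[: -len(".fastq")]
        # Assumption: the summary file is named sequence-summary-{reads}.txt.
        # Skip fastq files that have no matching summary file.
        if (Path(data_dir) / f"sequence-summary-{reads}.txt").exists():
            targets.append(f"{out_dir}/{reads}.low.fastq")
            targets.append(f"{out_dir}/{reads}.high.fastq")
    return targets
```

With only tiny-example.fastq and sequence-summary-tiny-example.txt in data/, this returns split_fastq/tiny-example.low.fastq and split_fastq/tiny-example.high.fastq, which are the names the targets rule would need to list as its input.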

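Snakemake always works backwards from the way we tend to think. We think of the input first, and then of the output it creates. But Snakemake knows which file it needs to make, and it tries to figure out which inputs it needs in order to make it. So with your original rule, it knew it needed to make low.fastq, and it saw that the split_fastq rule could do that, but it had no way of knowing what the wildcard "reads" in the input should be. Now, with the new rule, it knows it needs to make tiny-example.low.fastq and sees that split_fastq can create output files matching the template {reads}.low.fastq, so it says "hey, if I set reads = tiny-example, then I can use this rule!" Then it looks at the input and says "OK, for input I need {reads}.fastq, and I know reads = tiny-example, so that means I need tiny-example.fastq, which I have!"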
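The backwards reasoning described above can be illustrated with a small sketch in plain Python. It is not Snakemake's actual implementation, only a simplified model that assumes each wildcard matches any non-empty string. The names template_to_regex and resolve_inputs are hypothetical.

```python
import re

WILDCARD = re.compile(r"\{(\w+)\}")


def template_to_regex(template):
    """Turn 'split_fastq/{reads}.low.fastq' into a regex with one named group per wildcard."""
    parts, pos, seen = [], 0, set()
    for m in WILDCARD.finditer(template):
        name = m.group(1)
        parts.append(re.escape(template[pos:m.start()]))
        # A repeated wildcard must match the same text as its first occurrence.
        parts.append(f"(?P={name})" if name in seen else f"(?P<{name}>.+)")
        seen.add(name)
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts) + "$")


def resolve_inputs(target, outputs, inputs):
    """Return the concrete inputs needed to build target, or None if no output matches."""
    for out in outputs:
        match = template_to_regex(out).match(target)
        if match:
            known = match.groupdict()  # wildcard values inferred from the output
            needed = {w for i in inputs for w in WILDCARD.findall(i)}
            missing = needed - set(known)
            if missing:
                raise ValueError(
                    f"wildcards cannot be determined from output: {sorted(missing)}"
                )
            return [i.format(**known) for i in inputs]
    return None
```

Called with the target split_fastq/tiny-example.low.fastq, the outputs split_fastq/{reads}.low.fastq and split_fastq/{reads}.high.fastq, and the inputs data/{reads}.fastq and data/sequence-summary-{reads}.txt, the sketch returns data/tiny-example.fastq and data/sequence-summary-tiny-example.txt. Called with the original rule (outputs without wildcards, inputs with {reads} and {seqsum}), it raises a ValueError, which mirrors the WildcardError in the question.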
