pandas 按范围合并间隔

Pra*_*ras 5 python bioinformatics pandas

我有一个 pandas 数据框,如下所示:

  chrom  start  end  probability   read
0  chr1      1   10         0.99  read1
1  chr1      5   25         0.99  read2
2  chr1     15   25         0.99  read2
3  chr1     30   40         0.75  read4
Run Code Online (Sandbox Code Playgroud)

我想要做的是合并具有相同染色体(chrom 列)且坐标(开始,结束)重叠的间隔。在某些情况下,如果多个间隔彼此重叠,则即使它们不重叠,也会存在应该合并的间隔。请参阅上述示例中的第 0 行和第 2 行以及下面的合并输出

对于那些合并的元素,我想对它们的概率(概率列)进行求和,并计算“读取”列中的唯一元素。

使用上面的示例将导致以下输出,请注意行 0,1 和 2 已合并:

 chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1
Run Code Online (Sandbox Code Playgroud)

到目前为止,我一直在使用 pybedtools merge 来执行此操作,但事实证明,执行数百万次(我的情况)时速度很慢。因此,我正在寻找其他选择,而 pandas 是显而易见的选择。我知道使用 pandas groupby可以对要合并的列应用不同的操作,例如nuniquesum,这是我需要应用的操作。尽管如此,pandas groupby 仅合并具有精确“chrom”、“start”和“end”坐标的数据。

我的问题是我不知道如何使用 pandas 根据坐标(chrom、start、end)合并行,然后应用求和nunique操作

有没有快速的方法来做到这一点?

谢谢!

PS:正如我在问题中所说的那样,我已经这样做了数百万次,所以速度是一个大问题。因此,我无法使用 pybedtools 或纯 python,它们对于我的目标来说太慢了。

谢谢!

The*_*Cat 2

这是使用pyranges和pandas的答案。它的改进之处在于它的合并速度非常快,很容易并行化,即使在单核模式下也非常快。

设置:

import pandas as pd
import pyranges as pr
import numpy as np

rows = int(1e7)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)

# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1         | 149953099 | 149953199 | +            | 0.7536048547309669   | 0         |
# | chr1         | 184344435 | 184344535 | +            | 0.9358130407479777   | 1         |
# | chr1         | 238639916 | 238640016 | +            | 0.024212603310159064 | 2         |
# | chr1         | 95180042  | 95180142  | +            | 0.027139751993808026 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       |
# | chrY         | 34355323  | 34355423  | -            | 0.8843190383030953   | 999996    |
# | chrY         | 1818049   | 1818149   | -            | 0.23138017743097572  | 999997    |
# | chrY         | 10101456  | 10101556  | -            | 0.3007915302642412   | 999998    |
# | chrY         | 355910    | 356010    | -            | 0.03694752911338561  | 999999    |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Run Code Online (Sandbox Code Playgroud)

执行:

def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"

    return df.merge(prob, on="Cluster").merge(n, on="Cluster")

%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      | Cluster   | ProbSum            | Count     |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   | (int32)   | (float64)          | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1         | 52952     | 53052     | +            | 0.7411051557901921   | 59695     | 70        | 2.2131010082513884 | 3         |
# | chr1         | 52959     | 53059     | +            | 0.9979036360671423   | 356518    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 53029     | 53129     | +            | 0.47409221639405397  | 104776    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 64657     | 64757     | +            | 0.32465233067499366  | 386140    | 88        | 1.3880589602361695 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       | ...       | ...                | ...       |
# | chrY         | 59356855  | 59356955  | -            | 0.3877207561218887   | 9966373   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356865  | 59356965  | -            | 0.4007557656399032   | 9907364   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356932  | 59357032  | -            | 0.33799123310907786  | 9978653   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356980  | 59357080  | -            | 0.055686136451676305 | 9994845   | 8502533   | 1.182153891322546  | 4         |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Run Code Online (Sandbox Code Playgroud)