How to count the total number of unique records, based on columns, across multiple VCF files
I have about 200 files. Each starts with a long header of lines beginning with "#", followed by records in 4 columns, like this:
file_1.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr1 111 A G
chr2 222 G T
chrY 99999 A C
file_2.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr2 222 G T
chrY 7473 C A
For this simple example with just 2 vcfs, the expected total is 5 records.
Each file contains around a million records, and I am trying to count how many records there are in total, with the condition that a record that also appears in any other vcf file must be counted only once. How can I do this? I tried the following in Python, but it got stuck for a very long time:
import os
from pathlib import Path

import vcf

path_gdc_data = Path(os.getcwd() + "/all_vcfs/")

non_dup_records = []  # records seen so far
count = 0
for i, vcf_file in enumerate(path_gdc_data.rglob("*.vcf")):
    vcf_reader = vcf.Reader(filename=str(vcf_file))
    for rec in vcf_reader:
        key = (rec.CHROM, rec.POS, rec.REF, rec.ALT)
        if key not in non_dup_records:
            non_dup_records.append(key)
            count += 1

print(f"Total num of non dup in all vcf are {count}")
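For reference, the `not in` membership test on a list is O(n) per record, which is likely why the loop above crawls on millions of records. A minimal sketch of the same idea with a set (O(1) membership) and plain line splitting instead of pyvcf, assuming whitespace-delimited 4-column records as in the sample files:

```python
def count_unique_records(vcf_paths):
    """Count unique (CHROM, POS, REF, ALT) tuples across many VCF files.

    A set gives O(1) membership checks, unlike the O(n) `in` test on a
    list, which is what makes a list-based loop appear to hang.
    """
    seen = set()
    for vcf_file in vcf_paths:
        with open(vcf_file) as fh:
            for line in fh:
                if line.startswith("#"):  # skip meta lines and the header
                    continue
                fields = line.split()
                # CHROM, POS, REF, ALT are the first four columns here
                seen.add(tuple(fields[:4]))
    return len(seen)
```

Called as, e.g., `count_unique_records(Path("all_vcfs").rglob("*.vcf"))`; on the two sample files above it returns 5.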
Note 1: I included the pandas and dataframe tags in my question to see whether pandas would be a more efficient solution.
A sample vcf file:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Many thanks.
Solution
If the vcf files are sorted and there are no duplicates within a single file, you can use heapq.merge to merge all the files together and then itertools.groupby to group equal records (I assume we want uniqueness across all columns), counting the number of groups. Example:
import vcf
from heapq import merge
from itertools import groupby
filelist = ["file_1.vcf","file_2.vcf"] # <--- create the list from `glob()` for example
vcfs = [vcf.Reader(filename=f) for f in filelist]
uniq = sum(1 for _ in groupby(merge(*vcfs)))
print("Total number of unique records:",uniq)
Prints (with my sample files):
Total number of unique records: 7
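To see why sorted input matters here, the same idea on plain tuples (toy data, not the sample files): heapq.merge interleaves already-sorted streams into one sorted stream, and groupby then collapses each run of equal items into a single group.

```python
from heapq import merge
from itertools import groupby

# Each input list must already be sorted for merge() to yield a sorted stream.
a = [("chr1", 111, "A", "G"), ("chr2", 222, "G", "T"), ("chrY", 99999, "A", "C")]
b = [("chr2", 222, "G", "T"), ("chr2", 444, "G", "T"), ("chrY", 7473, "C", "A")]

# groupby collapses consecutive equal items, so the record shared by both
# lists counts once -- but only because the merged stream is sorted.
uniq = sum(1 for _ in groupby(merge(a, b)))
print(uniq)  # 5
```

If the inputs were unsorted, equal records could end up non-adjacent and groupby would count them separately.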
Edit: to count unique values only on the CHROM, POS, REF, ALT columns:
import vcf
from heapq import merge
from itertools import groupby
filelist = ["file_1.vcf","file_2.vcf"]
vcfs = [
    ((row.CHROM, row.POS, row.REF, row.ALT) for row in vcf.Reader(filename=f))
    for f in filelist
]
uniq = sum(1 for _ in groupby(merge(*vcfs)))
print("Total number of unique records:",uniq)
Another approach, using Pandas:
import io
from pathlib import Path

import pandas as pd

COLS = ["CRHOM", "POS", "REF", "ALT"]  # matches the (typo'd) #CRHOM header in the files

def read_vcf(filepath_or_buffer, usecols=None):
    """Skip the ## meta lines, take column names from the #-header line,
    then let pandas read the remaining data rows."""
    with open(filepath_or_buffer) as vcf:
        while True:
            line = vcf.readline()
            if not line.startswith("##"):
                names = line[1:].split()
                break
        # sep="\t" and delim_whitespace=True are mutually exclusive in pandas;
        # the sample files are whitespace-delimited, so use delim_whitespace.
        return pd.read_table(vcf, delim_whitespace=True, usecols=usecols,
                             header=None, names=names)
# Get vcf file list and load the first to a dataframe
path_gdc_data = Path().cwd() / "all_vcfs"
vcf_files = path_gdc_data.glob("*.vcf")

# df = read_vcf(next(vcf_files), usecols=COLS)
idx = read_vcf(next(vcf_files), usecols=COLS).set_index(COLS).index

# Compare and merge other dataframes
for vcf in vcf_files:
    # df1 = read_vcf(vcf, usecols=COLS)
    idx1 = read_vcf(vcf, usecols=COLS).set_index(COLS).index
    # df = df.merge(df1, on=COLS, how="outer")
    idx = idx.union(idx1)

# print(f"Total num of non dup in all vcf are {len(df)}")
print(f"Total num of non dup in all vcf are {len(idx)}")
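The running count stays correct because `Index.union` deduplicates: each (CRHOM, POS, REF, ALT) tuple appears at most once in the union. A minimal illustration on toy rows (made-up data, not the sample files):

```python
import pandas as pd

COLS = ["CRHOM", "POS", "REF", "ALT"]

df1 = pd.DataFrame(
    [("chr1", 111, "A", "G"), ("chr2", 222, "G", "T")], columns=COLS
)
df2 = pd.DataFrame(
    [("chr2", 222, "G", "T"), ("chr2", 444, "G", "T")], columns=COLS
)

# Union of two MultiIndexes keeps the shared (chr2, 222, G, T) tuple once
idx = df1.set_index(COLS).index.union(df2.set_index(COLS).index)
print(len(idx))  # 3
```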
Example:
file_1.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr1 111 A G
chr2 222 G T
chrY 99999 A C
file_2.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr2 222 G T
chrY 7473 C A
file_3.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr3 333 G T
chrY 7473 C T
Applying this modification to the code above (switching to the commented-out dataframe version) and running:
26c26
< df = df.merge(df1,how="outer")
---
> df = df.merge(df1,how="outer",indicator=f"{vcf.stem}")
>>> df
CRHOM POS REF ALT file_2 file_3
0 chr1 111 A G left_only left_only
1 chr2 222 G T both left_only
2 chrY 99999 A C left_only left_only
3 chr2 444 G T right_only both
4 chrY 7473 C A right_only left_only
5 chr3 333 G T NaN right_only
6 chrY 7473 C T NaN right_only
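The per-file columns above come from `DataFrame.merge`'s `indicator=` argument, which records whether each row came from the left frame, the right frame, or both. A minimal sketch on toy data:

```python
import pandas as pd

left = pd.DataFrame({"CRHOM": ["chr1", "chr2"], "POS": [111, 222]})
right = pd.DataFrame({"CRHOM": ["chr2", "chr3"], "POS": [222, 333]})

# Outer merge on the shared columns; "origin" marks each row's provenance
merged = left.merge(right, how="outer", indicator="origin")
print(merged)
# the shared (chr2, 222) row is marked "both",
# (chr1, 111) "left_only" and (chr3, 333) "right_only"
```

In the answer's loop, `indicator=f"{vcf.stem}"` names that provenance column after each merged file, producing one column per file.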