How to count the total number of unique records, based on columns, across multiple VCF files
I have about 200 files. Each starts with a long header of lines beginning with "#", followed by records in 4 columns, like this:
file_1.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr1 111 A G
chr2 222 G T
chrY 99999 A C
file_2.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr2 222 G T
chrY 7473 C A
For this simple example with just 2 vcfs, the expected total is 5 records.
Each file contains around a million records, and I am trying to count how many records there are in total, with the condition that a record that also appears in any other vcf file must be counted only once. How can I do this? I tried the following in Python, but it got stuck for a very long time:
import os
from pathlib import Path

import vcf

path_gdc_data = Path(os.getcwd() + "/all_vcfs/")

non_dup_records = []  # records seen so far
count = 0
for i, vcf_file in enumerate(path_gdc_data.rglob("*.vcf")):
    vcf_reader = vcf.Reader(filename=str(vcf_file))
    for rec in vcf_reader:
        key = (rec.CHROM, rec.POS, rec.REF, rec.ALT)
        if key not in non_dup_records:
            non_dup_records.append(key)
            count += 1

print(f"Total num of non dup in all vcf are {count}")
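For reference, the `not in` membership test on a list is O(n) per record, which is likely why the loop above crawls on millions of records. A minimal sketch of the same idea with a set (O(1) membership) and plain line splitting instead of pyvcf, assuming whitespace-delimited 4-column records as in the sample files:

```python
def count_unique_records(vcf_paths):
    """Count unique (CHROM, POS, REF, ALT) tuples across many VCF files.

    A set gives O(1) membership checks, unlike the O(n) `in` test on a
    list, which is what makes a list-based loop appear to hang.
    """
    seen = set()
    for vcf_file in vcf_paths:
        with open(vcf_file) as fh:
            for line in fh:
                if line.startswith("#"):  # skip meta lines and the header
                    continue
                fields = line.split()
                # CHROM, POS, REF, ALT are the first four columns here
                seen.add(tuple(fields[:4]))
    return len(seen)
```

Called as, e.g., `count_unique_records(Path("all_vcfs").rglob("*.vcf"))`; on the two sample files above it returns 5.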
Note 1: I included the pandas and dataframe tags in my question to see whether pandas would be a more efficient solution.
A sample vcf file:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Many thanks.
Solution
If the vcf files are sorted and there are no duplicates within a single file, you can use heapq.merge to merge all the files together and then itertools.groupby to group equal records (I assume we want uniqueness across all columns), counting the number of groups. Example:
import vcf
from heapq import merge
from itertools import groupby
filelist = ["file_1.vcf","file_2.vcf"] # <--- create the list from `glob()` for example
vcfs = [vcf.Reader(filename=f) for f in filelist]
uniq = sum(1 for _ in groupby(merge(*vcfs)))
print("Total number of unique records:",uniq)
Prints (with my sample files):
Total number of unique records: 7
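To see why sorted input matters here, the same idea on plain tuples (toy data, not the sample files): heapq.merge interleaves already-sorted streams into one sorted stream, and groupby then collapses each run of equal items into a single group.

```python
from heapq import merge
from itertools import groupby

# Each input list must already be sorted for merge() to yield a sorted stream.
a = [("chr1", 111, "A", "G"), ("chr2", 222, "G", "T"), ("chrY", 99999, "A", "C")]
b = [("chr2", 222, "G", "T"), ("chr2", 444, "G", "T"), ("chrY", 7473, "C", "A")]

# groupby collapses consecutive equal items, so the record shared by both
# lists counts once -- but only because the merged stream is sorted.
uniq = sum(1 for _ in groupby(merge(a, b)))
print(uniq)  # 5
```

If the inputs were unsorted, equal records could end up non-adjacent and groupby would count them separately.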
Edit: to count unique values only on the CHROM, POS, REF, ALT columns:
import vcf
from heapq import merge
from itertools import groupby
filelist = ["file_1.vcf","file_2.vcf"]
vcfs = [
    ((row.CHROM, row.POS, row.REF, row.ALT) for row in vcf.Reader(filename=f))
    for f in filelist
]
uniq = sum(1 for _ in groupby(merge(*vcfs)))
print("Total number of unique records:",uniq)
Another approach, using Pandas:
import io
from pathlib import Path

import pandas as pd

COLS = ["CRHOM", "POS", "REF", "ALT"]  # matches the (typo'd) #CRHOM header in the files

def read_vcf(filepath_or_buffer, usecols=None):
    """Skip the ## meta lines, take column names from the #-header line,
    then let pandas read the remaining data rows."""
    with open(filepath_or_buffer) as vcf:
        while True:
            line = vcf.readline()
            if not line.startswith("##"):
                names = line[1:].split()
                break
        # sep="\t" and delim_whitespace=True are mutually exclusive in pandas;
        # the sample files are whitespace-delimited, so use delim_whitespace.
        return pd.read_table(vcf, delim_whitespace=True, usecols=usecols,
                             header=None, names=names)
# Get vcf file list and load the first to a dataframe
path_gdc_data = Path().cwd() / "all_vcfs"
vcf_files = path_gdc_data.glob("*.vcf")

# df = read_vcf(next(vcf_files), usecols=COLS)
idx = read_vcf(next(vcf_files), usecols=COLS).set_index(COLS).index

# Compare and merge other dataframes
for vcf in vcf_files:
    # df1 = read_vcf(vcf, usecols=COLS)
    idx1 = read_vcf(vcf, usecols=COLS).set_index(COLS).index
    # df = df.merge(df1, on=COLS, how="outer")
    idx = idx.union(idx1)

# print(f"Total num of non dup in all vcf are {len(df)}")
print(f"Total num of non dup in all vcf are {len(idx)}")
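The running count stays correct because `Index.union` deduplicates: each (CRHOM, POS, REF, ALT) tuple appears at most once in the union. A minimal illustration on toy rows (made-up data, not the sample files):

```python
import pandas as pd

COLS = ["CRHOM", "POS", "REF", "ALT"]

df1 = pd.DataFrame(
    [("chr1", 111, "A", "G"), ("chr2", 222, "G", "T")], columns=COLS
)
df2 = pd.DataFrame(
    [("chr2", 222, "G", "T"), ("chr2", 444, "G", "T")], columns=COLS
)

# Union of two MultiIndexes keeps the shared (chr2, 222, G, T) tuple once
idx = df1.set_index(COLS).index.union(df2.set_index(COLS).index)
print(len(idx))  # 3
```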
Example:
file_1.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr1 111 A G
chr2 222 G T
chrY 99999 A C
file_2.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr2 222 G T
chrY 7473 C A
file_3.vcf
##some_comments that span many lines
##some_comments that span many lines
#CRHOM POS REF ALT
chr2 444 G T
chr3 333 G T
chrY 7473 C T
Applying this modification to the code above (switching to the commented-out dataframe version) and running:
26c26
< df = df.merge(df1,how="outer")
---
> df = df.merge(df1,how="outer",indicator=f"{vcf.stem}")
>>> df
CRHOM POS REF ALT file_2 file_3
0 chr1 111 A G left_only left_only
1 chr2 222 G T both left_only
2 chrY 99999 A C left_only left_only
3 chr2 444 G T right_only both
4 chrY 7473 C A right_only left_only
5 chr3 333 G T NaN right_only
6 chrY 7473 C T NaN right_only
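The per-file columns above come from `DataFrame.merge`'s `indicator=` argument, which records whether each row came from the left frame, the right frame, or both. A minimal sketch on toy data:

```python
import pandas as pd

left = pd.DataFrame({"CRHOM": ["chr1", "chr2"], "POS": [111, 222]})
right = pd.DataFrame({"CRHOM": ["chr2", "chr3"], "POS": [222, 333]})

# Outer merge on the shared columns; "origin" marks each row's provenance
merged = left.merge(right, how="outer", indicator="origin")
print(merged)
# the shared (chr2, 222) row is marked "both",
# (chr1, 111) "left_only" and (chr3, 333) "right_only"
```

In the answer's loop, `indicator=f"{vcf.stem}"` names that provenance column after each merged file, producing one column per file.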