
Comparing a FASTA file with a txt file full of sequence IDs

I need help because I'm stuck. I have a txt file with sequence IDs that looks like this:

tr|K9RTD0|K9RTD0_SYNP3

tr|K9RSV3|K9RSV3_SYNP3

tr|K9RRE8|K9RRE8_SYNP3

tr|K9RMU9|K9RMU9_SYNP3

Then I have a typical FASTA file:

>sp|P00115|CYC6_SYNP3 Cytochrome c6 OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=petJ PE=1 SV=2
MKTLLTILALTLVTLTTWLSTPAFAADIADGAKVFSANCAACHMGGGNVVMANKTLKKEA
LEQFGMNSADAIMYQVQNGKNAMPAFGGRLSEAQIENVAAYVLDQSSKNWAG
>tr|K9RTH7|K9RTH7_SYNP3 N-acyl-D-glucosamine 2-epimerase OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_2130 PE=4 SV=1
MAPQINFPFSDLIAGYVTSYDTETDIFGLKTSDGREFPVKLSPMAYAKVIQNFDEGYPDA
TSTMRAWLTPGRFLFVYGVFYPDTDVFDAKQVVFAGKKEDDYVFEKQDWWIQQINALGKF
YVKAQFGQEEIDYRNYRTDLSVSGERSSVKFRQETDTISRLVYGFATAFMMTGDEVFLEA
AEKGTEYLRDHMRFVDRDEDIIYWYHGIDVQGEKELKIFASEFGDDYDAIPAYEQIYALA
GPIQTYRCTGDPRILSDAEQTIKLFDKFFLDQSEYGGYFSHIDPLMLDPRSDSLGRNKAR
KNWNSVGDHAPAYLINLWLATGEQKYADMLEYTFDTIEKYFPDYENSPFVQERFYEDWSH
DTTWGWQQNRAVVGHNLKIAWNLMRMQSLKPKEQYVGLAQKIADLMPSVGSDQQRGGWSD
TVERLLTNNSKFHQFVWHDRKAWWQQEQAILAYLILGGILEHDDYHRLGREAAAFYNAWF
LDLEDGGVYFNVLANGISYLARGNERAKGSHSMSGYHSFELCYLAAVYTNFLITKHPMDF
YFKPLPNGFPDRILRVSPDILPPGSILLESVEIDGKAYTDFDSQALTVKLPETKERVKVK
VRLAPKS
>tr|K9RXQ9|K9RXQ9_SYNP3 Uncharacterized protein OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_3008 PE=4 SV=1
MKVEILKKRLNKECPMTTTRMPEDVIQELKQIASLLVFWGYQPLIGADIGQGLRTDLEQL
EDDKVSALVASLKRHRVSDEVLQTALMETTIN

I need to compare these two files and, based on the IDs, look up each sequence's description and print it. My code:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

p = "proteome.fasta"
file = "reference.txt"
out = "jopik.txt"


with open(out,"w") as o:
    sys.stdout = o
    for seq_record in SeqIO.parse(open(p,mode = "r"),"fasta"):
        seq_record.description=' '.join(seq_record.description.split()[1:])
        with open(file,"r") as f:
            line = f.readlines()
            print(line)
            if (seq_record.id == line):
                    i = seq_record.description
                    print(i)

Solution

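You are only missing some kind of loop, `for x in y:`. File handles in Python are iterable (line by line in non-binary mode), which saves you from loading the whole file into memory before you start iterating (as `.readlines()` does).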
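To see why the comparison in the question never matches, here is a minimal sketch. It uses an in-memory `io.StringIO` object as a stand-in for `reference.txt` (an assumption made only so the snippet runs without files); the IDs are the ones from the question.

```python
import io

# Stand-in for reference.txt; a real file handle behaves the same way.
reference = io.StringIO("tr|K9RTD0|K9RTD0_SYNP3\n\ntr|K9RSV3|K9RSV3_SYNP3\n")

lines = reference.readlines()   # a list of strings, newlines included
record_id = "tr|K9RTD0|K9RTD0_SYNP3"

print(record_id == lines)       # False: a str never equals a list
print(record_id in lines)       # False: each entry still ends in "\n"

reference.seek(0)
# Iterating the handle yields one line at a time; strip() drops the newline,
# and the condition skips the blank lines between IDs.
ids = {line.strip() for line in reference if line.strip()}
print(record_id in ids)         # True
```

Collecting the stripped IDs once, before reading the FASTA file, also avoids reopening the ID file for every sequence record.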

# load the first file into a helper structure (ID -> description, None until found)
compare_dict = {}
with open("reference.txt") as fh:
    for line in fh:
        if line.strip():  # skip blank lines; a bare `if line:` is always true because of the newline
            compare_dict[line.strip()] = None

# form a tuple of possible prefixes; the trailing space keeps an ID
# from also matching a longer ID that merely starts with it
compare_tuple = tuple(">" + a + " " for a in compare_dict)

with open("proteome.fasta") as fh:
    for line_no, line in enumerate(fh, 1):  # lines start at 1, not 0
        if line.startswith(compare_tuple):
            key, value = line.split(" ", 1)
            key = key[1:]  # strip ">" from prefix
            value = value.strip()  # drop the trailing newline
            compare_dict[key] = value
            print("found {} on L{}: {}".format(key, line_no, value))

# optionally display keys which were not in your .fasta file
for key, value in compare_dict.items():
    if value is None:
        print("failed to find a definition for {}".format(key))
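A variant of the same idea can be sketched as a small function that looks each header's ID up directly in the dictionary instead of testing string prefixes, which also copes with headers that carry no description. The function name `find_descriptions` is hypothetical, and the sample data below is shortened from the question's files and fed in through `io.StringIO` so the sketch runs on its own; with real data you would pass the two open file handles instead.

```python
import io

def find_descriptions(id_lines, fasta_lines):
    """Map each wanted ID to its FASTA description, or None if not found."""
    wanted = {line.strip(): None for line in id_lines if line.strip()}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Header layout is ">ID description"; the description may be empty.
        seq_id, _, description = line[1:].strip().partition(" ")
        if seq_id in wanted:
            wanted[seq_id] = description
    return wanted

# Shortened sample data adapted from the question.
ids = io.StringIO("tr|K9RTH7|K9RTH7_SYNP3\n\ntr|K9RTD0|K9RTD0_SYNP3\n")
fasta = io.StringIO(
    ">sp|P00115|CYC6_SYNP3 Cytochrome c6 OS=Synechococcus sp.\n"
    "MKTLLTILALTLVTLTTWLSTPAFAAD\n"
    ">tr|K9RTH7|K9RTH7_SYNP3 N-acyl-D-glucosamine 2-epimerase OS=Synechococcus sp.\n"
    "MAPQINFPFSDLIAGYVTSYDTETDIF\n"
)

for seq_id, description in find_descriptions(ids, fasta).items():
    print(seq_id, "->", description if description is not None else "not found")
```

Both inputs are read one line at a time, so neither file has to fit in memory; only the dictionary of wanted IDs is kept.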
