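How to parse two files, merge their data, and create a new FASTA file
I have two files: human.fa and protein-coding_gene.txt (which holds information on hundreds of different proteins). I have to parse the protein-coding gene file first, then parse human.fa (10 protein names), and then merge the two into a new FASTA file.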
protein-coding_gene.txt:
Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1
Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2
human.fa:
>Protein1 Sequence1
>Protein2 Sequence2
I need a new FASTA file with this output:
>Protein1 Synonyms1 Chromosome1 Sequence1
>Protein2 Synonyms2 Chromosome2 Sequence2
My current code is:
class Protein:
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split("\t")
        Name = parseline[2]
        Synonyms = parseline[6]
        Chromosome = parseline[7]
        Proteins.append(Protein(Name, Synonyms, Chromosome))
f = open("human.fa")
seqs = {}
for i in f:
    line = i.strip()
    if line[0] == '>':
        l = line.split()
        gene = l[0][1:]
        seqs[gene] = ''
    else:
        seqs[gene] = seqs[gene] + line
f.close()
for p in Proteins:
    print(p.Name, p.Synonyms, p.Chromosome, sep=",")
for name, seq in seqs.items():
    print(name, seq)
from Bio import SeqIO
newhuman = []
SeqIO.write(newhuman, "fastaML.fa", "fasta")
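Right now it prints everything I want from the protein-coding file (name, synonyms, chromosome) and prints the entire human.fa file. I need it to sort through them and write a FASTA file containing only the 10 protein names from human.fa, combining the information from protein-coding_gene.txt with the sequences. Any help would be greatly appreciated.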
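Solution
The format you want is not valid FASTA, because in FASTA the sequence goes on the lines after the ">" header line, not on the header line itself. If you still want exactly that output in fastaML.fa, you should not use the SeqIO.write() method. Use basic file handling instead.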
class Protein:
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome
        self.Sequence = None  # stays None unless human.fa contains this protein
    def add_sequence(self, Sequence):
        self.Sequence = Sequence

Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        # Whitespace-separated columns, as in the sample above.
        # Use split("\t") and your own column indexes for the real tab-separated file.
        parseline = line.split()
        if len(parseline) < 5:
            continue  # skip blank or incomplete lines
        Name = parseline[0]
        Synonyms = parseline[1:4]
        Chromosome = parseline[4]
        Proteins.append(Protein(">" + Name, Synonyms, Chromosome))

seqs = {}
gene = ""
with open("human.fa") as f:
    for i in f:
        line = i.strip()
        if not line:
            continue  # skip blank lines
        if line[0] == '>':
            l = line.split()
            gene = l[0]
            seqs[gene] = "".join(l[1:])  # sequence text on the header line, if any
        else:
            seqs[gene] = seqs[gene] + line

# Attach each sequence to its protein with a dictionary lookup.
for p in Proteins:
    if p.Name in seqs:
        p.add_sequence(seqs[p.Name])

with open('fastaML.fa', 'w') as file:
    for p in Proteins:
        if p.Sequence is None:
            continue  # only write the proteins that appear in human.fa
        file.write(" ".join([p.Name] + p.Synonyms + [p.Chromosome, p.Sequence]) + "\n")
# Fields are separated by a single space here. You can change the separator as needed.
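If a valid FASTA file is acceptable, the same merge can keep the metadata on the header line and put the sequence on the following lines. The sketch below is only one way to do that, under stated assumptions: the gene table is whitespace-separated with the five columns shown in the sample, human.fa follows the usual FASTA layout (sequence on the lines after each header), and the function names parse_gene_table, parse_fasta and merge_to_fasta are hypothetical helpers, not part of Biopython or the original code.

```python
import textwrap

def parse_gene_table(lines):
    """Map protein name -> (synonyms, chromosome).

    Assumes five whitespace-separated columns per row, as in the sample.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete rows
        table[fields[0]] = (fields[1:4], fields[4])
    return table

def parse_fasta(lines):
    """Map record name -> sequence, joining all lines that follow a header."""
    seqs = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs

def merge_to_fasta(gene_lines, fasta_lines, width=60):
    """Return FASTA text for records present in both inputs, sorted by name."""
    table = parse_gene_table(gene_lines)
    seqs = parse_fasta(fasta_lines)
    out = []
    for name in sorted(seqs):
        if name not in table:
            continue  # no metadata for this record, so leave it out
        synonyms, chromosome = table[name]
        out.append(">" + " ".join([name] + synonyms + [chromosome]))
        # Wrap the sequence to fixed-width lines, as FASTA files usually do.
        out.extend(textwrap.wrap(seqs[name], width))
    return "\n".join(out) + "\n"
```

Both parsers take any iterable of lines, so open file objects can be passed directly, for example merge_to_fasta(open('protein-coding_gene.txt'), open('human.fa')), and the returned text written to fastaML.fa. Iterating over the records from human.fa rather than over the gene table is what limits the output to the 10 proteins in that file.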