微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

使用特定符号之间的匹配ID字符串过滤多个fasta文件

如何解决使用特定符号之间的匹配ID字符串过滤多个fasta文件

在这个问题上停留了一段时间,尽管我发现了各种各样类似的问题,但都没有完全适合我的问题或解决了这个问题。所以这是交易:

我有一个input.fasta,格式如下:

>sp|O42363|APOA1_DANRE Apolipoprotein A-I OS=Danio rerio OX=7955 GN=apoa1 PE=2 SV=1
MKFVALALTLLLALGSQANLFQADAPTQLEHYKAAALVYLNQVKDQAEKALDNLDGTDYEQYKLQLSESLTKLQEYAQTTSQALTPYAETISTQLMENTKQLRERVMTDVEDlrsKLEPHRAELYTALQKHIDEYREKLEPVFQEYSALNRQNAEQLRAKLEPLMDDIRKAFESNIEETKSKVVPMVEAVRTKLTERLEDLRTMAAPYAEEYKEQLVKAVEEAREKIAPHTQDLQTRMEPYMENVRTTFAQMYETIAKAIQA
>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETlrsAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWfdrYESGASMATKDWC
>sp|Q6TH01|C10_DANRE Protein C10 OS=Danio rerio OX=7955 GN=si:dkey-29f10.1 PE=2 SV=1
MASAPAQQPTLTVEQARVVLSEVIQAFSVPENAARMEEARESACNDMGKMLQLVLPVATQIQQEVIKAYGFNNEGEGVLKFARLVKMYETQDPEIAAMSVKLKsllLPPLSTPPIGsgiPTS
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR
>sp|P0C7U5|C5AR1_DANRE C5a anaphylatoxin chemotactic receptor 1 OS=Danio rerio OX=7955 GN=c5ar1 PE=3 SV=1
MDDNNSDWTSYDFGNDTIPSPNEISLSHIGTRHWITLVCYGIVFLLGVPGNALVVWVTGFRMPNSVNAQWFLNLAIADLLcclSLPILMVPLAQDQHWPFgalACKLFsgiFYMMMYCSVLLLVVISLDRFLLVTKPVWCQNNRQPRQARILCFIIWILGLLGsspYFAHMEIQHHSETKTVCTGSYSSLGHAWAITIirsFLFFLLPFLIICISHWKVYHMTSSGRRQRDKSSRTLRVILALVLGFFLCWTPLH

一个ids.txt列表,其格式为:

Q90260
Q6PFL6

我想提取所有Fasta序列及其标头,其中ids.txt的ID是标头的元素。 我已经尝试过grep -w -A 2 -Ff id_list.txt input.fasta --no-group-separator > out.fasta,但这没用。

理想情况下,我想通过正则表达式来检查以|开头的每一行的两个>sp间的字符串是否匹配我的idx.txt中的任何ID。如果是这样,可以将该标头和Fasta存储在out.fasta中。

这样out.fasta看起来像这样:

>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
MEATVVATTQLTQDSFYQPFSESLEKQDRECKVLKRQRSSSPELLRCKRRLTFNGLGYTIPQQQPMAVARRNERERNRVKQVNMGFQTLRQHVPNGAANKKMSKVETlrsAVEYIRALQQLLDEHDAVSAVLQCGVPSPSVSNAYSAGPESPHSAYSSDEGSYEHLSSEEQELLDFTTWfdrYESGASMATKDWC
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1
MAAPEQIAGEFENWLNERLDSLEVDREVYGAYILGVLQEEESDEEQKDALQGILSAFLEEETLEEVCQEILKQWTECCSRSGAKSNQADAEVQAIASLIEKQAQIVVKQKEVSEDAKKRKEAVLAQYANVTDDEDEAEEEEQVPVGIPSDKSLFKNTNVEDVLNRRKLQRDQAKEDAQKKKEQDKMQREKDKLSKQERKDKEKKRTQKGERKR

我很确定这可以通过awkgrep来表示,但是我对bash还是陌生的,所以现在很难。

非常感谢! :)

解决方法

使用joinsort

join -t \| -2 2 -o 2.1,2.2,2.3 <(sort ids.txt) <(sort -t \| -k2 input.fasta)

假设input.fasta中没有多余的|字符,并且输出行的顺序也不重要。

,

使用awk

awk -F'[|]' 'NR==FNR{ids[$0];next}$2 in ids' ids.txt input.fasta

说明:

我将ids.txt和input.fasta这两个文件作为输入文件传递给awk。顺序很重要。 -F'[|]'将输入字段定界符设置为|

脚本:

# NR  is the overall record (line) number
# FNR is the record (line) number in the current input file

NR==FNR {   # True as long as we are reading the first input file
    ids[$0]   # Create a key in ids for every id from ids.txt
    next      # Don't process further actions
}

# Because of the 'next' statement above,we'll reach this point only
# when reading the second input file (input.fasta)
$2 in ids   # Print the current line if the second field
            # was found in the ids lookup

输出:

>sp|Q90260|ASL1B_DANRE Achaete-scute homolog 1b OS=Danio rerio OX=7955 GN=ascl1b PE=2 SV=1
>sp|Q6PFL6|CCD43_DANRE Coiled-coil domain-containing protein 43 OS=Danio rerio OX=7955 GN=ccdc43 PE=2 SV=1

更新:事实证明,您也想在火柴下方打印。可以这样实现:

BEGIN {
    FS="[|]"
}

# NR  is the overall record (line) number
# FNR is the record (line) number in the current input file

NR==FNR {   # True as long as we are reading the first input file
    ids[$0]   # Create a key in ids for every id from ids.txt
    next      # Don't process further actions
}

# Because of the 'next' statement above,we'll reach this point only
# when reading the second input file (input.fasta)
$2 in ids  {
    # set or reset a variable p to 2 if the second field
    # was found in the ids lookup
    p = 2
}

# Decrement the variable p on every iteration and check if it
# is greater than 0 after that. If that's true,awk will print 
# the current line
p--> 0

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。