
Finding patterns that occur twice, with <=2 mismatches allowed per pattern


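I have a fastq file with 400,000 reads (so speed matters). Embedded in each sequence are barcodes that should occur twice. Given a barcode, I want to find the reads in which that barcode occurs twice, allowing <=2 mismatches per occurrence. Example sequences: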

TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGTGGCAGGGAGAGGCGAGGAGTGAGCGAAAGAGGGTGACTGAACTGAACTGAACTGAAAGTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCCTTTTTTT TATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGGAAGGGAGGAGGAGTGATGGCCGAGGGAGGATGAACTGACCT TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAGAGAGGGAGGAGGAGGAAGCGAGGGAGGATGAACTGAACTGAGTGGATGAACTGAAA TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTATTCGACGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAGAGAGGGAGGAGGAGGAAGAGAGGGAGGATGAACTGAAAGGCAGTGGATGAACTGAAA

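Note that the first barcode in the fourth sequence is missing one character. I have already tried biopython and regex, but both were too slow because I have 5k barcodes. Is there a fast solution in Python, grep, awk, or anything else? Thanks.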

Solution

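Using GNU awk (this counts exact occurrences of one example barcode, "ATTCGACCGATAGG", in each space-delimited sequence):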

 awk '{ for (i=1;i<=NF;i++) { fnd=0;subs=$i;while (match(subs,"ATTCGACCGATAGG")) { subs=substr(subs,RSTART+RLENGTH);fnd++ } if (fnd <=2) { print $i } } }' file

Explanation:

 awk '{ for (i=1;i<=NF;i++) {                           # Loop over each space-delimited field
         fnd=0;                                         # Reset the match counter for this field
         subs=$i;                                       # Work on a copy of the field
         while (match(subs,"ATTCGACCGATAGG")) {         # Repeat while an exact match remains in subs
           subs=substr(subs,RSTART+RLENGTH);            # Keep only the text after the match just found
           fnd++;                                       # Count the match
         } 
         if (fnd <=2) { 
            print $i                                    # Print the field if it matched at most twice (use fnd == 2 for exactly twice)
         }
        } 
       }' file
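
The awk command above matches the barcode exactly, so it does not cover the "<=2 mismatches" part of the question. As a rough illustration only, the following Python sketch counts non-overlapping occurrences of a barcode while tolerating up to two substitutions per occurrence (Hamming distance). The function names `hamming_hits` and `reads_with_two_hits` and the barcode `GATTACAGTC` are made up for this example and do not come from the question. The sketch assumes mismatches are substitutions only, so a barcode with a missing character (as in the fourth example sequence) would need an edit-distance approach instead. It also takes the leftmost acceptable window greedily, and it is plain Python, so it has not been tuned for 400,000 reads and 5k barcodes.

```python
def hamming_hits(read, barcode, max_mismatches=2):
    """Return start positions of non-overlapping windows of `read` that
    differ from `barcode` in at most `max_mismatches` positions."""
    hits = []
    n, m = len(read), len(barcode)
    i = 0
    while i <= n - m:
        mismatches = 0
        for a, b in zip(read[i:i + m], barcode):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, give up on this window
        if mismatches <= max_mismatches:
            hits.append(i)
            i += m  # jump past the hit so occurrences cannot overlap
        else:
            i += 1
    return hits


def reads_with_two_hits(reads, barcode, max_mismatches=2):
    """Keep only the reads in which the barcode occurs exactly twice."""
    return [r for r in reads
            if len(hamming_hits(r, barcode, max_mismatches)) == 2]


if __name__ == "__main__":
    barcode = "GATTACAGTC"  # hypothetical barcode, for illustration only
    reads = [
        "TTTTGATTACAGTCCCCCGATTACAGTCTTTT",  # two exact copies
        "TTTTGATTACAGTCCCCCGTTTACAGACTTTT",  # second copy has 2 substitutions
        "TTTTGATTACAGTCCCCCCCCC",            # only one copy
    ]
    for read in reads_with_two_hits(reads, barcode):
        print(read)
```

To test several barcodes, call `reads_with_two_hits` once per barcode. For the full data set, a compiled or indexed approach is likely to be needed for acceptable speed.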
