
Calculating gene length from coordinates

I received a list of thousands of genes with coordinates from a colleague. It looks like this:

NPHP4   Nephronophthisis 4,606966 (3),Autosomal recessive; Senior-Loken syndrome 4,606996 (3),Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN    Deafness,autosomal recessive 36,609006 (3),Autosomal recessive; Deafness,neurosensory,without vestibular involvement,autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease,recessive intermediate C,615376 (3),Autosomal recessive; Spinal muscular atrophy,distal,autosomal recessive,4,611067 (3),Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7   Parkinson disease 7,autosomal recessive early-onset,606324 (3),Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036

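Each coordinate in the third column begins with the chromosome number, followed by the start position and the end position, separated by ":". If a gene has multiple regions, they are separated by ",":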

1:6021825:6022054,1:6029125:6029336
1:6508675:6509172,1:6511860:6512157
1:6529360:6529539,1:6557350:6557420
1:8025355:8025499,1:8030930:8031036
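
To make the format concrete, here is a minimal sketch that parses one coordinate field into tuples. The function name `parse_regions` is hypothetical (it does not appear in the original post), and the sketch assumes every region has exactly three colon-separated parts.

```python
def parse_regions(field):
    """Parse "1:6021825:6022054,1:6029125:6029336" into (chrom, start, end) tuples."""
    regions = []
    for region in field.split(","):
        chrom, start, end = region.split(":")
        # Keep the chromosome as a string so that names like "X" would also work.
        regions.append((chrom, int(start), int(end)))
    return regions


print(parse_regions("1:6021825:6022054,1:6029125:6029336"))
# [('1', 6021825, 6022054), ('1', 6029125, 6029336)]
```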

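I need to compute the length of the regions, i.e. the difference between the end position and the start position for each gene (each row), or the sum of those differences if a gene has multiple regions. The number of regions differs from row to row. I tried to do this in Excel, but the number of fragments is so large that in some cases they cannot even be displayed. Is there a way to compute this value for each row, for example with regular expressions?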

I would like the output as a fourth column. For example, if the third column is:

1:1167623:1168684

I would expect:

1:1167623:1168684 1061

If the column contains the coordinates:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770

I would expect:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770 636

Many thanks.

Solution

This can be done fairly simply in Python. Commented code is provided below.

d = """\
NPHP4   Nephronophthisis 4,606966 (3),Autosomal recessive; Senior-Loken syndrome 4,606996 (3),Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN    Deafness,autosomal recessive 36,609006 (3),Autosomal recessive; Deafness,neurosensory,without vestibular involvement,autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease,recessive intermediate C,615376 (3),Autosomal recessive; Spinal muscular atrophy,distal,autosomal recessive,4,611067 (3),Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7   Parkinson disease 7,autosomal recessive early-onset,606324 (3),Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
"""

gene_rows = d.splitlines()

for gene_row in gene_rows:
    # Name like "NPHP4"
    gene_name = gene_row.split()[0]
    # List like ["1:6021825:6022054", "1:6008105:6008352", ...]
    regions = gene_row.split()[-1].split(",")
    # Counter to hold our total gene length.
    gene_length = 0
    for region in regions:
        # Split "1:6021825:6022054" into "1", "6021825", and "6022054"
        chromosome, start, end = region.split(":")
        # Update the gene length counter with this region's length.
        region_length = int(end) - int(start)
        gene_length += region_length
    print(gene_name, gene_length)
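
The question asks for the length as a fourth column and mentions regular expressions. The following is a hedged sketch of one way to do both, not part of the original answer. The helper name `append_length` and the pattern `REGION` are made up for illustration. The code assumes the coordinate field is always the last whitespace-separated token on a line and that a tab is an acceptable separator for the new column.

```python
import re

# Matches one region such as "1:6021825:6022054" and captures start and end.
REGION = re.compile(r"[^:,\s]+:(\d+):(\d+)")


def append_length(line):
    """Return the line with the summed region length appended as a new column."""
    line = line.rstrip("\n")
    if not line.strip():
        # Leave blank lines untouched.
        return line
    # Assumption: the coordinates are the last whitespace-separated token.
    coords = line.split()[-1]
    total = sum(int(end) - int(start) for start, end in REGION.findall(coords))
    return f"{line}\t{total}"


print(append_length("FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770"))
```

Applied to the FOOBAR row, the appended value is 375 + 81 + 180 = 636, which matches the figure given in the question. To process a whole file, the same function could be called on each line and the results written to a new file.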
