How to calculate gene length from coordinates
I received a list of thousands of genes with coordinates from a colleague. It looks like this:
NPHP4 Nephronophthisis 4,606966 (3),Autosomal recessive; Senior-Loken syndrome 4,606996 (3),Autosomal recessive 1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN Deafness,autosomal recessive 36,609006 (3),Autosomal recessive; Deafness,neurosensory,without vestibular involvement,autosomal dominant (3) 1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease,recessive intermediate C,615376 (3),Autosomal recessive; Spinal muscular atrophy,distal,autosomal recessive,4,611067 (3),Autosomal recessive 1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7 Parkinson disease 7,autosomal recessive early-onset,606324 (3),Autosomal recessive 1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
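The coordinates in the third column start with the chromosome number, followed by the start position and the end position, separated by ":". If a gene has multiple regions, they are separated by ",":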
1:6021825:6022054,1:6029125:6029336
1:6508675:6509172,1:6511860:6512157
1:6529360:6529539,1:6557350:6557420
1:8025355:8025499,1:8030930:8031036
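I need to calculate the length of the regions, i.e. the difference between the end position and the start position for each gene (each row), or the sum of those differences if a gene has multiple regions. However, the number of regions is not the same in every row. I tried to do this in Excel, but the number of fragments is too large, and in some cases they cannot even be displayed. Is there a way to calculate this value for each row using some regular expression?
I would like the output as a fourth column. For example, if the third column is: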
1:1167623:1168684
I would like:
1:1167623:1168684 1061
If the column has the coordinates:
1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
I would like:
1:11907145:11907520,1:11906035:11906116,1:11907590:11907770 636
Thank you very much.
Solution
This can be done fairly simply with Python. I have provided commented code below.
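The sample rows are embedded in a string, with one extra test row (FOOBAR) that uses the three-region example from the question: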
d = """\
NPHP4 Nephronophthisis 4,606966 (3),Autosomal recessive; Senior-Loken syndrome 4,606996 (3),Autosomal recessive 1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN Deafness,autosomal recessive 36,609006 (3),Autosomal recessive; Deafness,neurosensory,without vestibular involvement,autosomal dominant (3) 1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease,recessive intermediate C,615376 (3),Autosomal recessive; Spinal muscular atrophy,distal,autosomal recessive,4,611067 (3),Autosomal recessive 1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7 Parkinson disease 7,autosomal recessive early-onset,606324 (3),Autosomal recessive 1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
"""
gene_rows = d.splitlines()
for gene_row in gene_rows:
    # Name like "NPHP4"
    gene_name = gene_row.split()[0]
    # List like ["1:6021825:6022054", "1:6008105:6008352", ...]
    regions = gene_row.split()[-1].split(",")
    # Counter to hold our total gene length.
    gene_length = 0
    for region in regions:
        # Split "1:6021825:6022054" into "1", "6021825", and "6022054"
        chromosome, start, end = region.split(":")
        # Update the gene length counter with this region's length.
        region_length = int(end) - int(start)
        gene_length += region_length
    print(gene_name, gene_length)
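Since the question asks for the length as a fourth column and mentions regular expressions, here is a minimal sketch of that variant. The names REGION_RE, total_region_length and append_length_column are made up for illustration, and the tab separator is an assumption about the file. Like the code above, it assumes the coordinates are the last whitespace-separated field of each row.

```python
import re

# Matches one "chromosome:start:end" region, e.g. "1:6021825:6022054".
# The chromosome part is anything without commas, colons or whitespace,
# so names such as "X" would also match (an assumption about the data).
REGION_RE = re.compile(r"([^,:\s]+):(\d+):(\d+)")


def total_region_length(coordinates: str) -> int:
    """Sum (end - start) over every region in a comma-separated string."""
    return sum(
        int(end) - int(start)
        for _chromosome, start, end in REGION_RE.findall(coordinates)
    )


def append_length_column(row: str, sep: str = "\t") -> str:
    """Return the row with the total region length appended as a last column.

    The separator defaults to a tab, which is an assumption about the file.
    """
    coordinates = row.split()[-1]
    return f"{row}{sep}{total_region_length(coordinates)}"


# Hypothetical one-region row built from the question's first example.
print(append_length_column("GENE1\t1:1167623:1168684"))
```

For the single-region example from the question (1:1167623:1168684) the computed length is 1061, and for the three-region example it is 636, which matches the values the question asks for. To process a whole file, the same function could be applied to each line and the results written out.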