
Pyparsing: extracting substrings of variable length, variable content, and variable whitespace

How to extract substrings of variable length, variable content, and variable whitespace with Pyparsing

Here is an example that extracts the patient data and any matching Gleason data.

from pyparsing import Combine, Group, Optional, Word, nums

num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
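# "SCORE:" is upper case because string literals match case-sensitively
# and the sample reports spell it that way
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
# Comparing a string with a parser element is true when the whole string parses
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE:  3 + 3 = 6' == gleason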

patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11  S11-4444 20/111-22-3333' == patientData

partMatch = patientData("patientData") | gleason("gleason")

# `data` is assumed to hold the full text of the report file
lastPatientData = None
for match in partMatch.searchString(data):
    if match.patientData:
        # Remember the most recent patient header
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            # A Gleason score appeared before any patient header
            print("bad!")
            continue
        print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
            lastPatientData.patientData, match.gleason))

Prints:

01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
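
The question notes that the two Gleason numbers always add up to the third, and that spelling conventions vary. The following is a minimal, self-contained sketch of the same search approach that adds both ideas: it matches the keywords case-insensitively and drops any candidate whose arithmetic does not hold. The names `SAMPLE` and `extract_scores` are hypothetical, the sample text is abbreviated from the two records in the question, and skipping inconsistent scores (rather than reporting them) is an assumption, not part of the original answer.

```python
from pyparsing import CaselessLiteral, Combine, Group, Optional, Word, nums

num = Word(nums)
accession_date = Combine(num + "/" + num + "/" + num)("accDate")
accession_number = Combine("S" + num + "-" + num)("accNum")
patient_number = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
patient_data = Group(accession_date + accession_number + patient_number)

# Case-insensitive keywords, so "Gleason score:" and "GLEASON SCORE:" both match
gleason = Group(
    CaselessLiteral("GLEASON")
    + Optional(CaselessLiteral("SCORE") + ":")
    + num("left") + "+" + num("right") + "=" + num("total")
)

part_match = patient_data("patientData") | gleason("gleason")

# Abbreviated from the two sample records in the question
SAMPLE = """\
01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:
                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
"""


def extract_scores(text):
    """Return (accDate, accNum, patientNum, left, right, total) tuples."""
    rows = []
    last_patient = None
    for match in part_match.searchString(text):
        if match.patientData:
            # Remember the most recent patient header
            last_patient = match.patientData
        elif match.gleason and last_patient is not None:
            g = match.gleason
            left, right, total = int(g.left), int(g.right), int(g.total)
            if left + right != total:
                # Assumption: an inconsistent sum is a false positive or a typo
                continue
            rows.append((last_patient.accDate, last_patient.accNum,
                         last_patient.patientNum, left, right, total))
    return rows


for row in extract_scores(SAMPLE):
    print(row)
```

Returning plain tuples instead of printing inside the loop makes it straightforward to write the results to a CSV file or to count how many records produced no score.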

Question

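I need to extract Gleason scores from a flat file of prostatectomy final diagnostic reports. These scores always contain the word Gleason and two numbers that add up to a third number. Humans have been typing these in for more than two decades, so the reports use a variety of conventions for whitespace and modifiers. Below is my Backus-Naur form so far, along with two sample records. For prostatectomies alone, we are looking at more than a thousand cases.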

I am using pyparsing because I am learning Python and have no fond memories of my limited exposure to writing regular expressions.

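My question: how can I pluck out these Gleason scores without parsing all of the other optional data that may or may not be present in the final diagnosis?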

num ::= Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" + num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>


01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR                  
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              
                                   ADDITIONAL FINDINGS:  HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,ACUTE AND CHRONIC INFLAMMATION,INTRADUCTAL EXTENSION OF INVASIVE    
                                      CARCINOMA

                                   PATHOLOGIC STAGE:  pT3b N1 MX

                               B.  LYMPH NODES,RIGHT PELVIC,EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).

                               C.  LYMPH NODES,LEFT PELVIC,EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                               PATHOLOGIC STAGE (pTNM):  pT2c NX.
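
Given the record layout described by the grammar and the two sample records above, one possible way to avoid parsing the free text is to locate only the record headers, slice the text between consecutive headers, and search each slice for Gleason scores. The sketch below assumes that every record starts with the date, accession number, and medical record number, and that this pattern does not occur inside the free text. The names `header`, `split_records`, and `scores_by_record` are hypothetical. It does not try to tell wanted scores from those in a comment or staging section.

```python
from pyparsing import CaselessLiteral, Combine, Optional, Word, nums

num = Word(nums)
header = (
    Combine(num + "/" + num + "/" + num)("accDate")
    + Combine("S" + num + "-" + num)("accNum")
    + Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
)
gleason = (
    CaselessLiteral("GLEASON")
    + Optional(CaselessLiteral("SCORE") + ":")
    + num("left") + "+" + num("right") + "=" + num("total")
)


def split_records(text):
    """Yield (header_tokens, final_dx_text) for each record in text."""
    # scanString reports offsets into the tab-expanded string, so expand
    # tabs first to keep the slices below aligned with those offsets
    text = text.expandtabs()
    hits = list(header.scanString(text))
    for i, (tokens, start, end) in enumerate(hits):
        # The body runs from the end of this header to the start of the next
        next_start = hits[i + 1][1] if i + 1 < len(hits) else len(text)
        yield tokens, text[end:next_start]


def scores_by_record(text):
    """Map each accession number to a list of (left, right, total) tuples."""
    result = {}
    for tokens, body in split_records(text):
        result[tokens.accNum] = [
            (int(m.left), int(m.right), int(m.total))
            for m in gleason.searchString(body)
        ]
    return result
```

Because every header produces an entry, a record without any recognizable score shows up with an empty list, which makes it easy to flag cases for manual review.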

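Full disclosure: I am a physician doing research, and this is my first time using Python. I have read Lutz's "Learning Python" and Shaw's "Learn Python the Hard Way", and I have worked through a variety of problems. I have reviewed many pyparsing-related questions on this forum and on the pyparsing Wiki, and I bought and read Mr. McGuire's "Getting Started with Pyparsing". Maybe I am asking a question when I should really be told that I am standing in "a death spiral of frustration, so common when you have to write a parser" (McGuire, 17)? I don't know. So far, I am just glad to be working on what might actually turn into a real project.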
