List comprehension problem with regular expressions, multiple functions, and f-strings

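I have a problem, actually two. First, I am working on an NLP project. I need to write a script that tokenizes words, replaces non-standard POS (part-of-speech) tags with Universal POS tags, and extracts named entities with their tags. I need to write the results to files. There are two files. The first only contains the words with their POS and NER (named entity recognition) tags. For the first file I get this kind of output: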

Consuela    NOUN    B-PERSON
Washington  NOUN    B-ORGANIZATION
.   .   O
a   DET O
longtime    ADJ O
House   NOUN    B-ORGANIZATION
staffer NOUN    O
and CONJ    O
an  DET O
expert  NOUN    O
in  ADP O
securities  NOUN    O
laws    NOUN    O
.   .   O
is  VERB    O
a   DET O
leading VERB    O
candidate   NOUN    O
to  PRT O
be  VERB    O
chairwoman  NOUN    O
of  ADP O
the DET O
Securities  NOUN    B-ORGANIZATION
and CONJ    O
Exchange    NOUN    B-ORGANIZATION
Commission  NOUN    I-ORGANIZATION

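This is exactly what I want, but now I have to convert the NER tags to standard NER tags, using a dictionary of key/value pairs (standard/non-standard). Here the POS tags have already been converted to the universal format (NOUN, DET, ADP, etc.). But when I apply the same method to my second file to convert the NER tags, I get: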

Consuela    NOUN    BIBIBIB-LOCLOCPERSPERSORGORGPERSON
Washington  NOUN    BIBIBIB-LOCLOCPERSPERSORGORGORGANIZATION
.   .   BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
a   DET BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
longtime    ADJ BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
House   NOUN    BIBIBIB-LOCLOCPERSPERSORGORGORGANIZATION
staffer NOUN    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
and CONJ    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
an  DET BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
expert  NOUN    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
in  ADP BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
securities  NOUN    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
laws    NOUN    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
.   .   BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
is  VERB    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
a   DET BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
leading VERB    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
candidate   NOUN    BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG
to  PRT BIBIB-LOCLOCPERSPERSORGBIBIB-LOCLOCPERSPERSORGORG

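This is definitely not what I want. It happens when I switch my dictionary (the correspondence table between NER tag sets) from the first version: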

dict_ner = {'ORG':['I-ORGANIZATION','B-ORGANIZATION','FACILITY'],'PERS':['I-PERSON','B-PERSON'],'LOC':['I-LOCATION','B-LOCATION','I-GPE','B-GPE'],'MISC':['DATE','TIME','MONEY','PERCENT']}

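to the second one, because I want to keep the I or B prefix (I-PERS, B-PERS, etc.), which is important for my project. With this dictionary the results are terrible. My regex code builds a dictionary whose keys are compiled regular expressions (matching the non-standard tags) and whose values are the replacements.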

dict_ner = {'I-ORG':'I-ORGANIZATION','B-ORG':'B-ORGANIZATION','ORG':'FACILITY','I-PERS':'I-PERSON','B-PERS':'B-PERSON','I-LOC':'I-LOCATION','B-LOC':'B-LOCATION','B-GPE']}
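
For comparison, keeping the B-/I- prefix does not necessarily require a regex over the whole line. The sketch below is only an illustration under stated assumptions: `groups`, `lookup`, and `standardize` are hypothetical names, and the label subset is hand-picked from the first dictionary. It maps the label part of a tag and leaves the prefix untouched.

```python
# Standard label -> non-standard labels, without B-/I- prefixes
# (illustrative subset, not the full table from the script).
groups = {
    "ORG": ["ORGANIZATION", "FACILITY"],
    "PERS": ["PERSON"],
    "LOC": ["LOCATION", "GPE"],
}

# Invert once: non-standard label -> standard label.
lookup = {old: new for new, olds in groups.items() for old in olds}


def standardize(tag):
    """Map 'B-PERSON' to 'B-PERS'; leave 'O' and unknown labels unchanged."""
    prefix, sep, label = tag.partition("-")
    if sep and prefix in ("B", "I"):
        # Keep the chunk prefix, translate only the label after the hyphen.
        return f"{prefix}-{lookup.get(label, label)}"
    # No B-/I- prefix: translate the bare label if it is known.
    return lookup.get(tag, tag)
```

Applied to the third column only (for example after `line.split('\t')`), this avoids touching the word and POS columns.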

Here is my code. Please ignore the French comments.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from contextlib import ExitStack
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk import ne_chunk,pos_tag
from nltk.chunk import tree2conlltags
import re

# dictionnaire : table de correspondance des étiquètes NER (conll to standard)
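dict_ner = {'I-ORG':'I-ORGANIZATION','B-ORG':'B-ORGANIZATION','ORG':'FACILITY','I-PERS':'I-PERSON','B-PERS':'B-PERSON','I-LOC':'I-LOCATION','B-LOC':'B-LOCATION','B-GPE']}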

# loading correspondance table between Penn Treebank POS and standard POS from file
def load_pos_table():

    try:
         
        with open('POSTags_PTB_Universal_Linux.txt','r',encoding='utf-8') as universal:

            # on commence par charger notre dictionnaire avec la table des étiquettes POS Penn Treebank,POS NLTK
            dict_pos = {}
            for sent in universal.readlines():
                for line in sent.splitlines():
                    cut = line.strip().split()
                    dict_pos[cut[1]] = dict_pos.get(cut[1],list()) + [cut[0]]

        return dict_pos

    except Exception as erreur:
        print(f'load_pos_table : {erreur}')

def convert_format(line,dic):

    try:
        rx_dctvals = {re.compile("|".join(sorted([to_regex(v) for v in val],key=len,reverse=True))):key for key,val in dic.items()}

        #Version 3.8+
        return [line := rx.sub(repl.replace('\\','\\\\'),line) for rx,repl in rx_dctvals.items()][-1]
        """
        version 3.7-
        for rx,repl in rx_dctvals.items():
            line = rx.sub(repl.replace('\\','\\\\'),line)
        return line
        """

    except Exception as erreur:
        print(f'convert_format: {erreur}')

def to_regex(x):

    r = []
    if x[0].isalnum():
        r.append(r'(?<![^\W])')
    else:
        if any(l.isalnum() for l in x):
            r.append(r'\B')

    r.append(re.escape(x))
    
    if x[-1].isalnum():
        r.append(r'\b')
    else:
        if any(l.isalnum() for l in x):
            r.append(r'\B')
    return "".join(r)

def extract_entities(doc):
    return list(map(lambda sent: tree2conlltags(ne_chunk(pos_tag(word_tokenize(sent)))),sent_tokenize(doc)))

def main():

    try:

        with ExitStack() as stack:
            
            file = stack.enter_context(open('formal-tst.NE.key.04oct95_sample.txt',encoding='utf-8'))
            # fichier d'extraction des entités nommées avec étiquettes non standards 
            result_file_ner = stack.enter_context(open('wsj_0010_sample.txt.ne.nltk','w',encoding='utf-8'))
            # fichier avec étiquettes standards NER
            result_file_ner_standard = stack.enter_context(open('wsj_0010_sample.txt.ne.standard','w',encoding='utf-8'))

            pos_table = load_pos_table()
            content = file.read()

            [[result_file_ner.write(convert_format(f'{name}\t{tag}\t{ner}\n',pos_table)) for name,tag,ner in line] for line in extract_entities(content)]
            [[result_file_ner_standard.write(convert_format(f'{name}\t{tag}\t{ner}\n',{**pos_table,**dict_ner})) for name,tag,ner in line] for line in extract_entities(content)]

    except Exception as error:
        print(f'main error : {error}')

if __name__ == '__main__':
    main()

In addition, I am not sure about these two list comprehensions:

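            [[result_file_ner.write(convert_format(f'{name}\t{tag}\t{ner}\n',pos_table)) for name,tag,ner in line] for line in extract_entities(content)]
            [[result_file_ner_standard.write(convert_format(f'{name}\t{tag}\t{ner}\n',{**pos_table,**dict_ner})) for name,tag,ner in line] for line in extract_entities(content)]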

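extract_entities(content) returns, for each sentence, a list of 3-element tuples (word, POS tag, and NER tag, or O when the word is not an entity). I am not sure whether my problem is related to my regex code. I don't know. I would be grateful if you could help me.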
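
Since the nested comprehensions are used only for their side effect (the `write` calls), plain loops may be easier to read and debug. This is a minimal sketch, not the script itself: the sample triples are hand-written stand-ins for the output of `extract_entities(content)`, an in-memory buffer stands in for the result file, and `write_rows` is a hypothetical helper.

```python
import io

# Hand-written stand-in for extract_entities(content): one list per sentence,
# each holding (word, POS, NER) triples. Values are illustrative only.
sentences = [
    [("Consuela", "NNP", "B-PERSON"), ("Washington", "NNP", "B-ORGANIZATION")],
    [("a", "DT", "O"), ("longtime", "JJ", "O")],
]


def write_rows(sentences, out, convert=lambda row: row):
    """Write one tab-separated row per token and return the number of rows.

    `convert` post-processes each row; by default rows are written unchanged.
    """
    count = 0
    for sentence in sentences:
        for word, pos, ner in sentence:
            out.write(convert(f"{word}\t{pos}\t{ner}\n"))
            count += 1
    return count


buffer = io.StringIO()  # stands in for the opened result file
written = write_rows(sentences, buffer)
```

In the real script, `convert` could be something like `lambda row: convert_format(row, pos_table)`, and `out` the file object opened in `main()`.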

Solution

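I found the solution. When the value is a string, my list comprehension iterates over it character by character, so I ended up with one regex per character. I did not find anything fancier, so I changed this line: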

rx_dctvals = {re.compile("|".join(sorted([to_regex(v) if not isinstance(val,str) else to_regex(''.join([t for t in val])) for v in val],key=len,reverse=True))):key for key,val in dic.items()}

What a silly mistake. I lost a day over it.
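
The cause can be reproduced in isolation: iterating over a `str` yields its characters, so a bare string value produces one alternative per character. The sketch below shows this and one possible way to normalize values before building the patterns. It is an illustration under assumptions, not the original code: `as_list`, `build_patterns`, `convert`, and `sample_table` are hypothetical names, and the lookarounds and callable replacement are choices of this sketch rather than the `to_regex` logic.

```python
import re


def as_list(value):
    """Wrap a lone string in a list so iteration yields whole tags, not characters."""
    return [value] if isinstance(value, str) else list(value)


def build_patterns(table):
    """Return {compiled alternation: replacement} for a {replacement: originals} table."""
    patterns = {}
    for replacement, originals in table.items():
        # Longest alternatives first, so a longer tag wins over a shorter one.
        alternatives = sorted(as_list(originals), key=len, reverse=True)
        body = "|".join(re.escape(tag) for tag in alternatives)
        # Lookarounds keep a tag from matching inside a longer token.
        patterns[re.compile(rf"(?<![\w-])(?:{body})(?![\w-])")] = replacement
    return patterns


def convert(line, table):
    """Apply every pattern in turn and return the rewritten line."""
    for pattern, replacement in build_patterns(table).items():
        # A callable replacement sidesteps backslash escaping in the template.
        line = pattern.sub(lambda _match, r=replacement: r, line)
    return line


# Illustrative table mixing string and list values.
sample_table = {"B-PERS": "B-PERSON", "I-ORG": "I-ORGANIZATION", "ORG": ["FACILITY"]}

chars = [c for c in "B-PERSON"]  # what iterating a bare string produces
converted = convert("Consuela\tNOUN\tB-PERSON\n", sample_table)
```

Normalizing once in `as_list` lets the table hold either a single string or a list per key without special cases in the comprehension.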
