Tag all occurrences in a text, with their positions

I have a list of strings and two lookup tables.

Say the text is: "Barrack Obama was president of the United States"

LookupA: ["Barrack","Barrack Obama"]
LookupB: ["United","United States","president"]

I need a computationally cheap, Pythonic way to tag every occurrence together with its position.

Result: [("Barrack",0,"A"),("Barrack Obama",0,"A"),("president",18,"B"),("United",35,"B"),("United States",35,"B")]

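My current approach is very inefficient. I suspect this could be done quickly with a trie, but I don't know how to use one on a stream of text in a Pythonic way. If it simplifies the problem, tagging at the word level (rather than the sub-word level) is also enough for my use case.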

My inefficient code is below:

annotations_all = []
for text_index,text in enumerate(texts):
    annotations = []
    found_uniq_entities_tup = {}

    for entity in lookupA:
        if entity not in found_uniq_entities_tup:
            start_index = str(text).find(entity)
            if not start_index == -1:
                found_uniq_entities_tup[entity] = 'A'

    for entity in lookupB:
        if entity not in found_uniq_entities_tup:
            start_index = str(text).find(entity)
            if not start_index == -1:
                found_uniq_entities_tup[entity] = 'B'

    def find_all(super_string: str,sub_string: str):
        start = 0
        while True:
            start = super_string.find(sub_string,start)
            if start == -1:
                return
            yield start
            start += len(sub_string)

    # Find all mentions of all found entities
    for key in found_uniq_entities_tup:
        start_index_list = find_all(str(text),str(key))
        for start_index in start_index_list:
            if not start_index == -1:
                annotations.append({"start": start_index,"end": start_index + len(key) - 1,"entity": key,"label": found_uniq_entities_tup[key]})
    annotations_all.append(annotations)

Any help is appreciated!

Solution

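You can combine all the keywords into a regular expression and map each match to its label through a dictionary. The only catch is that some of your keywords contain shorter keywords, and a single pattern reports only non-overlapping matches, so "Barrack" and "Barrack Obama" would not both be found at position 0. This can be handled by building a separate regular expression for each keyword word count and checking the text against each group of patterns.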

Example:

import re
tags = {"barrack":"A","barrack obama":"A","united":"B","united states":"B","president":"B"}

patterns = dict()
for tag in tags: # group keywords by number of words
    patterns.setdefault(tag.count(" "),[]).append(tag)
patterns = [re.compile(r"\b("+"|".join(map(re.escape,tn))+r")\b",flags=re.I)
             for tn in patterns.values()] # one regular expression per group (keywords escaped)

# generator function to find/return tagged words
def tagWords(text):
    for pattern in patterns: # lookup for each keyword group
        for match in pattern.finditer(text):    # go through matches
            word = match.group()                # matched keyword
            pos  = match.start()                # position in string
            yield (word,pos,tags[word.lower()]) # output tagged word

Output:

text = "Barrack Obama was president of the United States"
for tag in tagWords(text): print(tag)
('Barrack', 0, 'A')
('president', 18, 'B')
('United', 35, 'B')
('Barrack Obama', 0, 'A')
('United States', 35, 'B')
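
To connect this back to the question's data, here is a minimal sketch that builds the label dictionary from the two lookup lists and emits annotations in the dictionary format used by the question's code (inclusive "end" index). The names `build_patterns`, `annotate`, `lookup_a` and `lookup_b` are made up for illustration, and the sketch assumes keywords are separated by single spaces and that no keyword appears in both lookups.

```python
import re

def build_patterns(tags):
    """Group keywords by word count and compile one regex per group."""
    groups = {}
    for keyword in tags:
        # Escape each keyword so regex metacharacters are matched literally.
        groups.setdefault(keyword.count(" "), []).append(re.escape(keyword))
    return [re.compile(r"\b(?:" + "|".join(words) + r")\b", flags=re.I)
            for words in groups.values()]

def annotate(text, tags, patterns):
    """Return annotation dicts for one text, sorted by start position."""
    found = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            found.append({
                "start": match.start(),
                "end": match.end() - 1,  # inclusive end, as in the question's code
                "entity": match.group(),
                "label": tags[match.group().lower()],
            })
    return sorted(found, key=lambda a: (a["start"], a["end"]))

# Lookup lists from the question (hypothetical variable names).
lookup_a = ["Barrack", "Barrack Obama"]
lookup_b = ["United", "United States", "president"]

# Lower-case keys so case-insensitive matches can be mapped back to a label.
tags = {kw.lower(): "A" for kw in lookup_a}
tags.update({kw.lower(): "B" for kw in lookup_b})
patterns = build_patterns(tags)  # compile once, reuse for every text

texts = ["Barrack Obama was president of the United States"]
annotations_all = [annotate(text, tags, patterns) for text in texts]
```

Compiling the patterns once and reusing them across all texts avoids repeating the per-keyword `find` scans from the question; each text is then scanned once per word-count group.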