spaCy: how do I create a pattern to match a string captured through SpeechRecognition?


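First time asking for help here, I hope everything is clear! The facts: I am building an application for a role-playing game (GURPS) that tracks the damage players deal to enemies. The application itself is finished, and I use PySimpleGUI for the graphical interface. The next step is to integrate voice commands, so that input comes from voice instead of the keyboard (there are several inputs, so why not?). I use the SpeechRecognition library to capture the voice input and store the user's input in a string variable. I am now working on the second part: extracting the inputs from that string. The last part is to store those inputs in a dictionary and use it as the input to my functions.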

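What I am trying to achieve: I am having a lot of trouble designing the matches with spaCy. Since I don't think there is any dataset I could use to train an NN or ML model for my task, I am using rule-based matching. This means every sentence has to be structured in a fixed way so that the tokens can be extracted the way I want. An example sentence is: "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage."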

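The inputs I need to extract are:

enemy hit: zombie one (the enemy that was hit; the dataframe I create may contain zombie_1, zombie_2, etc. In general there are several zombies, each with a sequence number. I am still trying to work out whether it would be better to name them zombie1, zombie2, ...)
vulnerability: its number
location hit: the head in this case, but it could be "right arm", which I cannot extract because tokenization treats it as 2 tokens instead of 1
large piercing: the type of attack (the simplest cases are "cutting" or "crushing", a single word that is easy to pick up, but I have not found any way to extract multi-word types as one unit because of how tokenization works)
damage 8: the damage dealt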

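The problem: I am currently using DependencyMatcher. The main issues are:

Because tokenization works on single words, in the cases above I lose part of the phrase (right arm yields only arm; large piercing yields only piercing).
I cannot generalize my patterns, and I am not sure DependencyMatcher is the right tool. I am working in Italian, but I am testing in English for simplicity. My current English script is: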

import spacy
from spacy.matcher import DependencyMatcher


# The "def" line was not part of the post; the function name is a placeholder.
def extract_inputs(string):
    # The post loads "en_core_news_sm", which is not a spaCy model name. Use
    # "en_core_web_sm" for English or "it_core_news_sm" for Italian.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(string)
    # for token in doc:
    #     print(token.text, token.dep_)

    # Two lists with all the words for body locations and attack types, used to
    # find tokens via "LOWER" or "LEMMA" (English first, then Italian).
    body_list_words = ["Body", "Head", "Arm_right", "Arm_left", "Leg_right", "Leg_left", "Hand_right", "Hand_left", "Foot_right", "Foot_left", "Groin", "Skull", "Vitals", "Neck", "corpo", "testa", "braccio destro", "braccio sinistro", "gamba destra", "gamba sinistra", "mano destra", "mano sinistra", "piede destro", "piede sinistro", "testicoli", "cranio", "vitali", "collo"]
    attack_type_words = ["cutting", "impaling", "crushing", "small penetration", "penetration", "big penetration", "huge penetration", "burning", "explosive", "tagliente", "impalamento", "schiacciamento", "penetrazione minore", "piccola penetrazione", "penetrazione", "penetrazione maggiore", "enorme penetrazione", "infuocati", "esplosivi"]

    ###############
    # Find the matches
    ###############
    matcher = DependencyMatcher(nlp.vocab)
    # Every node except the anchor needs a REL_OP. The post only gave ">" for
    # obj_verbo and ">>" for location; the other operators are assumptions.
    # Note: "obj" is the UD label used by the Italian models; the English
    # models label direct objects "dobj".
    patterns = [
        # Start from the verb
        {"RIGHT_ID": "anchor_verbo", "RIGHT_ATTRS": {"POS": "VERB"}},
        # Object of the verb (word: enemy)
        {"LEFT_ID": "anchor_verbo", "REL_OP": ">", "RIGHT_ID": "obj_verbo", "RIGHT_ATTRS": {"DEP": "obj"}},
        # Name of the enemy: zombie1
        {"LEFT_ID": "obj_verbo", "REL_OP": ">", "RIGHT_ID": "type_enemy", "RIGHT_ATTRS": {"DEP": "nmod"}},
        # Word: vulnerability (the attribute key is "LEMMA", not "LemmA")
        {"LEFT_ID": "anchor_verbo", "REL_OP": ">>", "RIGHT_ID": "vulnerability", "RIGHT_ATTRS": {"LEMMA": "vulnerability"}},
        # Number associated with vulnerability
        {"LEFT_ID": "vulnerability", "REL_OP": ">", "RIGHT_ID": "num_vulnerability", "RIGHT_ATTRS": {"DEP": "nummod"}},
        # Body location hit ("LOWER" only matches lowercase strings, so lowercase the list)
        {"LEFT_ID": "anchor_verbo", "REL_OP": ">>", "RIGHT_ID": "location", "RIGHT_ATTRS": {"LOWER": {"IN": [w.lower() for w in body_list_words]}}},
        # Word: attack, used to find the type of attack
        {"LEFT_ID": "anchor_verbo", "REL_OP": ">>", "RIGHT_ID": "attack", "RIGHT_ATTRS": {"POS": "NOUN"}},
        # Type of attack
        {"LEFT_ID": "attack", "REL_OP": ">", "RIGHT_ID": "type_attack", "RIGHT_ATTRS": {"LEMMA": {"IN": attack_type_words}}},
        # Word: damage, used to extract the number. RIGHT_ATTRS was cut off in
        # the post; matching the lemma "damage" somewhere after "attack" is an assumption.
        {"LEFT_ID": "attack", "REL_OP": ".*", "RIGHT_ID": "word_damage", "RIGHT_ATTRS": {"LEMMA": "damage"}},
        # The damage number
        {"LEFT_ID": "word_damage", "REL_OP": ">", "RIGHT_ID": "num_damage", "RIGHT_ATTRS": {"DEP": "nummod"}},
    ]

    matcher.add("Inputs1", [patterns])
    matches = matcher(doc)
    if not matches:
        # Nothing matched: avoid the IndexError that matches[0] would raise.
        return {}

    match_id, token_ids = matches[0]
    # token_ids follow the order of the pattern nodes; drop the anchor verb.
    matched_words = [doc[i].text for i in token_ids][1:]
    print(matched_words)

    #########
    # Build the dictionary from the remaining nodes
    #########
    input_dict = {matched_words[0]: matched_words[1], "location": matched_words[4], matched_words[5]: matched_words[6], matched_words[7]: matched_words[8], matched_words[2]: matched_words[3]}
    return input_dict


string = "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage."
print(extract_inputs(string))

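General problem to solve: any compound expression that should be kept together (such as "right arm", "left leg", "large piercing") cannot be extracted as a unit (only arm, leg, or piercing is returned).

Can you help me? Thanks!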

Solution

To summarize your problem: you are getting single words, but you want to capture multiple words as a unit, such as "right arm".

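You can do this with the DependencyMatcher, but it takes some work. Basically, you want to match the whole subtree of each single word you are getting now. In the phrase "right arm", "arm" is the head noun and "right" depends on "arm". All the words that depend on "arm", directly or indirectly (through other words), are called its "subtree".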
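As a rough sketch of what taking the whole subtree looks like in code, the snippet below builds a tiny hand-annotated parse of "You hit the right arm" so that it runs without downloading a model. In a real script the parse would come from your loaded pipeline. The sentence, the heads/deps annotation, and the helper name subtree_phrase are assumptions made for illustration.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: only used for its vocab, no trained model needed.
nlp = spacy.blank("en")

# Hand-written parse (an assumption for this sketch, not parser output).
words = ["You", "hit", "the", "right", "arm"]
heads = [1, 1, 4, 4, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def subtree_phrase(token, skip_deps=("det",)):
    """Join the token's subtree into one string, skipping e.g. determiners."""
    return " ".join(t.text for t in token.subtree if t.dep_ not in skip_deps)


arm = doc[4]  # the single token a matcher pattern would hand back
# The subtree is contiguous here, so it can also be taken as a Span.
full_span = doc[arm.left_edge.i : arm.right_edge.i + 1]
print(full_span.text)
print(subtree_phrase(arm))
```

In your function, the idea would be to call something like subtree_phrase(doc[token_id]) on the matched location and attack tokens instead of reading .text from the single token.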

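Understanding dependencies is a bit complicated but very powerful. I recommend reading Chapter 14 of the Jurafsky and Martin book, which is a straightforward guide to dependency parsing. Feel free to skim a lot of it.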

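That said, for the kind of phrases you want, there is a simpler approach you can try in spaCy. Try the merge_noun_chunks function, which turns noun chunks into single tokens that are easier to work with.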
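Here is a minimal sketch of the effect of merge_noun_chunks, using a hand-annotated parse (an assumption, so the snippet runs without a trained model). With a loaded pipeline that includes a parser you would normally register it with nlp.add_pipe("merge_noun_chunks") rather than calling the function directly as done here.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

# Blank English pipeline: supplies the vocab and the English noun-chunk rules.
nlp = spacy.blank("en")

# Hand-written annotation (an assumption for this sketch, not parser output).
words = ["You", "hit", "the", "right", "arm"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
heads = [1, 1, 4, 4, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Retokenize in place so that every noun chunk becomes one token.
doc = merge_noun_chunks(doc)
tokens = [t.text for t in doc]
print(tokens)
```

Note that the chunk includes the determiner, so the merged token is "the right arm" rather than "right arm". Your word lists or patterns would need to account for that.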

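Noun chunks are a bit hard to define, and the way they work in spaCy may not be exactly what you want, but if you like you can also look at the source and write your own definition. To make that work you will have to understand dependency parsing.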
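If the built-in chunks include more than you want (determiners, for example), one possible custom definition is to merge each noun only with the adjectives or compound words directly to its left that depend on it. This is a sketch under assumptions: the function name merge_modified_nouns, the choice of the amod and compound labels, and the hand-annotated parse are mine for illustration, not spaCy's definition of a noun chunk.

```python
import spacy
from spacy.tokens import Doc


def merge_modified_nouns(doc, modifier_deps=("amod", "compound")):
    """Merge each noun with the adjacent left modifiers attached to it."""
    spans = []
    for token in doc:
        if token.pos_ != "NOUN":
            continue
        start = token.i
        # Walk left while the previous token modifies this noun.
        while (
            start > 0
            and doc[start - 1].head.i == token.i
            and doc[start - 1].dep_ in modifier_deps
        ):
            start -= 1
        if start < token.i:
            spans.append(doc[start : token.i + 1])
    # Collect first, then merge: the spans never overlap.
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


nlp = spacy.blank("en")
# Hand-written annotation (an assumption for this sketch, not parser output).
words = ["You", "hit", "the", "right", "arm", "with", "a", "large", "piercing", "attack"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "DET", "ADJ", "ADJ", "NOUN"]
heads = [1, 1, 4, 4, 1, 1, 9, 9, 9, 5]
deps = ["nsubj", "ROOT", "det", "amod", "dobj", "prep", "det", "amod", "amod", "pobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

doc = merge_modified_nouns(doc)
tokens = [t.text for t in doc]
print(tokens)
```

After the merge, "right arm" and "large piercing attack" are single tokens, so a pattern that matches the head noun returns the whole phrase. How well this works on real input depends on how the parser labels your sentences, especially for Italian, where adjectives often follow the noun and the leftward walk would need to be extended to the right.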