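Error when iterating over a function that aggregates text values

How to fix an error when iterating over a function that aggregates text values

I'm having trouble with my function. It is designed to aggregate word tokens into a dictionary.

Here is the code: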

def preprocess (texts):
   case = truecase.get_true_case(texts)
   doc = nlp(case)
   return doc

def summarize_texts(texts):
    doc = preprocess(texts) 
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text,0) +1
    for token in doc.ents:
         entities[token.label_] = [token.text]
    return {
        'actions': actions, 'entities': entities
    }

The problem I'm running into is that the function works fine for a single input:

summarize_texts("Play something by Billie Holiday")

{'actions': {'play': 1},'entities': {'PERSON': ['Billie']}}

But the goal is to be able to pass a list or a CSV file through it and aggregate the results.

When I try:

docs = [
    "Play something by Billie Holiday","Set a timer for five minutes","Play it again,Sam"
]
summarize_texts(docs)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-200347d5cac5> in <module>()
      4     "Play it again,Sam"
      5 ]
----> 6 summarize_texts(docs)

5 frames
<ipython-input-16-08c879553d6e> in summarize_texts(texts)
      1 def summarize_texts(texts):
----> 2     doc = preprocess(texts)
      3     actions = {}
      4     entities = {}
      5     for token in doc:

<ipython-input-12-fccf767830b1> in preprocess(texts)
      1 def preprocess (texts):
----> 2    case = truecase.get_true_case(texts)
      3    doc = nlp(case)
      4    return doc

/usr/local/lib/python3.6/dist-packages/truecase/__init__.py in get_true_case(sentence,out_of_vocabulary_token_option)
     14     return get_truecaser().get_true_case(
     15         sentence,
---> 16         out_of_vocabulary_token_option=out_of_vocabulary_token_option)

/usr/local/lib/python3.6/dist-packages/truecase/TrueCaser.py in get_true_case(self,sentence,out_of_vocabulary_token_option)
     97             as-is: Returns OOV tokens as is
     98         """
---> 99         tokens = self.tknzr.tokenize(sentence)
    100 
    101         tokens_true_case = []

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in tokenize(self,text)
    293         """
    294         # Fix HTML character entities:
--> 295         text = _replace_html_entities(text)
    296         # Remove username handles
    297         if self.strip_handles:

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text,keep,remove_illegal,encoding)
    257         return "" if remove_illegal else match.group(0)
    258 
--> 259     return ENT_RE.sub(_convert_entity,_str_to_unicode(text,encoding))
    260 
    261 

TypeError: expected string or bytes-like object

I expected this output:

{'actions': {'play': 2,'set': 1},'entities': {'PERSON': ['Billie','Sam'],'TIME': ['five minutes']}}

I'm not sure what is wrong with my function.

Solution

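It looks like the problem is that truecase.get_true_case(texts) expects a single string (or bytes-like object), but you are passing it a list of strings. The last frames of the traceback show the list reaching a regular-expression substitution inside NLTK's tokenizer, which is what raises the TypeError.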
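The failure can be reproduced without truecase or NLTK installed. The sketch below is only an illustration: ENTITY_RE and strip_entities are hypothetical stand-ins for the substitution step shown at the bottom of the traceback, not NLTK's actual code.

```python
import re

# Hypothetical stand-in for the entity pattern NLTK applies internally.
ENTITY_RE = re.compile(r"&\w+;")

def strip_entities(text):
    # Pattern.sub() accepts only str or bytes-like input,
    # so passing a list raises TypeError.
    return ENTITY_RE.sub("", text)

def fails_on(value):
    """Return True if strip_entities rejects the value with a TypeError."""
    try:
        strip_entities(value)
    except TypeError:
        return True
    return False

print(fails_on("Play it again, Sam"))    # a single string is accepted
print(fails_on(["Play it again, Sam"]))  # a list of strings is rejected
```

The same thing happens several frames deep in the original call, which is why the error message never mentions summarize_texts or the list directly.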

You need to loop over texts and preprocess each item in the list separately:

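import spacy
import truecase

# Assumption: the question does not say which spaCy pipeline it loads;
# any English pipeline with a tagger and NER should work here.
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Restore capitalization, then run the spaCy pipeline on one string.
    case = truecase.get_true_case(text)
    return nlp(case)

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                # Read and write the same key (the lemma) so counts accumulate.
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for ent in doc.ents:
            # Append instead of overwriting so earlier entities are kept.
            entities.setdefault(ent.label_, []).append(ent.text)
    return {'actions': actions, 'entities': entities}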
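If you would rather keep a function that summarizes one text at a time, another option is to merge the per-text results afterwards. This is a minimal sketch using only the standard library; merge_summaries is a hypothetical helper, and the per_text values are hand-written to mirror the output shape in the question rather than taken from a real spaCy run.

```python
from collections import Counter, defaultdict

def merge_summaries(summaries):
    """Combine per-text summaries into one aggregate summary."""
    actions = Counter()
    entities = defaultdict(list)
    for summary in summaries:
        # Counter.update adds counts for keys that already exist.
        actions.update(summary["actions"])
        for label, names in summary["entities"].items():
            # extend keeps entities collected from earlier texts.
            entities[label].extend(names)
    return {"actions": dict(actions), "entities": dict(entities)}

# Illustrative input only, shaped like the single-text output in the question.
per_text = [
    {"actions": {"play": 1}, "entities": {"PERSON": ["Billie"]}},
    {"actions": {"set": 1}, "entities": {"TIME": ["five minutes"]}},
    {"actions": {"play": 1}, "entities": {"PERSON": ["Sam"]}},
]
print(merge_summaries(per_text))
```

Counter and defaultdict remove the need for the get(..., 0) and "create the list if missing" bookkeeping, which is where overwriting bugs tend to creep in.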

Try looping over the texts with a for loop and calling preprocess on each one:

for i in texts:
    doc = preprocess(i)
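Since the question also mentions CSV input, here is a rough sketch of how the texts could be collected into a list before looping. Both helpers are hypothetical, and the "text" column name is an assumption; adjust it to match your file's header.

```python
import csv
import io

def as_text_list(texts):
    """Wrap a lone string in a list so callers can always iterate."""
    return [texts] if isinstance(texts, str) else list(texts)

def read_texts(csv_file, column="text"):
    """Collect one column from an open CSV file ('text' is an assumed header)."""
    return [row[column] for row in csv.DictReader(csv_file)]

# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "text\n"
    "Play something by Billie Holiday\n"
    "Set a timer for five minutes\n"
    '"Play it again, Sam"\n'
)
print(read_texts(sample))
print(as_text_list("Play something by Billie Holiday"))
```

Either helper returns a plain list of strings, so the result can be passed straight to a loop like the one above.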