How to fix an error when passing a list to a function that aggregates text values
Here is the code:
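import spacy
import truecase

# Assumption: the post does not show how nlp was created; any English
# spaCy pipeline with a tagger and NER should behave similarly.
nlp = spacy.load("en_core_web_sm")

def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    doc = preprocess(texts)
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) + 1
    for token in doc.ents:
        entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }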
The function works fine for a single input:
summarize_texts("Play something by Billie Holiday")
{'actions': {'play': 1},'entities': {'PERSON': ['Billie']}}
But the goal is to pass a list or a CSV file through it and get one aggregated summary.
When I try:
docs = [
    "Play something by Billie Holiday",
    "Set a timer for five minutes",
    "Play it again,Sam"
]
summarize_texts(docs)
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-200347d5cac5> in <module>()
4 "Play it again,Sam"
5 ]
----> 6 summarize_texts(docs)
5 frames
<ipython-input-16-08c879553d6e> in summarize_texts(texts)
1 def summarize_texts(texts):
----> 2 doc = preprocess(texts)
3 actions = {}
4 entities = {}
5 for token in doc:
<ipython-input-12-fccf767830b1> in preprocess(texts)
1 def preprocess (texts):
----> 2 case = truecase.get_true_case(texts)
3 doc = nlp(case)
4 return doc
/usr/local/lib/python3.6/dist-packages/truecase/__init__.py in get_true_case(sentence,out_of_vocabulary_token_option)
14 return get_truecaser().get_true_case(
15 sentence,
---> 16 out_of_vocabulary_token_option=out_of_vocabulary_token_option)
/usr/local/lib/python3.6/dist-packages/truecase/TrueCaser.py in get_true_case(self,sentence,out_of_vocabulary_token_option)
97 as-is: Returns OOV tokens as is
98 """
---> 99 tokens = self.tknzr.tokenize(sentence)
100
101 tokens_true_case = []
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in tokenize(self,text)
293 """
294 # Fix HTML character entities:
--> 295 text = _replace_html_entities(text)
296 # Remove username handles
297 if self.strip_handles:
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text,keep,remove_illegal,encoding)
257 return "" if remove_illegal else match.group(0)
258
--> 259 return ENT_RE.sub(_convert_entity,_str_to_unicode(text,encoding))
260
261
TypeError: expected string or bytes-like object
I would like to get this output:
{'actions': {'play': 2,'set': 1},'entities': {'PERSON': ['Billie','Sam'],'TIME': ['five minutes']}}
I am not sure what is wrong with my function.
Solution
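It looks like the problem is that truecase.get_true_case(texts) expects a single string or bytes-like argument, and you are passing it a list of strings.
You need to loop over texts and preprocess each item in the list separately: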
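def preprocess(text):
    # get_true_case expects one string, so handle a single text at a time
    case = truecase.get_true_case(text)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                # read and write the same key (the lemma) so counts accumulate
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for ent in doc.ents:
            # append instead of overwriting so earlier entities are kept
            entities.setdefault(ent.label_, []).append(ent.text)
    return {
        'actions': actions,
        'entities': entities
    }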
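If you would rather keep a summarize function that handles one text at a time, another option is to combine the per-text results afterwards. The sketch below uses only the standard library; merge_summaries is a hypothetical helper name (not part of spaCy or truecase), and it assumes each input dict has the same 'actions' and 'entities' shape shown in the question.

```python
from collections import Counter, defaultdict

def merge_summaries(summaries):
    """Combine per-text summaries into one aggregated summary.

    Hypothetical helper: assumes each item looks like
    {'actions': {lemma: count}, 'entities': {label: [text, ...]}}.
    """
    actions = Counter()
    entities = defaultdict(list)
    for summary in summaries:
        # Counter.update adds counts for keys that already exist
        actions.update(summary["actions"])
        for label, texts in summary["entities"].items():
            # extend keeps entities collected from earlier summaries
            entities[label].extend(texts)
    return {"actions": dict(actions), "entities": dict(entities)}
```

With a per-text function in place, the call would look something like merge_summaries(summarize_one(t) for t in docs), where summarize_one stands for whatever single-text version you keep.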
Try looping over the texts with a for loop and calling preprocess on each one:
for i in texts:
    doc = preprocess(i)
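Since the question also mentions CSV input and single strings, here is a rough sketch of how the inputs could be normalized to a list before looping. It uses only the standard library; as_text_list and read_text_column are hypothetical helper names, and the column name "text" is an assumption about how the CSV is laid out.

```python
import csv
import io

def as_text_list(texts):
    """Wrap a single string in a list; turn any other iterable into a list."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)

def read_text_column(csv_file, column="text"):
    """Collect the non-empty values of one column from an open CSV file.

    Assumes the file has a header row containing `column`.
    """
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if (row[column] or "").strip()]

# Small in-memory stand-in for a real file opened with open(path, newline="")
sample = io.StringIO(
    "text\n"
    "Play something by Billie Holiday\n"
    "Set a timer for five minutes\n"
    '"Play it again, Sam"\n'
)
docs = read_text_column(sample)
```

The resulting list can then be passed to a summarize_texts that loops over its argument, and as_text_list lets the same function accept a plain string as well.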