How to remove stop words and get lemmas in a pandas DataFrame using spaCy?

I have a column of tokens in a pandas DataFrame in Python. It looks something like this:

 word_tokens
 (the,cheeseburger,was,great)
 (i,never,did,like,the,pizza,too,much)
 (yellow,submarine,only,an,ok,song)

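I want to use the spaCy library to get two new columns in this DataFrame: one containing each row's tokens with the stop words removed, and another containing the lemmas of the tokens in that second column. How can I do that?

Solution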

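You're right about making your text a spaCy type: you want to turn each tuple of tokens into a spaCy Doc. From there, it is best to use the attributes of the tokens to answer "is this token a stop word?" (use token.is_stop) or "what is the lemma of this token?" (use token.lemma_). My implementation is below. I altered your input data slightly to include some plurals so you can see that the lemmatization works properly.

import spacy
import pandas as pd

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

texts = [('the','cheeseburger','was','great'),
         ('i','never','did','like','the','pizzas','too','much'),
         ('yellowed','submarines','only','an','ok','song')]

df = pd.DataFrame({'word_tokens': texts})

The initial DataFrame looks like this:

	word_tokens
0	('the','cheeseburger','was','great')
1	('i','never','did','like','the','pizzas','too','much')
2	('yellowed','submarines','only','an','ok','song')

I define functions to perform the main tasks:

tuple of tokens -> spaCy Doc
spaCy Doc -> list of non-stop words
spaCy Doc -> list of non-stop, lemmatized words

def to_doc(words:tuple) -> spacy.tokens.Doc:
    # Create spaCy documents by joining the words into a string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]

Applying these looks like:

# create documents for all tuples of tokens
docs = list(map(to_doc,df.word_tokens))

# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops,docs))

# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize,docs))

The output you get should look something like this (exact results can vary with the spaCy model version):

	word_tokens	removed_stops	lemmatized
0	('the','cheeseburger','was','great')	['cheeseburger','great']	['cheeseburger','great']
1	('i','never','did','like','the','pizzas','too','much')	['like','pizzas']	['like','pizza']
2	('yellowed','submarines','only','an','ok','song')	['yellowed','submarines','song']	['yellow','submarine','song']

Depending on your use case, you may want to explore other attributes of spaCy's Doc object (https://spacy.io/api/doc). In particular, take a look at doc.noun_chunks and doc.ents if you want to extract more meaning from the text.

It is also worth noting that if you plan to use this on a very large number of texts, you should consider nlp.pipe: https://spacy.io/usage/processing-pipelines. It processes your documents in batches rather than one by one, and can make your implementation more efficient.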
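As a rough illustration of the batching idea, here is a minimal sketch that runs nlp.pipe over the token column. It assumes a blank English pipeline (spacy.blank("en")), which provides the tokenizer and stop-word flags without a model download; lemmas would still need a trained pipeline such as en_core_web_sm. The two-row DataFrame and the batch_size value are only illustrative.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline (tokenizer + stop-word list only),
# so no model download is needed; lemmas would require a trained pipeline.
nlp = spacy.blank("en")

df = pd.DataFrame({"word_tokens": [
    ("the", "cheeseburger", "was", "great"),
    ("i", "never", "did", "like", "the", "pizzas", "too", "much"),
]})

# nlp.pipe takes an iterable of strings and yields Docs in the same order
joined = (" ".join(words) for words in df["word_tokens"])
docs = list(nlp.pipe(joined, batch_size=64))

# Keep the text of every token that is not flagged as a stop word
df["removed_stops"] = [
    [token.text for token in doc if not token.is_stop] for doc in docs
]
print(df["removed_stops"].tolist())
```

Because nlp.pipe yields Docs in input order, the resulting list can be assigned straight back to the DataFrame as a new column.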

If you are working with spaCy, you should make your text a spaCy type, something like this:

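 import spacy

 nlp = spacy.load("en_core_web_sm")
 # topic_data is the DataFrame that holds the word_tokens column
 rows = topic_data['word_tokens'].values.tolist()
 # Join tokens with spaces; str() on a tuple would leak quotes and parentheses into the text
 text = ' '.join(' '.join(tokens) for tokens in rows)
 text = nlp(text)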

This makes it easier to work with. You can then tokenize the words like this:

 token_list = []
 for token in text:
     token_list.append(token.text)

and remove the stop words like this:

 token_list = [word for word in token_list if word not in nlp.Defaults.stop_words]
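One caveat with testing membership in nlp.Defaults.stop_words is that the set holds lowercase entries, so capitalized words slip through. The sketch below compares that approach with token.is_stop, which ignores case in current spaCy versions. It assumes a blank English pipeline so that no model download is needed; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since stop-word
# flags come from the language defaults rather than a trained model.
nlp = spacy.blank("en")
doc = nlp("The cheeseburger was great")

# Membership test on the raw stop-word set is case-sensitive
by_set = [t.text for t in doc if t.text not in nlp.Defaults.stop_words]

# token.is_stop compares the lowercased token text against the same set
by_attr = [t.text for t in doc if not t.is_stop]

print(by_set)
print(by_attr)
```

If the tokens are already lowercased, as in the question, both approaches should agree.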

I haven't figured out the lemmatization part yet, but this is a start.
