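How to fix stop-word removal failing when using nltk stopwords to remove them from lists in a pandas column
I have a DataFrame with string entries, and I am using a function to remove stop words. The cell runs without errors but does not produce the expected result.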
df['column'].iloc[0] = 'BK HE HAS KITCHEN TROUBLE WITH HIS BLENDER'
import string
from nltk.corpus import stopwords

def text_process(text):
    try:
        nopunc = [char for char in text if char not in string.punctuation]
        nopunc = ''.join(nopunc)
        return [word for word in nopunc.split() if word.lower not in stopwords.words('english')]
    except TypeError:
        return []

df['column'].apply(text_process)
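The result for the first cell looks like this:
['BK','HE','HAS','KITCHEN','TROUBLE','WITH','HIS','BLENDER']
Words such as HE, HAS, and HIS should have been removed, but they still appear in the cell. Can anyone explain why this happens or how to fix it?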
Solution
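# Requires the NLTK stop-word corpus and tokenizer data (fetched with nltk.download).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "BK HE HAS KITCHEN TROUBLE WITH HIS BLENDER"
# NLTK's English stop words are lowercase, so lowercase the text before comparing.
example_sent = example_sent.lower()

# Build the set once; set membership tests are fast.
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written as an explicit loop.
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)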
['bk', 'he', 'has', 'kitchen', 'trouble', 'with', 'his', 'blender']
['bk', 'kitchen', 'trouble', 'blender']
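As for why the function in the question keeps every word: it compares word.lower (the method object, without call parentheses) against the stop-word list, and a method object is never equal to a string, so the filter never rejects anything. The following is a minimal sketch of a corrected text_process, not a definitive implementation. STOP_WORDS is a small hard-coded stand-in (an assumption, so the sketch runs without the NLTK corpus); in real use it would likely be set(stopwords.words('english')). The isinstance check replaces the original try/except as one possible way to handle non-string cells.

```python
import string

# Stand-in for set(stopwords.words('english')): a tiny hard-coded subset
# (assumption) so this sketch runs without downloading the NLTK corpus.
STOP_WORDS = {"he", "has", "his", "with"}


def text_process(text):
    """Strip punctuation, then drop stop words case-insensitively."""
    if not isinstance(text, str):
        # Non-string cells (for example NaN or None) yield an empty list.
        return []
    # Remove punctuation character by character and rejoin with no separator.
    nopunc = "".join(ch for ch in text if ch not in string.punctuation)
    # Note the call parentheses: word.lower() returns a string, whereas
    # word.lower is a method object that never matches a stop word.
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]


sentence = "BK HE HAS KITCHEN TROUBLE WITH HIS BLENDER"
print(text_process(sentence))  # ['BK', 'KITCHEN', 'TROUBLE', 'BLENDER']

# With a DataFrame, the same function would be applied per cell:
#     df['column'] = df['column'].apply(text_process)
```

Note that apply returns a new Series rather than modifying the column in place, so the result has to be assigned back (as in the final comment) for the DataFrame to change.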