Why doesn't my program filter out stopwords and punctuation the way I programmed it to? (Python & NLTK)
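For a lab in my data science class, I have to write a Python program that uses NLTK to do natural language processing. We have to use a for loop to iterate over every word of Macbeth and filter out all English stopwords and punctuation by appending the words that are neither to another list. Then we have to print the most common words in the filtered list along with their frequencies. I thought I had done everything correctly, logically speaking, but the result includes punctuation and stopwords (see below). What am I doing wrong here? (P.S. This is my first time using NLTK.)
Program: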
# import required libraries and modules
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.probability import FreqDist

macbeth_allwords = gutenberg.words('shakespeare-macbeth.txt')  # read in words from macbeth
macbeth_noStop = []  # empty list to hold words from macbeth excluding stopwords
punctuations = [".", "!", "?", ",", ";", ":", "-", "[", "]", "{", "}", "(", ")", "/", "*", "~", "<", ">", "`", "^", "_", "|", "#", "$", "%", "+", "=", "&", "@", " "]  # list of common punctuation characters

# iterate through each word in macbeth, making a new list excluding all the stopwords and punctuation characters
for word in macbeth_allwords:
    if (word not in stopwords.words('english')) or (word not in punctuations):
        macbeth_noStop.append(word)

macbeth_freq = FreqDist(macbeth_noStop)  # get word frequencies from the filtered list of words from macbeth

# print the 50 most common words from the filtered list of words from macbeth
print("50 Most Common Words in Macbeth (no stopwords or punctuation):")
print("-----------------------------------------------")
print(macbeth_freq.most_common(50))
Output:
50 Most Common Words in Macbeth (no stopwords or punctuation):
-----------------------------------------------
[(',',1962),('.',1235),("'",637),('the',531),(':',477),('and',376),('I',333),('of',315),('to',311),('?',241),('d',224),('a',214),('you',184),('in',173),('my',170),('And',('is',166),('that',158),('not',155),('it',138),('Macb',137),('with',134),('s',131),('his',129),('be',124),('The',118),('haue',117),('me',111),('your',110),('our',103),('-',100),('him',90),('for',82),('Enter',80),('That',('this',79),('he',76),('What',74),('To',73),('so',70),('all',67),('thou',63),('are',('will',62),('Macbeth',61),('thee',('but',60),('But',('on',59),('they',58)]
Solution
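Everything is correct except the logical condition. With or, the test is true whenever the word is missing from at least one of the two lists, which is the case for every word, so nothing gets filtered out. You meant to use and rather than or:
    if word not in stopwords.words('english') and word not in punctuations: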
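To see the difference between the two operators without downloading any corpus, here is a minimal sketch. The short `stop_words`, `punctuations`, and `tokens` lists are made-up stand-ins for the NLTK data, not the real lists.

```python
# Hypothetical stand-ins for stopwords.words('english') and the punctuation list
stop_words = ["the", "and", "of"]
punctuations = [",", ".", "?"]
tokens = ["the", "king", ",", "and", "queen", "."]

# Original condition: true whenever the token is missing from at least one
# of the two lists. No token here is in both, so every token is kept.
kept_with_or = [t for t in tokens if t not in stop_words or t not in punctuations]

# Corrected condition: the token must be missing from both lists.
kept_with_and = [t for t in tokens if t not in stop_words and t not in punctuations]

print(kept_with_or)   # ['the', 'king', ',', 'and', 'queen', '.']
print(kept_with_and)  # ['king', 'queen']
```

By De Morgan's law, "not a stopword and not punctuation" is the same as "not (a stopword or punctuation)", which is the filter the assignment asks for.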
Pedantic note: you could use a set instead of a list for the punctuation, so the lookups would be faster :)
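A rough sketch of the set idea follows. `filter_tokens` is a hypothetical helper name, and the sample inputs are invented for illustration; with NLTK you would presumably pass `macbeth_allwords`, `stopwords.words('english')`, and `punctuations` instead.

```python
def filter_tokens(tokens, stop_words, punctuation):
    """Return the tokens that are neither stopwords nor punctuation."""
    # Build the lookup set once, outside the loop. Membership tests on a set
    # take roughly constant time, while a list is scanned item by item.
    excluded = set(stop_words) | set(punctuation)
    return [token for token in tokens if token not in excluded]


# Small hypothetical inputs standing in for the NLTK corpus and stopword list
sample = ["Enter", "the", "king", ",", "and", "the", "queen", "."]
print(filter_tokens(sample, ["the", "and"], [",", "."]))  # ['Enter', 'king', 'queen']
```

Building the set once also avoids calling `stopwords.words('english')` on every loop iteration, as the original loop does.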
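As mentioned in the previous answer, the operator used is incorrect: the condition needs and, not or.
Also, you can import string and use string.punctuation instead of a hand-written list:
import string
macbeth_noStop = [token for token in macbeth_allwords if token not in string.punctuation and token not in stopwords.words('english')]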
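One thing to keep in mind with `string.punctuation`: it is a plain string, so `in` performs a substring test rather than a per-item lookup. The sketch below shows that behavior and one possible way to treat a token as punctuation only when all of its characters are punctuation. `is_punctuation` is a hypothetical helper and the token list is made up.

```python
import string

# `in` on a string is a substring test, so a multi-character run matches only
# if it happens to appear contiguously inside string.punctuation.
print(",-" in string.punctuation)  # True
print("--" in string.punctuation)  # False

punct_chars = set(string.punctuation)


def is_punctuation(token):
    # True only for a non-empty token made up entirely of punctuation characters
    return bool(token) and all(ch in punct_chars for ch in token)


tokens = ["Macbeth", "--", ",", "haue", "?"]
words_only = [t for t in tokens if not is_punctuation(t)]
print(words_only)  # ['Macbeth', 'haue']
```

Whether multi-character tokens such as "--" should count as punctuation depends on the assignment; this is just one reasonable reading.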