Why isn't my program filtering out stopwords and punctuation the way I programmed it to? (Python & NLTK)

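For a lab in my data science class, I have to write a Python program that uses NLTK to do some natural language processing. We have to use a for loop to iterate over every word in Macbeth and filter out all English stopwords and punctuation by appending the words that are neither stopwords nor punctuation to another list. Then we have to print the most common words in the filtered list along with their frequencies. I thought my logic was correct, but the results include punctuation and stopwords (see below). What am I doing wrong here? (P.S. This is my first time using NLTK.)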

Program:

# import required libraries and modules
import nltk
from nltk.corpus import gutenberg,stopwords
from nltk.probability import FreqDist

macbeth_allwords = gutenberg.words('shakespeare-macbeth.txt') # read in words from macbeth
macbeth_noStop = [] # empty list to hold words from macbeth excluding stopwords
punctuations = [".","!","?",",",";",":","-","[","]","{","}","(",")","/","*","~","<",">","`","^","_","|","#","$","%","+","=","&","@"," "] # list of common punctuation characters

# iterate through each word in macbeth, making a new list excluding all the stopwords and punctuation characters
for word in macbeth_allwords:
    if (word not in stopwords.words('english')) or (word not in punctuations):
        macbeth_noStop.append(word)

macbeth_freq = FreqDist(macbeth_noStop) # get word frequencies from the filtered list of words from macbeth

# print the 50 most common words from the filtered list of words from macbeth
print("50 Most Common Words in Macbeth (no stopwords or punctuation):")
print("-----------------------------------------------")
print(macbeth_freq.most_common(50))

Output:

50 Most Common Words in Macbeth (no stopwords or punctuation):
-----------------------------------------------
[(',',1962),('.',1235),("'",637),('the',531),(':',477),('and',376),('I',333),('of',315),('to',311),('?',241),('d',224),('a',214),('you',184),('in',173),('my',170),('And',('is',166),('that',158),('not',155),('it',138),('Macb',137),('with',134),('s',131),('his',129),('be',124),('The',118),('haue',117),('me',111),('your',110),('our',103),('-',100),('him',90),('for',82),('Enter',80),('That',('this',79),('he',76),('What',74),('To',73),('so',70),('all',67),('thou',63),('are',('will',62),('Macbeth',61),('thee',('but',60),('But',('on',59),('they',58)]

Solution

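Everything is correct except the logical condition. You meant to use and rather than or. With or, a word is kept if it is missing from either list, and since no word appears in both lists, every word passes: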

if word not in stopwords.words('english') and word not in punctuations:
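
To make the difference concrete, here is a minimal sketch that runs without NLTK. The names stop_words, punctuations and tokens are small hand-picked stand-ins assumed for illustration, not the real NLTK stopword list or the Macbeth corpus.

```python
# Tiny stand-ins for the real lists (assumed for illustration only)
stop_words = {"the", "and", "of"}
punctuations = {",", ".", ":"}

tokens = ["the", ",", "Macbeth", "and", ".", "thane"]

# With `or`, a token is dropped only if it is in BOTH collections,
# which never happens, so nothing is filtered out.
kept_with_or = [t for t in tokens if t not in stop_words or t not in punctuations]

# With `and`, a token is kept only if it is in NEITHER collection.
kept_with_and = [t for t in tokens if t not in stop_words and t not in punctuations]

print(kept_with_or)
print(kept_with_and)
```

The first list comes back identical to the input, which matches the unfiltered output in the question. The second contains only the two content words.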

Pedantic note: you can use a set instead of a list for the punctuation characters, which makes the lookups faster :)
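
A rough sketch of that idea follows. The helper name filter_tokens and the sample data are hypothetical. The sets are built once up front, so the stopword list is not rebuilt for every word as it would be when stopwords.words('english') is called inside the loop condition.

```python
def filter_tokens(tokens, stop_words, punctuations):
    """Return the tokens that are neither stopwords nor punctuation."""
    stop_set = set(stop_words)      # built once, not once per token
    punct_set = set(punctuations)   # set membership is a hash lookup
    return [t for t in tokens if t not in stop_set and t not in punct_set]


# Hand-picked sample data (assumed for illustration, not the NLTK lists)
sample_stop_words = ["the", "and", "of", "to"]
sample_punctuations = [",", ".", ":", "?"]
sample_tokens = ["the", "thane", "of", "Cawdor", ",", "and", "Macbeth", "."]

filtered = filter_tokens(sample_tokens, sample_stop_words, sample_punctuations)
print(filtered)
```

In the program from the question, the same helper could presumably be called with macbeth_allwords, stopwords.words('english') and the punctuations list.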

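As mentioned in the previous answer, the operator used is incorrect. The same filter can be written as a list comprehension: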

import string # needed for string.punctuation
macbeth_noStop = [token for token in macbeth_allwords if token not in string.punctuation and token not in stopwords.words('english')]

Also, you can import string and use string.punctuation instead of a hand-written list of punctuation characters.
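
One caveat worth sketching: string.punctuation is a single string, so token not in string.punctuation is a substring test rather than an exact match. A possible variant, assuming you also want to drop multi-character tokens made up entirely of punctuation (such as "--"), checks each character against a set. The helper is_punctuation and the sample tokens are hypothetical.

```python
import string

# Set of the individual ASCII punctuation characters
punct_chars = set(string.punctuation)


def is_punctuation(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in punct_chars for ch in token)


# Sample tokens (assumed for illustration)
tokens = ["Macbeth", ",", "--", "thane", "?", "o'er"]
words_only = [t for t in tokens if not is_punctuation(t)]
print(words_only)
```

Words that merely contain an apostrophe, like "o'er", are kept because not every character in them is punctuation.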
