Unstructured data, NLP: lemmatizing book reviews

How to handle unstructured data, NLP: lemmatizing book reviews

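Here I'm trying to read the contents of a file, say "book1.txt". I need to remove all special characters and punctuation, then tokenize the content into words with NLTK's word tokenizer. After that I want to lemmatize those tokens with WordNetLemmatizer and write them one by one to a CSV file. This is the code I'm using. It obviously doesn't work, but I'd just like some suggestions.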

import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize

file_out=open('data.csv','w')
with open('book1.txt','r') as myfile:
  for s in myfile:
    words = nltk.word_tokenize(s)
    words=[word.lower() for word in words if word.isalpha()]
    for word in words:
      token=WordNetLemmatizer().lemmatize(words,'v')
      filtered_sentence=[""]
      for n in words:
        if n not in token:
          filtered_sentence.append(""+n)
        file_out.writelines(filtered_sentence+["\n"])

Solution

There are a few problems here, most notably the last two for loops.

The way you've written them, the output looks like this:

word1
word1word2
word1word2word3
word1word2word3word4
........etc

I'm guessing that isn't the intended output. I assume the intended output is:

word1
word2
word3
word4
........etc (without creating duplicates)
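
To make the difference between the two outputs concrete, here is a minimal sketch that needs no NLTK at all. The word list and the in-memory buffers are stand-ins made up for illustration; only the write pattern matters.

```python
import io

words = ["word1", "word2", "word3", "word4"]  # hypothetical tokens

# Pattern that produces the growing lines: the list keeps accumulating,
# and the whole list is written out again after every append.
buggy = io.StringIO()
accumulated = []
for w in words:
    accumulated.append(w)
    buggy.writelines(accumulated + ["\n"])

# Intended pattern: write each token exactly once, followed by a newline.
fixed = io.StringIO()
for w in words:
    fixed.write(w + "\n")

print(buggy.getvalue())
print(fixed.getvalue())
```

Note that writelines() never adds separators on its own, which is why the accumulated words run together on each line.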

I applied the following code to a three-paragraph Cat Ipsum file. Note that I changed some variable names to match my own naming conventions.

import nltk
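nltk.download('punkt')
# assumption: recent NLTK releases load the tokenizer from 'punkt_tab' instead;
# if word_tokenize raises a LookupError, try nltk.download('punkt_tab')
nltk.download('wordnet')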
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint


# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())
words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]


# create the lemmatized word list
lemmatizer = WordNetLemmatizer()  # create once and reuse for every word
results = []
for word in words:
    # you were using words instead of word below
    token = lemmatizer.lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)


# sort results,just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))
with open("nltk_data.csv","w") as outfile:
    # writelines() adds no line separators, so append "\n" to each token
    outfile.writelines(token + "\n" for token in results)
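
Since the question imports csv and asks for a CSV file, the save step could also go through the csv module. The following is only a sketch under my own assumptions: the helper name write_tokens_csv, the demo file name, and the sample tokens are all made up, and dict.fromkeys is used to drop duplicates while keeping first-seen order (guaranteed from Python 3.7 on) instead of sorting.

```python
import csv

def write_tokens_csv(tokens, path):
    """Write unique tokens to `path`, one per row, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    unique = list(dict.fromkeys(tokens))
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerows([token] for token in unique)
    return unique

# usage with made-up lemmas; in the script above you would pass the lemmatized tokens
write_tokens_csv(["cat", "sleep", "cat", "purr"], "tokens_demo.csv")
```

Each token ends up in its own single-column row, so the file opens cleanly in a spreadsheet. If you want alphabetical order as in the script above, sort the list before passing it in.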