TypeError: lemmatization with nltk

How to fix a TypeError when lemmatizing with nltk

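The files first have to be tokenized into sentences, then each sentence has to be converted into an array of words, which can then be tagged by the nltk tagger. With the tags, lemmatization can be done, and stemming can be added on top of it. This code is from (How to provide (or generate) tags for nltk lemmatizers)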

    from nltk.tokenize import sent_tokenize,word_tokenize
    # use sent_tokenize to split text into sentences, and word_tokenize
    # to split sentences into words
    from nltk.tag import pos_tag
    # use this to generate array of tuples (word,tag)
    # it can be then translated into wordnet tag as in
    # [this response][1]. 
    from nltk.stem.wordnet import WordNetLemmatizer
    
    # code from response mentioned above
    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''    
    
    
    with open('filename.csv','r') as f:
        data = f.read()
        sentences = sent_tokenize(data)
        ignoreTypes = ['TO','CD','.','LS',''] # my choice
        sentence =[]
        lmtzr = WordNetLemmatizer()
        for sent in sentences:
            words = word_tokenize(sentence)
            tags = pos_tag(words)
            for (word,type) in tags:
                if type in ignoreTypes:
                    continue
                tag = get_wordnet_pos(type)
                if tag == '':
                    continue
                lema = lmtzr.lemmatize(word,tag)

When I try to use the code above, I get the following error. Also, how can I write the result to a csv file?

TypeError                                 Traceback (most recent call last)
<ipython-input-8-b89f61d662a8> in <module>()
     29     lmtzr = WordNetLemmatizer()
     30     for sent in sentences:
---> 31         words = word_tokenize(sentence)
     32         tags = pos_tag(words)
     33         for (word,type) in tags:

8 frames
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/punkt.py in _slices_from_text(self,text)
   1287     def _slices_from_text(self,text):
   1288         last_break = 0
-> 1289         for match in self._lang_vars.period_context_re().finditer(text):
   1290             context = match.group() + match.group('after_tok')
   1291             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

Thanks

Solution

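I see two problems here. First, about the error: it happens because you pass the sentence variable, which is a list, to word_tokenize, which expects a string such as "asd", not a list or an arbitrary object.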
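To see why a list produces this exact message, note that the last frame of the traceback is a compiled regular expression calling finditer(text), and Python's re module rejects anything that is not a str or bytes-like object. A minimal sketch using only the standard library follows; the pattern and the helper name find_words are stand-ins for illustration, not punkt's actual regex or API.

```python
import re

# Arbitrary stand-in pattern; punkt's real sentence-boundary regex is more complex.
pattern = re.compile(r"\w+")


def find_words(text):
    """Return every regex match in text, the way a regex-based tokenizer scans it."""
    return [m.group() for m in pattern.finditer(text)]


sentence = []               # the empty list from the question
sent = "Dogs are running."  # one string, like an element of sent_tokenize's result

words = find_words(sent)    # a str is accepted

try:
    find_words(sentence)    # a list is rejected by the re module
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(words)
print(error_message)
```

The message begins with "expected string or bytes-like object", the same text as in the traceback; newer Python versions also append the offending type.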

Second, you pass sentence to word_tokenize; judging from your loop, you probably meant to pass sent?

If I am wrong, please ignore the second suggestion.
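Putting both points together, here is a hedged sketch of the loop with sent passed to word_tokenize, which also collects the results and writes them to a csv file as asked in the question. The function names lemmatize_text and write_rows and the column names are hypothetical. The question's code never imports wordnet (from nltk.corpus import wordnet), which would raise a NameError once the TypeError is gone; this sketch sidesteps that by using the single-letter values that wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV stand for. It assumes the NLTK tokenizer, tagger and WordNet data are already downloaded.

```python
import csv

# WordNet part-of-speech constants are single letters; these literals equal
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
WORDNET_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
IGNORE_TYPES = {'TO', 'CD', '.', 'LS', ''}  # same choice as in the question


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or '' if there is none."""
    return WORDNET_POS.get(treebank_tag[:1], '')


def lemmatize_text(data):
    """Return (word, treebank_tag, lemma) rows for every kept word in data."""
    # Imported here (an illustrative choice) so the helpers above can be used
    # and tested without NLTK and its data packages.
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tag import pos_tag
    from nltk.tokenize import sent_tokenize, word_tokenize

    lmtzr = WordNetLemmatizer()
    rows = []
    for sent in sent_tokenize(data):
        # Pass the loop variable `sent` (a str), not the empty list `sentence`.
        for word, treebank_tag in pos_tag(word_tokenize(sent)):
            wn_tag = get_wordnet_pos(treebank_tag)
            if treebank_tag in IGNORE_TYPES or wn_tag == '':
                continue
            rows.append((word, treebank_tag, lmtzr.lemmatize(word, wn_tag)))
    return rows


def write_rows(path, rows):
    """Write the rows to a csv file, preceded by a header line."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['word', 'tag', 'lemma'])
        writer.writerows(rows)
```

With the file contents read into data as in the question, write_rows('lemmas.csv', lemmatize_text(data)) would produce one row per kept word; the output file name is only an example.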

Good luck

The error indicates that your sentence variable is not of type string or bytes, as expected. I don't know what value (type) the call to sent_tokenize(data) returned.

A quick and dirty way to check the type of the sentences variable is to call print(type(sentences)) right after the call to sent_tokenize.
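For illustration, a minimal sketch of that check with stand-in values, so it runs without NLTK. sent_tokenize returns a list of strings, so sentences is a list and each of its elements is a str, while the sentence variable from the question is an empty list. The helper name require_text is hypothetical.

```python
def require_text(value):
    """Return value unchanged if it is str or bytes, otherwise raise a clear TypeError."""
    if not isinstance(value, (str, bytes)):
        raise TypeError(f"expected str or bytes, got {type(value).__name__}")
    return value


# Stand-in values mirroring the variables in the question.
sentences = ["Dogs are running.", "Cats sleep."]  # same shape as sent_tokenize's result
sentence = []                                      # the list passed by mistake

print(type(sentences))     # <class 'list'>
print(type(sentences[0]))  # <class 'str'>
print(type(sentence))      # <class 'list'>

checked = require_text(sentences[0])  # a single sentence passes the check

try:
    require_text(sentence)            # the list is caught before any tokenizer sees it
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

A guard like this fails at the call site with the offending type in the message, rather than several frames deep inside the tokenizer.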