Removing custom words from a Python list

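I am writing a function that performs custom word removal, then stemming (reducing each word to its root form), and then tf-idf.

The input to the function is a list. Custom word removal works when I try it on a single list, but when I put it inside the function I get an attribute error:

AttributeError: 'list' object has no attribute 'lower'

Here is my code: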

def tfidf_kw(K):    
    # Select docs in cluster K
    docs = np.array(mydata2)[km_r3.labels_==K]

    ps= PorterStemmer()
    stem_docs = []
    for doc in docs:
        keep_tokens = []
        
        for token in doc.split(' '):
            #custom stopword removal
            my_list = ['model','models','modeling','modelling','python','train','training','trains','trained','test','testing','tests','tested']
            
            token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

            stem_token=ps.stem(token)
            keep_tokens.append(stem_token)

        keep_tokens =' '.join(keep_tokens)
        stem_docs.append(keep_tokens)

        return(keep_tokens)

The rest of the code handles the tf-idf step, and that part works. This is the line I need help with, to understand what I am doing wrong:

token  = [sub_token for sub_token in list(doc) if sub_token not in my_list]

Here is the full error:

AttributeError  Traceback (most recent call last)
<ipython-input-154-528a540678b0> in <module>
     49     #return(sorted_df)
     50 
---> 51 tfidf_kw(0)

<ipython-input-154-528a540678b0> in tfidf_kw(K)
     20 
     21 
---> 22             stem_token=ps.stem(token)
     23             keep_tokens.append(stem_token)
     24 

~/opt/anaconda3/lib/python3.8/site-packages/nltk/stem/porter.py in stem(self,word)
    650 
    651     def stem(self,word):
--> 652         stem = word.lower()
    653 
    654         if self.mode == self.NLTK_EXTENSIONS and word in self.pool:

AttributeError: 'list' object has no attribute 'lower'

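Line 51 shows tfidf_kw(0), which is where I test the function with K=0.

Solution

The ps.stem method expects a single word (a string) as its argument, but you are passing it a list of strings.

Since you are already inside the for token in doc.split(' ') loop, the additional list comprehension [... for sub_token in list(doc) ...] does not seem to serve any purpose.

If your goal is to skip the tokens that appear in my_list, you probably want to write the for token in doc.split(' ') loop like this: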
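To see why the comprehension produces a list that ps.stem cannot handle, here is a small sketch that does not need NLTK. The sample doc string and the shortened my_list are made up for illustration. It shows that list(doc) breaks the string into single characters, and that calling a string method on the result fails the same way the traceback does.

```python
# Illustrative values only; not taken from the original data.
doc = "train model"
my_list = ["model", "train"]

# list(doc) splits the string into characters, not words.
chars = list(doc)

# No single character equals a whole stopword, so nothing is filtered out.
token = [sub_token for sub_token in chars if sub_token not in my_list]

# token is a list, so calling a str method on it raises AttributeError,
# which is what happens inside ps.stem when it calls word.lower().
try:
    token.lower()
except AttributeError as exc:
    error_message = str(exc)
```

So the comprehension neither removes the custom words nor leaves a string behind for the stemmer.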

for token in doc.split(' '):
    my_list = ['model','models','modeling','modelling','python','train','training','trains','trained','test','testing','tests','tested']

    if token in my_list:
        continue
    
    stem_token=ps.stem(token)
    keep_tokens.append(stem_token)

Here, if token is one of the words in my_list, the continue statement skips the rest of the current iteration and the loop moves on to the next token.
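Putting the pieces together, a minimal sketch of the whole cleaning step might look like the following. The names clean_and_stem, STOPWORDS, and the stem parameter are hypothetical and not part of the original post. The sketch also assumes you want every document processed: in the question's code the return statement appears to sit inside the for doc in docs loop, so the function would return after the first document.

```python
# Hypothetical helper; names are illustrative, not from the original post.
STOPWORDS = {
    "model", "models", "modeling", "modelling", "python",
    "train", "training", "trains", "trained",
    "test", "testing", "tests", "tested",
}


def clean_and_stem(docs, stopwords, stem):
    """Drop stopword tokens from each document, stem the rest, rejoin with spaces."""
    stem_docs = []
    for doc in docs:
        # split() with no argument also discards empty tokens from repeated spaces.
        keep_tokens = [stem(token) for token in doc.split() if token not in stopwords]
        stem_docs.append(" ".join(keep_tokens))
    # Return after the loop so that every document is processed.
    return stem_docs


# With NLTK installed, the stemmer could be passed as PorterStemmer().stem.
# str.lower is only a stand-in so that this sketch runs without NLTK.
result = clean_and_stem(["Fit the model on data", "python tests Pass"], STOPWORDS, str.lower)
```

Using a set for the stopwords and defining it once outside the loop avoids rebuilding the list on every token. The comparison is case-sensitive, as in the original code.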
