How to remove stop words from a DataFrame using tokenized data?

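I am trying to remove stop words from a DataFrame. It has a single column named text, and each row stores all the paragraphs of one article.

This is the first approach I tried: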

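stop_words = ['cat', 'dog', 'lion', 'fox']
# split each article into whitespace-separated tokens
df['text'] = df['text'].apply(lambda x: str.split(x))
# keep only the tokens that are not stop words (case-insensitive)
df['text'] = df['text'].apply(lambda x: [item for item in x if item.lower() not in stop_words])
# join the remaining tokens back into a single string
df['text'] = df['text'].apply(' '.join)

df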

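Oddly, this did not remove every stop word from df['text']. I could not work out why, so I switched to tokenization. After tokenizing, each paragraph is split across columns, one token per column.

Some rows of the resulting DataFrame have more than 50,000 columns. How can I remove the stop words from it?

Thanks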

Solution

You can try the following:

import pandas as pd

def remove_stop_words(sentence):
    stop_words = ['cat', 'dog', 'lion', 'fox']
    # split on whitespace and drop stop words, ignoring case
    word_list = sentence.split()
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence

data = {'text': ['the LION eat the cat', 'the dog is pretty', 'this Fox looks like a dog', 'there is no stop word here']}

df = pd.DataFrame(data)

# remove stop words from every row
df['text'] = df['text'].apply(remove_stop_words)

Result:

                         text
0                 the eat the
1               the is pretty
2           this looks like a
3  there is no stop word here
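
One possible reason the first attempt in the question left some stop words behind is that splitting on whitespace keeps punctuation attached to a word, so a token such as "cat," never equals "cat". The question does not show its data, so this is only a guess. The sketch below strips surrounding punctuation before comparing; the function name remove_stop_words_punct and the sample sentences are made up for illustration.

```python
import string

STOP_WORDS = {'cat', 'dog', 'lion', 'fox'}  # a set gives fast membership tests

def remove_stop_words_punct(sentence):
    # Hypothetical helper: compare each token with surrounding punctuation removed
    kept = []
    for token in sentence.split():
        core = token.strip(string.punctuation).lower()
        if core not in STOP_WORDS:
            kept.append(token)
    return ' '.join(kept)

samples = ['The cat, the dog and the fox.', 'No stop word here!']
cleaned = [remove_stop_words_punct(s) for s in samples]
print(cleaned)
```

It can be applied to the column in the same way as above, with df['text'].apply(remove_stop_words_punct).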

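Another solution is to use pandas Series.str.replace, but it can leave runs of consecutive spaces, and it also removes a stop word that appears inside a longer word:

data = {'text': ['the LION eat the cat', 'the dog is pretty', 'this Fox looks like a dog', 'there is no stop word here']}

df = pd.DataFrame(data)
stop_words = ['cat', 'dog', 'lion', 'fox']
for stop in stop_words:
    # case-insensitive removal; regex=True is explicit because the default differs between pandas versions
    df['text'] = df['text'].str.replace(stop, '', case=False, regex=True)

Result: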

                         text
0               the  eat the
1              the  is pretty
2         this  looks like a
3  there is no stop word here
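
The two caveats of the str.replace approach (matches inside longer words, leftover spaces) can be handled together. A minimal sketch, assuming whole-word removal is what is wanted: build one pattern with \b word boundaries, then collapse the whitespace. The sample rows are illustrative only.

```python
import re

import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']
# One alternation with word boundaries: matches "cat" but not "category"
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b'

df = pd.DataFrame({'text': ['the LION eat the cat', 'a category of dog food']})
df['text'] = (
    df['text']
    .str.replace(pattern, '', case=False, regex=True)  # drop whole-word matches
    .str.replace(r'\s+', ' ', regex=True)              # collapse repeated spaces
    .str.strip()                                       # trim leading/trailing space
)
print(df['text'].tolist())
```

A single combined pattern also avoids rescanning the column once per stop word, which may matter for long stop lists.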

Update: you can use a regex to find every word that starts with a stop word:

import re

import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']
# matches a whole token that starts with a stop word: lion, Lions, LIONsss, lionz
stop_prefix = re.compile(r'(?:' + '|'.join(map(re.escape, stop_words)) + r')\w*', re.IGNORECASE)

def remove_stop_words(sentence):
    word_list = sentence.split()
    # drop every token made of a stop word plus any trailing word characters
    clean_sentence = ' '.join([w for w in word_list if not stop_prefix.fullmatch(w)])
    return clean_sentence

data = {'text': ['the LIONsss eat the cats', 'the dogs is pretty', 'there is no stop word here', 'lionz is not the plural of lion']}

df = pd.DataFrame(data)
print(df)

# remove stop words and the words that start with them
df['text'] = df['text'].apply(remove_stop_words)
print(df)
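
The question also asks about the tokenized layout, where every token sits in its own column. A hedged sketch for that case, assuming the wide frame comes from something like str.split(expand=True) and pads shorter rows with missing values: build a boolean mask with isin and blank out the matches in one vectorized step instead of looping over tens of thousands of columns. The names tokens and is_stop are illustrative.

```python
import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']

df = pd.DataFrame({'text': ['the LION eat the cat', 'the dog is pretty']})
# One token per column; shorter rows are padded with missing values
tokens = df['text'].str.split(expand=True)

# True wherever the lower-cased token is a stop word
is_stop = tokens.apply(lambda col: col.str.lower()).isin(stop_words)
# Replace stop words with NaN while keeping the wide shape
tokens = tokens.mask(is_stop)

# Optional: rebuild the sentences from the remaining tokens
df['text'] = tokens.apply(lambda row: ' '.join(row.dropna()), axis=1)
print(df['text'].tolist())
```

If the wide shape is not needed afterwards, applying remove_stop_words to the original text column is likely simpler and uses less memory.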
