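How to remove stop words from a DataFrame using tokenized data?
I am trying to remove stop words from a DataFrame. It has a single column named text, and each row stores all the paragraphs of one article.
This is the first approach I tried: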
stop_words = ['cat','dog','lion','fox']
df['text'] = df['text'].apply(lambda x: str.split(x))
df['text'] = df['text'].apply(lambda x: [item for item in x if item.lower() not in stop_words])
x=0
for i in df['text']:
    df['text'][x] = ' '.join(i)
    x += 1
df
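Oddly, this does not remove every word in the stop word list from df['text'].
I could not work out why, so I switched to tokenization. After tokenizing, each paragraph is split across columns.
Some rows of the resulting DataFrame have more than 50,000 columns. How can I remove the stop words from it?
Thanks
Solution
You can try the following: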
import pandas as pd
def remove_stop_words(sentence):
    stop_words = ['cat','dog','lion','fox']
    word_list = sentence.split()
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence
data = {'text':['the LION eat the cat','the dog is pretty','this Fox looks like a dog','there is no stop word here']}
df = pd.DataFrame(data)
#remove stopword
df['text'] = df['text'].apply(remove_stop_words)
Result:
text
0 the eat the
1 the is pretty
2 this looks like a
3 there is no stop word here
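The question does not show the data, so the following is only a guess: one common reason a split-based filter misses some stop words is punctuation attached to the token ('cat,' or 'dog.' is not equal to 'cat' or 'dog'). A minimal sketch, assuming that is the cause; the function name remove_stop_words_ignoring_punct and the sample sentences are hypothetical and not part of the original answer:

```python
import string
import pandas as pd

# A set gives constant-time membership tests
STOP_WORDS = {'cat', 'dog', 'lion', 'fox'}

def remove_stop_words_ignoring_punct(sentence):
    """Drop tokens whose punctuation-stripped, lowercased form is a stop word."""
    kept = []
    for token in sentence.split():
        # Compare on a normalized copy, but keep the original token in the output
        core = token.strip(string.punctuation).lower()
        if core not in STOP_WORDS:
            kept.append(token)
    return ' '.join(kept)

# Hypothetical sample data with punctuation attached to stop words
df = pd.DataFrame({'text': ['The cat, the dog. And a bird!', 'no stop word here']})
df['text'] = df['text'].apply(remove_stop_words_ignoring_punct)
print(df)
```

Punctuation on the tokens that are kept is left untouched; only the comparison uses the stripped form.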
Another option is pandas' Series.str.replace, but it can leave runs of consecutive spaces:
data = {'text':['the LION eat the cat','there is no stop word here']}
df = pd.DataFrame(data)
stop_words = ['cat','fox']
for stop in stop_words:
    df['text'] = df['text'].str.replace(stop, '', case=False)
Result:
text
0 the LION eat the
1 there is no stop word here
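Besides the leftover spaces, a plain substring replace also removes stop words that appear inside longer words (for example, 'cat' inside 'category'). A minimal sketch of one way to avoid both, assuming whole-word matching is what is wanted; the sample sentences and the combined pattern are illustrative and not part of the original answer:

```python
import re
import pandas as pd

stop_words = ['cat', 'fox']
df = pd.DataFrame({'text': ['the LION eat the cat', 'a category of fox hunting']})

# \b anchors the match to whole words; re.escape guards against regex metacharacters
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b'

df['text'] = (
    df['text']
    .str.replace(pattern, '', case=False, regex=True)  # drop whole-word matches
    .str.replace(r'\s+', ' ', regex=True)              # collapse runs of whitespace
    .str.strip()                                       # trim leading/trailing spaces
)
print(df)
```

Here 'category' survives because the pattern requires a word boundary directly after 'cat'.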
Update: you can use a regex to find every word that starts with a stop word:
import pandas as pd
import re
def remove_stop_words(sentence):
    stop_words = ['cat','fox']
    # iterate over a copy, because the loop body appends to stop_words
    for stop_word in list(stop_words):
        # variant 1: the stop word followed by at least one more word character => cats, catsss
        stop_words.extend(re.findall(r'\b'+stop_word+r'[a-zA-Z]*\w+',sentence.lower()))
        # variant 2: the stop word plus more word characters, ending in a character that is not one of the stop word's own letters
        regex = r'\b(#\w*[^#\W])\b'.replace('#',stop_word)
        stop_words.extend(re.findall(regex,sentence.lower(),re.I))
    word_list = sentence.split()
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence
data = {'text':['the LIONsss eat the cats','the dogs is pretty','there is no stop word here','lionz is not the plural of lion']}
df = pd.DataFrame(data)
print(df)
#remove stopword
df['text'] = df['text'].apply(remove_stop_words)
print(df)
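The snippets above all work on a single text column, while the question also mentions a tokenized DataFrame with one token per column. A minimal sketch for that layout, assuming shorter rows are padded with None/NaN; the names tokens and join_kept_tokens and the sample data are hypothetical:

```python
import pandas as pd

stop_words = {'cat', 'dog', 'lion', 'fox'}

# Hypothetical tokenized frame: one token per column, short rows padded with None
tokens = pd.DataFrame([
    ['the', 'LION', 'eat', 'the', 'cat'],
    ['the', 'dog', 'is', 'pretty', None],
])

def join_kept_tokens(row):
    """Join the non-missing tokens of one row that are not stop words."""
    return ' '.join(
        t for t in row
        if isinstance(t, str) and t.lower() not in stop_words
    )

# axis=1 applies the function to each row and returns one string per row
text = tokens.apply(join_kept_tokens, axis=1)
print(text)
```

Joining the kept tokens back into one string per row avoids having to shift cells left after removal. With tens of thousands of columns a row-wise apply may be slow, so filtering before the text is split into columns is probably the simpler route.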