How to remove stop words from a pandas dataset
I am trying to remove stop words from a pandas dataset, where each row has a tokenized list of words. The word lists look like this:
['Uno',','dos','One','two','tres','quatro','Yes','Wooly','Bully','Watch','it','now','watch','Here','he','come','here','git','ya','Matty','told','Hattie','about','a','thing','she','saw','Had','big','horns','and','wooly','jaw','yes','drive','``','Let',"'s",'do',"n't",'take','no','chance','not','be','L-seven','learn','to','dance',"''",'Yeah','That','the','Get','you','someone','really','pull','wool','with','You','got','it']
I use the following code to do this:
ret = df['tokenized_lyric'].apply(lambda x: [item for item in x if item.lower() not in stops])
print(ret)
This gives me lists like the following:
0       [n, n, e, w, r, ...
2165    [, l, p, ...
It seems to have removed almost every character. How do I make it remove only the stop words I set?
Solution
Your list comprehension is iterating over the characters of a string. Instead, after lower(), split the string with split() and iterate over the resulting word tokens, like this:
print([i for i in 'hi there']) #iterating over the characters
print([i for i in 'hi there'.split()]) #iterating over the words
['h','i',' ','t','h','e','r','e']
['hi','there']
Try this lambda function:
s = 'Hello World And Underworld'
stops = ['and','or','the']
f = lambda x: [item for item in x.split() if item.lower() not in stops]
f(s)
['Hello', 'World', 'Underworld']
Applied to your code, it would be:
df['tokenized_lyric'].apply(lambda x: [item for item in x.split() if item.lower() not in stops])
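As a quick end-to-end sketch (the DataFrame and `stops` set below are made up for illustration; your actual `df` and stop word list are assumed to look similar):

```python
import pandas as pd

# Made-up stand-ins for the original data
stops = {'the', 'a', 'and', 'it', 'no'}
df = pd.DataFrame({'tokenized_lyric': ['Watch it now watch the wool',
                                       'No chance and no dance']})

# Split each lyric into words, then keep only the non-stop-words
ret = df['tokenized_lyric'].apply(
    lambda x: [item for item in x.split() if item.lower() not in stops])
print(ret.iloc[0])  # ['Watch', 'now', 'watch', 'wool']
print(ret.iloc[1])  # ['chance', 'dance']
```

Note that the comprehension lowercases only for the membership test, so the original casing of the kept words is preserved.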
Alternatively, using NLTK's built-in stop word list plus your own:
from nltk.corpus import stopwords
# stop words from the nltk library (requires nltk.download('stopwords') once)
nltk_stopwords = stopwords.words('english')
# user-defined stop words
custom_stopwords = ['hey', 'hello']
# complete list of stop words
complete_stopwords = nltk_stopwords + custom_stopwords
# remove stop words from each lyric
df['lyrics_clean'] = df['lyrics'].apply(lambda x: [word for word in x.split() if word not in complete_stopwords])
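To make the combining pattern runnable without the NLTK corpus download, the sketch below stubs the NLTK list with a few entries (the real `stopwords.words('english')` has around 180); the DataFrame is likewise made up for illustration:

```python
import pandas as pd

# Stand-in for stopwords.words('english'), which needs a one-time
# nltk.download('stopwords') on each machine
nltk_stopwords = ['the', 'a', 'and', 'it', 'he']
custom_stopwords = ['hey', 'hello']
# Concatenating the two lists gives one combined stop word list
complete_stopwords = nltk_stopwords + custom_stopwords

df = pd.DataFrame({'lyrics': ['hey watch it now', 'hello he said the word']})
df['lyrics_clean'] = df['lyrics'].apply(
    lambda x: [word for word in x.split() if word not in complete_stopwords])
print(df['lyrics_clean'][0])  # ['watch', 'now']
print(df['lyrics_clean'][1])  # ['said', 'word']
```

One caveat: this version compares words as-is, so capitalized words like 'Hey' would slip past the lowercase stop list; lowercase with `word.lower()` in the membership test if that matters for your data.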