How to count POS tags using PySpark and NLTK?
I have some text (or a large file) and need to count the number of POS tags using NLTK and PySpark. I could not find a way to import a text file, so I tried adding a short string instead, but that failed too. The counting step needs to use PySpark.
import nltk
import re

# textfile = sc.textFile('')  # could not get the file import to work
# or:
textstring = """This is just a bunch of words to use for this example. John gave ##them to me last night but Kim took them to work. Hi Stacy. ###'''URL:http://example.com'''"""

text = sc.parallelize([textstring])  # parallelize takes a Python list, not Scala's List(...)
TOKEN_RE = re.compile(r"\b[\w']+\b")
dropURL = text.filter(lambda x: "URL" not in x)
words = dropURL.flatMap(lambda line: line.split(" "))
nltkwords = words.flatMap(lambda w: nltk.tag.pos_tag(nltk.regexp_tokenize(w, TOKEN_RE)))
# word_counts = nltkwords.map(lambda x: (x, 1))
nltkwords.take(50)
Solution
Here is an example with your test string. I think you are just missing the step of splitting the string on spaces. Without it, the entire string is one RDD element, so the filter drops everything because "URL" appears somewhere in it.
import nltk
import re

# the POS tagger model may need a one-time download:
# nltk.download('averaged_perceptron_tagger')
textstring = """This is just a bunch of words to use for this example. John gave ##them to me last night but Kim took them to work. Hi Stacy. ###'''URL:http://example.com'''"""
TOKEN_RE = re.compile(r"\b[\w']+\b")
text = sc.parallelize(textstring.split(' '))
dropURL = text.filter(lambda x: "URL" not in x)
words = dropURL.flatMap(lambda dropURL: dropURL.split(" "))
nltkwords = words.flatMap(lambda words: nltk.tag.pos_tag(nltk.regexp_tokenize(words,TOKEN_RE)))
nltkwords.collect()
# [('This', 'DT'), ('is', 'VBZ'), ('just', 'RB'), ..., ('bunch', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('to', 'TO'), ..., ('John', 'NNP'), ('gave', 'VBD'), ('them', 'PRP'), ..., ('last', 'JJ'), ..., ('but', 'CC'), ..., ('Stacy', 'NN')]
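The effect of splitting first can be checked without Spark: filtering at line granularity throws away the whole line as soon as it contains "URL", while filtering after a split only drops the URL token. A minimal pure-Python sketch:

```python
line = "Hi Stacy. '''URL:http://example.com'''"

# filtering whole lines: the entire line is discarded
kept_lines = [x for x in [line] if "URL" not in x]
print(kept_lines)  # []

# splitting on spaces first: only the token containing "URL" is dropped
kept_words = [w for w in line.split(" ") if "URL" not in w]
print(kept_words)  # ['Hi', 'Stacy.']
```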
To count the occurrences of each POS tag, you can then do a reduceByKey:
word_counts = nltkwords.map(lambda x: (x[1],1)).reduceByKey(lambda x,y: x + y)
word_counts.collect()
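The map/reduceByKey pair is the distributed form of an ordinary tag count; on a small sample the same result can be obtained locally with collections.Counter. The (word, tag) pairs below are hypothetical stand-ins for the nltkwords RDD:

```python
from collections import Counter

# hypothetical (word, tag) pairs standing in for the nltkwords RDD
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

# equivalent of .map(lambda x: (x[1], 1)).reduceByKey(lambda x, y: x + y)
tag_counts = Counter(tag for _, tag in tagged)
print(dict(tag_counts))  # {'DT': 2, 'VBZ': 1, 'NN': 1}
```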
# [('NNS', 1), ('TO', 3), ..., ('NN', 7), ('VBD', 2), ..., ('NNP', 2)]
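For the original goal of reading from a file, sc.textFile(path) yields an RDD with one element per line, and the rest of the pipeline is unchanged. The per-line processing (minus the NLTK tagging step) can be sketched in plain Python; the sample file here is created with tempfile purely for illustration:

```python
import re
import tempfile

TOKEN_RE = re.compile(r"\b[\w']+\b")

# write a small sample file (stand-in for a real input path)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("John gave them to me.\nHi Stacy. URL:http://example.com\n")
    path = f.name

# sc.textFile(path) would give one element per line; here we read locally
with open(path) as f:
    lines = f.read().splitlines()

words = [w for line in lines for w in line.split(" ")]   # flatMap-style split
kept = [w for w in words if "URL" not in w]              # drop URL tokens
tokens = [t for w in kept for t in TOKEN_RE.findall(w)]  # regexp tokenize
print(tokens)
# ['John', 'gave', 'them', 'to', 'me', 'Hi', 'Stacy']
```

Each token list element would then go through nltk.tag.pos_tag exactly as in the Spark version above.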