对于所有其他NLTK语料库,调用corpus.raw()会从文件中生成原始文本.
例如:
>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'
>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '
我已经阅读了所有可以找到的文档,但似乎找不到明显的解释或方法来获取未标记的版本.这个语料库是否有标记而其他语句没有?
解决方法:
TL; DR
import nltk
nltk.download('brown')
nltk.download('nonbreaking_prefixes')
nltk.download('perluniprops')
from nltk.corpus import brown
from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True) for sent in brown.sents()]
for sent in brown_natural:
print(sent)
在龙
这是因为布朗语料库的“原始”版本被标记化并被标记,即语料库被标记为语料库的原始形式=)
您可以查看nltk_data目录中的各个文件:
$head -n10 nltk_data/corpora/brown/ca01
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.
The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
如果你想要语料库中的单词,你可以使用brown.words(),例如
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> ' '.join(brown.words()[:30])
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
>>> brown.fileids()[:10] # The first 10 fileids from brown.
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']
>>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
以及来自特定文件的句子:
>>> brown.sents('ca01')
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
打印出单个句子:
>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
... print(' '.join(sent))
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of Voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
试图使标记化的语料库松散而非常混乱,可能会或可能不会起作用,但您可以尝试使用MosesDetokenizer:
首先下载MosesDetokenizer所需的数据:
>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data] /home/ltan/nltk_data...
[nltk_data] Unzipping misc/perluniprops.zip.
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data] /home/ltan/nltk_data...
[nltk_data] Package nonbreaking_prefixes is already up-to-date!
True
然后初始化MosesDetokenizer:
>>> from nltk.tokenize.moses import MosesDetokenizer
>>> mdetok = MosesDetokenizer()
并使用MosesDetokenizer.detokenize():
>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
... # Join the words in sentences and convert the `` -> "
... # also convert '' -> " and ` -> '
... munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
... print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
"Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of Voters and the size of this city".
The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".
将褐色的每个句子转换为自然阅读文本:
from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True) for sent in brown.sents()]
[OUT]:
>>> for sent in brown_natural:
... print(sent)
... break
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 [email protected] 举报,一经查实,本站将立刻删除。