python - NLTK: how to prevent stemming of proper nouns

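I am trying to write a keyword extraction program using the Stanford POS tagger and NER. For keyword extraction I am only interested in proper nouns. Here is the basic approach: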

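1. Clean the data by removing everything except letters
2. Remove stop words
3. Stem each word
4. Determine the POS tag of each word
5. If the POS tag is a noun, feed the word to the NER
6. The NER then determines whether the word is a person, organization, or location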

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"

# split on runs of non-word characters
words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]

# stemming
pstem = PorterStemmer()

words = [pstem.stem(w) for w in words]

nounsWeWant = set(['NN', 'NNS', 'NNP', 'NNPS'])

finalWords = []

# both taggers need the Stanford jar and model files to be discoverable by NLTK
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# lowercase every word the POS tagger does not consider a noun
for w in words:
    if stp.tag([w.lower()])[0][1] not in nounsWeWant:
        finalWords.append(w.lower())
    else:
        finalWords.append(w)

finalString = " ".join(finalWords)
print(finalString)

tagged = stn.tag(finalWords)
print(tagged)

This gives me:

Jack Frost work Boe Compani manag aircraft crew London
[(u'Jack', u'PERSON'), (u'Frost', u'PERSON'), (u'work', u'O'), (u'Boe', u'O'), (u'Compani', u'O'), (u'manag', u'O'), (u'aircraft', u'O'), (u'crew', u'O'), (u'London', u'LOCATION')]

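Obviously, I do not want Boeing to be stemmed, nor Company. I need to stem the words because my input may contain terms like Performing. I have seen that a word like NING gets picked up by the NER as a proper noun and can therefore be classified as an organization. So first I stem all the words and convert them to lowercase. Then I check whether the POS tag of the word is a noun. If it is, I keep the word as is. If not, I convert the word to lowercase and add it to the final word list that is passed to the NER.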

Any ideas on how to avoid stemming proper nouns?

Best answer

Use the full Stanford CoreNLP pipeline to handle your NLP toolchain. Avoid using your own tokenizer, cleaner, POS tagger, etc.; they do not play well with the NER tool.

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
unzip stanford-corenlp-full-2015-12-09.zip
cd stanford-corenlp-full-2015-12-09
echo "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London" > test.txt
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt
cat test.txt.out

[OUT]:

            <lemma>Jack</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>4</CharacterOffsetEnd>
            <lemma>Frost</lemma>
            <CharacterOffsetBegin>5</CharacterOffsetBegin>
            <CharacterOffsetEnd>10</CharacterOffsetEnd>
            <lemma>work</lemma>
            <CharacterOffsetBegin>11</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <lemma>for</lemma>
            <CharacterOffsetBegin>17</CharacterOffsetBegin>
            <CharacterOffsetEnd>20</CharacterOffsetEnd>
            <lemma>Boeing</lemma>
            <CharacterOffsetBegin>21</CharacterOffsetBegin>
            <CharacterOffsetEnd>27</CharacterOffsetEnd>
            <lemma>Company</lemma>
            <CharacterOffsetBegin>28</CharacterOffsetBegin>
            <CharacterOffsetEnd>35</CharacterOffsetEnd>
            <lemma>.</lemma>
            <CharacterOffsetBegin>35</CharacterOffsetBegin>
            <CharacterOffsetEnd>36</CharacterOffsetEnd>
            <governor idx="0">ROOT</governor>
            <governor idx="2">Frost</governor>
            <governor idx="3">works</governor>
            <governor idx="6">Company</governor>
            <governor idx="6">Company</governor>
            <governor idx="3">works</governor>
            <governor idx="3">works</governor>
            <governor idx="0">ROOT</governor>
            <governor idx="2">Frost</governor>
            <governor idx="3">works</governor>
            <governor idx="6">Company</governor>
            <governor idx="6">Company</governor>
            <governor idx="3">works</governor>
            <governor idx="3">works</governor>
            <governor idx="0">ROOT</governor>
            <governor idx="2">Frost</governor>
            <governor idx="3">works</governor>
            <governor idx="6">Company</governor>
            <governor idx="6">Company</governor>
            <governor idx="3">works</governor>
            <governor idx="3">works</governor>
            <lemma>he</lemma>
            <CharacterOffsetBegin>37</CharacterOffsetBegin>
            <CharacterOffsetEnd>39</CharacterOffsetEnd>
            <lemma>manage</lemma>
            <CharacterOffsetBegin>40</CharacterOffsetBegin>
            <CharacterOffsetEnd>47</CharacterOffsetEnd>
            <lemma>5</lemma>
            <CharacterOffsetBegin>48</CharacterOffsetBegin>
            <CharacterOffsetEnd>49</CharacterOffsetEnd>
            <NormalizedNER>5.0</NormalizedNER>
            <lemma>aircraft</lemma>
            <CharacterOffsetBegin>50</CharacterOffsetBegin>
            <CharacterOffsetEnd>58</CharacterOffsetEnd>
            <lemma>and</lemma>
            <CharacterOffsetBegin>59</CharacterOffsetBegin>
            <CharacterOffsetEnd>62</CharacterOffsetEnd>
            <lemma>they</lemma>
            <CharacterOffsetBegin>63</CharacterOffsetBegin>
            <CharacterOffsetEnd>68</CharacterOffsetEnd>
            <lemma>crew</lemma>
            <CharacterOffsetBegin>69</CharacterOffsetBegin>
            <CharacterOffsetEnd>73</CharacterOffsetEnd>
            <lemma>in</lemma>
            <CharacterOffsetBegin>74</CharacterOffsetBegin>
            <CharacterOffsetEnd>76</CharacterOffsetEnd>
            <lemma>London</lemma>
            <CharacterOffsetBegin>77</CharacterOffsetBegin>
            <CharacterOffsetEnd>83</CharacterOffsetEnd>
            <governor idx="0">ROOT</governor>
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <dep type="dobj">
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="7">crew</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="9">London</governor>
            <governor idx="7">crew</governor>
            <governor idx="0">ROOT</governor>
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <dep type="dobj">
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="7">crew</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="9">London</governor>
            <governor idx="7">crew</governor>
            <governor idx="0">ROOT</governor>
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <dep type="dobj">
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="7">crew</governor>
            <dep type="dobj" extra="true">
            <governor idx="2">manages</governor>
            <governor idx="4">aircraft</governor>
            <governor idx="9">London</governor>
            <governor idx="7">crew</governor>


Or to get JSON output:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt -outputFormat json
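
Once the JSON output exists, the named entities can be pulled out with a few lines of Python. The following is only a minimal sketch, not part of the original answer: it assumes the JSON has a `sentences` list whose items hold `tokens` with `word` and `ner` keys, and the helper names `extract_entities` and `load_entities` are made up for illustration. Consecutive tokens that share a non-`O` tag are merged, so `Jack Frost` and `Boeing Company` come out whole and unstemmed.

```python
import json


def extract_entities(doc, wanted=("PERSON", "ORGANIZATION", "LOCATION")):
    """Group consecutive tokens that share an NER tag into (text, tag) pairs."""
    entities = []
    for sentence in doc["sentences"]:
        current_words, current_tag = [], None
        for token in sentence["tokens"]:
            tag = token.get("ner", "O")
            if tag == current_tag and tag in wanted:
                # Same entity continues, e.g. "Jack" followed by "Frost".
                current_words.append(token["word"])
                continue
            if current_words:
                entities.append((" ".join(current_words), current_tag))
            # Start collecting again only for the tags we care about.
            current_words = [token["word"]] if tag in wanted else []
            current_tag = tag
        if current_words:
            entities.append((" ".join(current_words), current_tag))
    return entities


def load_entities(path):
    """Read a CoreNLP JSON file (the path is supplied by the caller) and extract entities."""
    with open(path, encoding="utf-8") as handle:
        return extract_entities(json.load(handle))


# Hand-written sample in the assumed shape, trimmed to the two keys used here.
sample = {"sentences": [{"tokens": [
    {"word": "Jack", "ner": "PERSON"},
    {"word": "Frost", "ner": "PERSON"},
    {"word": "works", "ner": "O"},
    {"word": "for", "ner": "O"},
    {"word": "Boeing", "ner": "ORGANIZATION"},
    {"word": "Company", "ner": "ORGANIZATION"},
    {"word": ".", "ner": "O"},
]}]}

entities = extract_entities(sample)
print(entities)
```

Because the grouping happens on the tagger's own tokens, no separate stop-word removal or stemming step is needed before the NER sees the text.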

If you really need a Python wrapper, see https://github.com/smilli/py-corenlp

$ cd stanford-corenlp-full-2015-12-09
$ export CLASSPATH=protobuf.jar:joda-time.jar:jollyday.jar:xom-1.2.10.jar:stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:slf4j-api.jar
$ java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer &
$ cd
$ git clone https://github.com/smilli/py-corenlp.git
$ cd py-corenlp
$ python
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London")
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'json'})
>>> output
{u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'Jack', u'lemma': u'Jack', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'Jack', u'ner': u'PERSON', u'before': u''}, {u'index': 2, u'word': u'Frost', u'lemma': u'Frost', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 10, u'characterOffsetBegin': 5, u'originalText': u'Frost', u'ner': u'PERSON', u'before': u' '}, {u'index': 3, u'word': u'works', u'lemma': u'work', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 16, u'characterOffsetBegin': 11, u'originalText': u'works', u'ner': u'O', u'before': u' '}, {u'index': 4, u'word': u'for', u'lemma': u'for', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 20, u'characterOffsetBegin': 17, u'originalText': u'for', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'Boeing', u'lemma': u'Boeing', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 27, u'characterOffsetBegin': 21, u'originalText': u'Boeing', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 6, u'word': u'Company', u'lemma': u'Company', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 35, u'characterOffsetBegin': 28, u'originalText': u'Company', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 7, u'word': u'.', u'lemma': u'.', u'after': u' ', u'pos': u'.', u'characterOffsetEnd': 36, u'characterOffsetBegin': 35, u'originalText': u'.', u'ner': u'O', u'before': u''}]}, {u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 1, u'tokens': [{u'index': 1, u'word': u'He', u'lemma': u'he', u'after': u' ', u'pos': u'PRP', u'characterOffsetEnd': 39, u'characterOffsetBegin': 37, u'originalText': u'He', u'ner': u'O', u'before': u' '}, {u'index': 2, u'word': u'manages', u'lemma': u'manage', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 47, u'characterOffsetBegin': 40, u'originalText': u'manages', u'ner': u'O', u'before': u' '}, {u'index': 3, u'word': u'5', u'lemma': u'5', u'normalizedNER': u'5.0', u'after': u' ', u'pos': u'CD', u'characterOffsetEnd': 49, u'characterOffsetBegin': 48, u'originalText': u'5', u'ner': u'NUMBER', u'before': u' '}, {u'index': 4, u'word': u'aircraft', u'lemma': u'aircraft', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 58, u'characterOffsetBegin': 50, u'originalText': u'aircraft', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'and', u'lemma': u'and', u'after': u' ', u'pos': u'CC', u'characterOffsetEnd': 62, u'characterOffsetBegin': 59, u'originalText': u'and', u'ner': u'O', u'before': u' '}, {u'index': 6, u'word': u'their', u'lemma': u'they', u'after': u' ', u'pos': u'PRP$', u'characterOffsetEnd': 68, u'characterOffsetBegin': 63, u'originalText': u'their', u'ner': u'O', u'before': u' '}, {u'index': 7, u'word': u'crew', u'lemma': u'crew', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 73, u'characterOffsetBegin': 69, u'originalText': u'crew', u'ner': u'O', u'before': u' '}, {u'index': 8, u'word': u'in', u'lemma': u'in', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 76, u'characterOffsetBegin': 74, u'originalText': u'in', u'ner': u'O', u'before': u' '}, {u'index': 9, u'word': u'London', u'lemma': u'London', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 83, u'characterOffsetBegin': 77, u'originalText': u'London', u'ner': u'LOCATION', u'before': u' '}]}]}
>>> annotated_sent0 = output['sentences'][0]
>>> for token in annotated_sent0['tokens']:
...     print token['word'],token['lemma'],token['pos'],token['ner']
... 
Jack Jack NNP PERSON
Frost Frost NNP PERSON
works work VBZ O
for for IN O
Boeing Boeing NNP ORGANIZATION
Company Company NNP ORGANIZATION
. . . O

This is probably the output you want:

>>> " ".join(token['lemma'] for token in annotated_sent0['tokens'])
Jack Frost work for Boeing Company
>>> " ".join(token['word'] for token in annotated_sent0['tokens'])
Jack Frost works for Boeing Company
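
If the aim is to normalize ordinary words while leaving proper nouns untouched, the `pos` tag on each token can decide which field to use. A minimal sketch, assuming each token dictionary carries `word`, `lemma` and `pos` keys as in the server output above; the function name `normalize_tokens` and the choice to lowercase non-proper-noun lemmas are illustrative assumptions, not something the answer prescribes.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def normalize_tokens(tokens):
    """Return proper nouns verbatim and every other token as its lowercased lemma."""
    result = []
    for token in tokens:
        if token["pos"] in PROPER_NOUN_TAGS:
            # Proper nouns keep their surface form: "Boeing" stays "Boeing".
            result.append(token["word"])
        else:
            result.append(token["lemma"].lower())
    return result


# Tokens of the first example sentence, trimmed to the three keys used here.
tokens = [
    {"word": "Jack", "lemma": "Jack", "pos": "NNP"},
    {"word": "Frost", "lemma": "Frost", "pos": "NNP"},
    {"word": "works", "lemma": "work", "pos": "VBZ"},
    {"word": "for", "lemma": "for", "pos": "IN"},
    {"word": "Boeing", "lemma": "Boeing", "pos": "NNP"},
    {"word": "Company", "lemma": "Company", "pos": "NNP"},
    {"word": ".", "lemma": ".", "pos": "."},
]

normalized = normalize_tokens(tokens)
print(" ".join(normalized))
```

Using the lemma rather than a Porter stem means `works` becomes `work` while `Boeing` and `Company` are never cut down to `Boe` and `Compani`.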

If you want a wrapper that ships with NLTK, you will have to wait a while until this issue is resolved ;P
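
In the meantime, the CoreNLP server started earlier can also be queried without any wrapper. The sketch below uses only the Python 3 standard library (the session above is Python 2, so this is a deliberate departure) and assumes the server listens on `http://localhost:9000` and accepts the text as the POST body with a `properties` query parameter holding JSON. The function names `build_request` and `annotate` are hypothetical helpers, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(text, annotators, server_url="http://localhost:9000"):
    """Build the POST request: properties in the query string, text in the body."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return Request(server_url + "/?" + query, data=text.encode("utf-8"))


def annotate(text, annotators="tokenize,ssplit,pos,lemma,ner"):
    """Send text to a running CoreNLP server and return the parsed JSON reply."""
    with urlopen(build_request(text, annotators)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need a running server, so it can be inspected offline.
request = build_request("Jack Frost works for Boeing Company.", "tokenize,ssplit,pos,lemma,ner")
print(request.get_method(), request.full_url)
```

With the server running, `annotate(text)['sentences'][0]['tokens']` should yield the same kind of token dictionaries as the wrapper session above.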

