How to find the level of each node in the WordNet hierarchy
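I want to find the "level" of every node in the WordNet noun dataset. According to {{3}}, "entity.n.01" is the root node, so it should be enough to find the distance from each node to this root. However, when I run the code below, I get some unexpected results. For example: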
# import data
import pandas
from nltk.corpus import wordnet as wn
from tqdm import tqdm

try:
    wn.all_synsets
except LookupError as e:
    import nltk
    nltk.download('wordnet')

# make sure each edge is included only once
edges = set()
for synset in tqdm(wn.all_synsets(pos='n')):
    # write the transitive closure of all hypernyms of a synset to file
    for hyper in synset.closure(lambda s: s.hypernyms()):
        edges.add((synset.name(), hyper.name()))

    # also write transitive closure for all instances of a synset
    for instance in synset.instance_hyponyms():
        for hyper in instance.closure(lambda s: s.instance_hypernyms()):
            edges.add((instance.name(), hyper.name()))
            for h in hyper.closure(lambda s: s.hypernyms()):
                edges.add((instance.name(), h.name()))

# dataframe of nouns
nouns = pandas.DataFrame(list(edges), columns=['id1', 'id2'])

# collect edges, labelling each synset with its level
child_list = []
parent_list = []
for row in range(nouns.shape[0]):
    if row > 0 and row % 100000 == 0:
        print(f'row {row}/{nouns.shape[0]}')
    w1 = nouns.iloc[row]['id1']
    w2 = nouns.iloc[row]['id2']
    ss_w1 = wn.synset(w1)
    ss_w2 = wn.synset(w2)
    # level = number of synsets on the shortest hypernym path (root counts as 1)
    level_w1 = min(len(path) for path in ss_w1.hypernym_paths())
    level_w2 = min(len(path) for path in ss_w2.hypernym_paths())
    child_list.append(f'{w1}.L{level_w1}')
    parent_list.append(f'{w2}.L{level_w2}')

# assemble the labelled edges (this step is implied by the output described below)
edge_list = pandas.DataFrame({'child': child_list, 'parent': parent_list})
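This gives me a dataframe that looks like this:
I appended the corresponding level to each noun in the edge set, in the form "L[level]". But something is off. If we look at the last entry (row 743085), the word "bulimia.n.02" is assigned level 9, i.e. it is nine steps away from the root node, while its parent is apparently "entity.n.01", which is the root node itself.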
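The observation can be reproduced on a small scale without WordNet. The sketch below uses a hypothetical five-node chain (the node names and their ordering are made up for illustration and are not the real WordNet path of "bulimia.n.02"). It builds a transitive-closure edge list the same way the code above does and labels every node with its level, counted as the number of nodes on the shortest path from the root.

```python
from collections import deque

# Hypothetical toy hierarchy: child -> list of direct parents.
PARENTS = {
    "entity": [],
    "abstraction": ["entity"],
    "state": ["abstraction"],
    "disorder": ["state"],
    "bulimia": ["disorder"],
}


def ancestors(node, parents):
    """Return every ancestor of node (transitive closure of 'parent')."""
    seen = []
    queue = deque(parents[node])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(parents[current])
    return seen


def level(node, parents):
    """Number of nodes on the shortest path from the root down to node."""
    if not parents[node]:
        return 1
    return 1 + min(level(p, parents) for p in parents[node])


# One edge per (node, ancestor) pair, as in a transitive closure.
closure_edges = [(c, a) for c in PARENTS for a in ancestors(c, PARENTS)]

# Attach the level to both ends of every edge.
labelled = [
    (f"{c}.L{level(c, PARENTS)}", f"{a}.L{level(a, PARENTS)}")
    for c, a in closure_edges
]

for child, ancestor in labelled:
    print(child, "->", ancestor)
```

In this toy data the deepest node appears next to the root in one row and next to its direct parent in another, because the second column of a closure edge list holds any ancestor, not only the immediate one.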
What am I doing wrong?
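One possible reading, offered as a hedged sketch rather than a confirmed diagnosis: the level computation may be fine, and the surprise may come from `id2` holding every ancestor produced by `closure(...)` rather than only the direct parent. If direct parent/child pairs are what is wanted, something like the following could be used instead. The function name `direct_edges_with_levels` and the `StubSynset` class are hypothetical; the stub only mimics the `name()`, `hypernyms()`, `instance_hypernyms()` and `hypernym_paths()` methods of NLTK `Synset` objects so that the example runs without the corpus.

```python
def direct_edges_with_levels(synsets):
    """Collect (child, parent) pairs using only direct hypernyms.

    Each synset is assumed to expose name(), hypernyms(),
    instance_hypernyms() and hypernym_paths(), as NLTK Synset objects do.
    """
    def label(s):
        # Level = number of synsets on the shortest path from the root.
        lvl = min(len(path) for path in s.hypernym_paths())
        return f"{s.name()}.L{lvl}"

    edges = set()
    for synset in synsets:
        # Direct parents only; no transitive closure.
        for parent in synset.hypernyms() + synset.instance_hypernyms():
            edges.add((label(synset), label(parent)))
    return sorted(edges)


class StubSynset:
    """Stand-in for an NLTK Synset so the sketch runs without the corpus."""

    def __init__(self, name, parents=()):
        self._name = name
        self._parents = list(parents)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._parents

    def instance_hypernyms(self):
        return []

    def hypernym_paths(self):
        if not self._parents:
            return [[self]]
        return [path + [self]
                for p in self._parents
                for path in p.hypernym_paths()]


# Hypothetical three-node chain used only for the demonstration.
root = StubSynset("entity.n.01")
mid = StubSynset("middle.n.01", [root])
leaf = StubSynset("leaf.n.01", [mid])

demo_edges = direct_edges_with_levels([root, mid, leaf])
for child, parent in demo_edges:
    print(child, "->", parent)
```

With NLTK and the WordNet corpus available, the same function could be called as `direct_edges_with_levels(wn.all_synsets(pos='n'))`. Note that a synset with several hypernyms takes its level from the shortest path, so its level is not necessarily one more than the level of every direct parent.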