Pinyin Named Entity Recognition

How can I do named entity recognition on pinyin?

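I am trying to do named entity recognition, that is, to extract people, places, and so on, from pinyin (the romanization of Chinese characters).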

For example (from Wikipedia):

"Jiang Zemin, Li Peng and Zhu Rongji led the nation in the 1990s. Under their administration, China's economic performance pulled an estimated 150 million peasants out of poverty and sustained an average annual gross domestic product growth rate of 11.2%.[125][better source needed][126][better source needed] The country joined the World Trade Organization in 2001, and maintained its high rate of economic growth under Hu Jintao and Wen Jiabao's leadership in the 2000s. However, the growth also severely impacted the country's resources and environment,[127][128] and caused major social displacement.[129][130]
Chinese Communist Party general secretary Xi Jinping has ruled since 2012 and has pursued large-scale efforts to reform China's economy [131][132] (which has suffered from structural instabilities and slowing growth),[133][134][135] and has also reformed the one-child policy and prison system,[136] as well as instituting a vast anti-corruption crackdown.[137] In 2013, China initiated the Belt and Road Initiative, a global infrastructure investment project.[138] The COVID-19 pandemic broke out in Wuhan, Hubei in 2019.[139][140]"

I would like to extract entities from the text above, such as:

Jiang Zemin
Li Peng
Zhu Rongji
Hu Jintao
Wuhan
Hubei
etc...

NER tools for Chinese characters are sophisticated, but I don't know of any method that works on pinyin.

My current plan is to try all one- and two-syllable combinations of the 1,300+ Chinese syllables, as follows:

import numpy as np
import pandas as pd

# Load the syllable table: one "pinyin character" pair per line
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]

# Strip the tone digits; the text I'm searching has no tones, which makes this harder
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)
s = data['pinyin'].drop_duplicates().to_numpy()

# All two-syllable concatenations, e.g. "ze" + "min" -> "zemin"
combos = pd.Series(np.add.outer(s, s).ravel())

# Combine single syllables and two-syllable combinations into one giant list
all_pinyin = pd.Series(s.tolist() + combos.tolist())

Then I plan to use something like .isin() to compare the text data against the pinyin list.
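A minimal sketch of what that comparison might look like, using a plain Python set lookup in place of .isin(). It assumes a toy subset of toneless syllables standing in for chinese_tones.txt, and the names SYLLABLES, ALL_PINYIN and extract_pinyin_candidates are hypothetical. It only considers capitalized tokens and merges matches separated by a single space, so that "Jiang" and "Zemin" come out as one candidate.

```python
import re

# Hypothetical toy subset of toneless syllables (the real list would come from chinese_tones.txt)
SYLLABLES = {"jiang", "ze", "min", "li", "peng", "zhu", "rong", "ji", "wu", "han", "hu", "bei"}

# Single syllables plus every two-syllable concatenation, mirroring all_pinyin
ALL_PINYIN = SYLLABLES | {a + b for a in SYLLABLES for b in SYLLABLES}


def extract_pinyin_candidates(text, vocab=ALL_PINYIN):
    """Return runs of adjacent capitalized tokens whose lowercase form is in vocab."""
    candidates = []
    current = []     # tokens of the run currently being built
    prev_end = None  # end offset of the previous matching token
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if m.group().lower() not in vocab:
            continue
        # Extend the run only if exactly one space separates this token from the last match
        if current and text[prev_end:m.start()] == " ":
            current.append(m.group())
        else:
            if current:
                candidates.append(" ".join(current))
            current = [m.group()]
        prev_end = m.end()
    if current:
        candidates.append(" ".join(current))
    return candidates


sample = ("Jiang Zemin, Li Peng and Zhu Rongji led the nation in the 1990s. "
          "The pandemic broke out in Wuhan, Hubei in 2019.")
print(extract_pinyin_candidates(sample))
```

With the full syllable table this would also flag ordinary English words that happen to split into pinyin syllables (for example "China" as "chi" + "na"), so the output is a list of candidates and not a list of confirmed entities.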

Does anyone know a better way to extract pinyin entities?
