What is the simplest way to convert character offsets to word offsets?
I have a Python string and a substring of selected text. For example, the string could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets for this particular string, namely 4-14, since these are the character-level indices of the selected text.
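For reference, slicing the string with those character offsets shows exactly what is selected:
stringy = "the bee buzzed loudly"
print(stringy[4:14])  # prints 'bee buzzed'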
What is the simplest way to convert these to word-level indices (i.e. 1-2), since the second and third words are being selected? I have many strings labeled like this, and I want to convert the indices simply and efficiently. The data is currently stored in dictionaries like this:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form:
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thanks!
Solution
Here is a simple list indexing approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ")  # get the selected words from the string using the character indices
stringLst = string.split(" ")    # split the full string into words
dictionary = {"string": "", "start_word": 0, "end_word": 0}

# process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0])  # index of the first selected word
dictionary["end_word"] = stringLst.index(words[-1])   # index of the last selected word
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Note that this assumes you want each word's first occurrence in the string, since list.index() returns the first match.
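If a selected word can also appear earlier in the string, a small sketch that derives the word indices directly from the character offsets avoids relying on list.index(). The helper name char_to_word_offsets is mine, and it assumes whitespace-separated words with start_char falling at the start of a word:
def char_to_word_offsets(d):
    # words that end before the selection starts -> index of the first selected word
    start_word = len(d["string"][:d["start_char"]].split())
    # words inside the selection -> index of the last selected word
    end_word = start_word + len(d["string"][d["start_char"]:d["end_char"]].split()) - 1
    return {"string": d["string"], "start_word": start_word, "end_word": end_word}

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
print(char_to_word_offsets(data))
# {'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}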
This looks like a tokenization problem. My solution would be to use a span tokenizer and then search the spans for the substring span. So, using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b,sub_e = 4,14 # substring begin and end
[i for i,(b,e) in enumerate(tokenizer.span_tokenize(stringy))
if b >= sub_b and e <= sub_e]
This is a bit convoluted, though.
tokenizer.span_tokenize(stringy)
returns the span of each token/word it identifies.
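As a rough illustration (the tokenizer choice and the wrapper function below are my own additions, not part of the answer above), the spans look like this and can be packaged into the word indices the question asks for:
import nltk

tokenizer = nltk.tokenize.WhitespaceTokenizer()
print(list(tokenizer.span_tokenize("the bee buzzed loudly")))
# [(0, 3), (4, 7), (8, 14), (15, 21)]

def char_span_to_word_span(text, sub_b, sub_e):
    # indices of all tokens whose span lies inside the character span
    hits = [i for i, (b, e) in enumerate(tokenizer.span_tokenize(text))
            if b >= sub_b and e <= sub_e]
    return hits[0], hits[-1]  # assumes at least one token falls inside the span

print(char_span_to_word_span("the bee buzzed loudly", 4, 14))  # (1, 2)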
Try this code:
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char  # set the key named by the first extra argument
    dic[arg[1]] = end_char    # set the key named by the second extra argument

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Initial dictionary:
data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
Input:
Please enter your start character: 1
Please enter your end character: 2
Output dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}