What is the simplest way to convert character offsets to word offsets?
I have a Python string and a substring of selected text. For example, the string could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets for this particular string, namely 4-14, since these are the character-level indices of the selected text.
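For reference, slicing the string with those character offsets shows exactly what is selected:
stringy = "the bee buzzed loudly"
print(stringy[4:14])  # prints 'bee buzzed'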
What is the simplest way to convert these to word-level indices (i.e. 1-2), since the second and third words are being selected? I have many strings labeled like this, and I want to convert the indices simply and efficiently. The data is currently stored in dictionaries like this:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form:
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thanks!
Solution
Here is a simple list indexing approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ")  # get the selected words from the string using the character indices
stringLst = string.split(" ")    # split the full string into words
dictionary = {"string": "", "start_word": 0, "end_word": 0}

# process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0])  # index of the first selected word
dictionary["end_word"] = stringLst.index(words[-1])   # index of the last selected word
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Note that this assumes you want each word's first occurrence in the string, since list.index() returns the first match.
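If a selected word can also appear earlier in the string, a small sketch that derives the word indices directly from the character offsets avoids relying on list.index(). The helper name char_to_word_offsets is mine, and it assumes whitespace-separated words with start_char falling at the start of a word:
def char_to_word_offsets(d):
    # words that end before the selection starts -> index of the first selected word
    start_word = len(d["string"][:d["start_char"]].split())
    # words inside the selection -> index of the last selected word
    end_word = start_word + len(d["string"][d["start_char"]:d["end_char"]].split()) - 1
    return {"string": d["string"], "start_word": start_word, "end_word": end_word}

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
print(char_to_word_offsets(data))
# {'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}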
This looks like a tokenization problem. My solution would be to use a span tokenizer and then search the spans for the substring span. So, using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b,sub_e = 4,14 # substring begin and end
[i for i,(b,e) in enumerate(tokenizer.span_tokenize(stringy))
if b >= sub_b and e <= sub_e]
This is a bit convoluted, though.
tokenizer.span_tokenize(stringy)
returns the span of each token/word it identifies.
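As a rough illustration (the tokenizer choice and the wrapper function below are my own additions, not part of the answer above), the spans look like this and can be packaged into the word indices the question asks for:
import nltk

tokenizer = nltk.tokenize.WhitespaceTokenizer()
print(list(tokenizer.span_tokenize("the bee buzzed loudly")))
# [(0, 3), (4, 7), (8, 14), (15, 21)]

def char_span_to_word_span(text, sub_b, sub_e):
    # indices of all tokens whose span lies inside the character span
    hits = [i for i, (b, e) in enumerate(tokenizer.span_tokenize(text))
            if b >= sub_b and e <= sub_e]
    return hits[0], hits[-1]  # assumes at least one token falls inside the span

print(char_span_to_word_span("the bee buzzed loudly", 4, 14))  # (1, 2)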
Try this code:
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char  # set the key named by the first extra argument
    dic[arg[1]] = end_char    # set the key named by the second extra argument

data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Initial dictionary:
data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
Input:
Please enter your start character: 1
Please enter your end character: 2
Output dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}