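How to measure the distance between strings by how similar they sound
Is there a quantitative measure of the similarity between two words based on how they sound or are pronounced, analogous to Levenshtein distance?
I know soundex gives similar-sounding words the same id, but as far as I can tell it is not a quantitative description of the difference between words.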
from jellyfish import soundex
print(soundex("two"))
print(soundex("to"))
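Solution
You can combine a phonetic encoding with a string comparison algorithm. In fact, jellyfish provides both.
Setting up the libraries and example data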
from jellyfish import soundex,metaphone,nysiis,match_rating_codex,\
levenshtein_distance,damerau_levenshtein_distance,hamming_distance,\
jaro_similarity
from itertools import groupby
import pandas as pd
import numpy as np
dataList = ['two','too','to','fourth','forth','dessert','desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']
# All four encoders are listed so the report gets one column per encoding.
sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]
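Comparing the different phonetic encodings
# One row per word, plus one column per phonetic encoding.
report = pd.DataFrame({'word': dataList})
for encoder in sounds_encoding_methods:
    report[encoder.__name__] = report['word'].apply(encoder)
# Show the words as the index so each row reads as word followed by its codes.
print(report.set_index('word'))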
           soundex  metaphone  nysiis   match_rating_codex
word
two        T000     TW         TW       TW
too        T000     T          T        T
to         T000     T          T        T
fourth     F630     FR0        FART     FRTH
forth      F630     FR0        FART     FRTH
dessert    D263     TSRT       DASAD    DSRT
desert     D263     TSRT       DASAD    DSRT
Byrne      B650     BRN        BYRN     BYRN
Boern      B650     BRN        BARN     BRN
Smith      S530     SM0        SNAT     SMTH
Smyth      S530     SM0        SNYT     SMYTH
Catherine  C365     K0RN       CATARAN  CTHRN
Kathryn    K365     K0RN       CATRYN   KTHRYN
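You can see that the phonetic encodings do a very good job of making the words comparable. The encodings do not always agree, so check how each one handles your own cases and pick accordingly.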
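One way to compare encodings on your own data is to group the words that share a code. The sketch below is only an illustration and not part of the original answer: it hard-codes the soundex and metaphone codes for four words from the table above instead of calling jellyfish, and `group_by_code` is a hypothetical helper name.

```python
from collections import defaultdict

# Codes copied from the soundex and metaphone columns of the table above.
soundex_codes = {'Smith': 'S530', 'Smyth': 'S530',
                 'Catherine': 'C365', 'Kathryn': 'K365'}
metaphone_codes = {'Smith': 'SM0', 'Smyth': 'SM0',
                   'Catherine': 'K0RN', 'Kathryn': 'K0RN'}


def group_by_code(codes):
    """Map each phonetic code to the list of words that produced it."""
    groups = defaultdict(list)
    for word, code in codes.items():
        groups[code].append(word)
    return dict(groups)


print(group_by_code(soundex_codes))
print(group_by_code(metaphone_codes))
```

With these codes, soundex puts Catherine and Kathryn in separate groups because it keeps the first letter of each word, while metaphone puts them in the same group. Differences like this are what to look for when choosing an encoding.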
Next, I take the encodings above and use levenshtein_distance to find the closest match for each word; other comparison functions would work as well.
# Select the closest word under each encoding. levenshtein_distance is used
# here, but any other string comparison function could be swapped in.
codes = report.set_index('word')
report2 = pd.DataFrame(index=codes.index)
for sounds_encoding in sounds_encoding_methods:
    name = sounds_encoding.__name__
    closest_words = []
    for word in dataList:
        candidates = [w for w in dataList if w != word]
        # Distance between the phonetic codes; lower means more similar.
        closest_words.append(min(
            candidates,
            key=lambda w: levenshtein_distance(codes.loc[word, name], codes.loc[w, name]),
        ))
    report2[name] = closest_words
print(report2)
           soundex    metaphone  nysiis     match_rating_codex
word
two        too        too        too        too
too        two        to         to         to
to         two        too        too        too
fourth     forth      forth      forth      forth
forth      fourth     fourth     fourth     fourth
dessert    desert     desert     desert     desert
desert     dessert    dessert    dessert    dessert
Byrne      Boern      Boern      Boern      Boern
Boern      Byrne      Byrne      Byrne      Byrne
Smith      Smyth      Smyth      Smyth      Smyth
Smyth      Smith      Smith      Smith      Smith
Catherine  Kathryn    Kathryn    Kathryn    Kathryn
Kathryn    Catherine  Catherine  Catherine  Catherine
As the results above show, combining a phonetic encoding with a string comparison algorithm is straightforward.
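Since the question asks for a number rather than a best match, it may help to see what the distance between two phonetic codes looks like. The following is a dependency-free sketch, not the answer's own code: `levenshtein` is a plain-Python stand-in for jellyfish's levenshtein_distance, `code_similarity` is a hypothetical helper, and scaling the distance to the range 0 to 1 is my assumption about what a convenient score would be. The input codes are copied from the match_rating_codex and metaphone columns of the first table.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def code_similarity(code_a, code_b):
    """Scale the distance to [0, 1]; 1.0 means the codes are identical."""
    longest = max(len(code_a), len(code_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(code_a, code_b) / longest


# Phonetic codes taken from the first table (not recomputed here).
for code_a, code_b in [('SMTH', 'SMYTH'), ('CTHRN', 'KTHRYN'), ('K0RN', 'K0RN')]:
    print(code_a, code_b, levenshtein(code_a, code_b), code_similarity(code_a, code_b))
```

Normalizing by the longer code keeps short codes such as T and TW from looking artificially close just because they have few characters to differ in. In practice you would encode the words with jellyfish first and pass the resulting codes to whichever distance function you prefer.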