Speeding up a comparison function used to compare sentences


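I have a dataframe of shape (789174, 9). One column, named resolution, contains sentences shorter than 139 characters. I built a function that uses the difflib library to find sentences whose similarity is above 0.9. I have a virtual machine with 96 CPUs and 384 GB of RAM. The function has now been running for more than 2 hours and still has not reached i = 1000. I am worried that it will take far too long, and I would like to know whether there is a way to speed it up.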

def replace_similars(input_list):
    # Replaces %90 and more similar strings
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None,input_list[i],input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

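Obviously, since we iterate over the column twice, this is O(n^2). I am not sure whether there is a way to make it faster. Any suggestions would be greatly appreciated.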

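Edit:

I have tried to speed this up using difflib and fuzzywuzzy. The function only goes through the column once, but it does iterate over the dictionary keys.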

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string,clusters.keys(),cutoff=0.9)
        
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string,m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters
mappings = cluster_resolution(df_sample)

Is it possible to speed up the latter function?

Here is a sample of the data in the dataframe:

d = {'resolution' : [
    'replaced scanner',
    'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use',
    'tc reimage',
    'updated pc',
    'deploying replacement scanner',
    'upgraded and rebooted station',
    'printer has been reconfigured',
    'cleared linux print queue and Now it is working',
    'user reset her password successfully closing tt',
    'have reset the printer to get it to print again',
    'i plugged usb cable into port and scanner works',
    'reconfigured hand scanner and linked to station',
    'replaced the scanner with station is functional',
    'laptops battery needed to be reset asset serial',
    'reconfigured scanner confirmed that it scans as intended',
    'reimaging laptop corrected the anyconnect software issue',
    'printer was unplugged from usb port working properly Now',
    'reconnected usb cable and reassign printer ports on port',
    'reconfigured scanner to base and tested with aa all fine',
    'replaced the defective device with a fresh imaged laptop',
    'reconfigured the printer and the media to print properly',
    'tested printer at station connected and working resolved',
    'red scanner reconfigured and base rebooted via usb joint',
    'station scanner was synced to base and station and is Now working',
    'printer offlineswitched usb portprinter is Now online and working',
    'replaced the barcode label with one reflecting the tcs ip address',
    'restarted the thin client by using ssh to run the restart command',
    'printer reconfigured and test they are functioning normally again',
    'removed old printer for service installed replacement tested good',
    'tc required reboot rebooted tc had aa signin dp is Now functional',
    'resetting the printer to factory settings and then reconfigure it',
    'updated windows os forced update and the laptop operated normally',
    'printer settings are set correct and printer is working correctly',
    'power to printer was disconnected reconnected and is working fine',
    'power cycled equipment and restocked spooler with plastic bubbles',
    'laptop checked ive logged into paskiplacowepl without any problem',
    'reseated scanner cables connection into usb port to resolve issue',
    'the scanner has been replaced and the station is working well Now',
]}

df = pd.DataFrame(data=d)

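How I define similarity:

Similarity is really defined by the overall action taken, for example replaced scanner and replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use. The overall action in the longer string is replacing the scanner, so the two are very similar. That is why I chose the partial_ratio function, since they score 100 with it.

Note:

Please refer to the second function, cluster_resolution, since that is the one I want to speed up. The first function (replace_similars) will not be useful.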

Solution

import difflib
import time

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1,len(input_list)):
            # Only run the expensive comparison when the lengths differ by less than 15 characters
            if abs(len(input_list[i]) - len(input_list[j])) < 15:
                if difflib.SequenceMatcher(None,input_list[i],input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

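Although this may still not be a practical solution, since it would take roughly 90 years even if each iteration took 0.1 seconds, it is at least a more optimized one: the inner loop starts at i+1, and the expensive comparison is skipped for pairs whose lengths differ too much.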
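The length check can be pushed a little further. The sketch below is only one possible variant, not a measured improvement: it relies on `SequenceMatcher.real_quick_ratio()` and `quick_ratio()`, which difflib documents as upper bounds on `ratio()`, so a pair can be rejected as soon as either bound falls below the threshold. The helper name `similar_enough` is made up for illustration. Since `ratio()` is 2*M/T (M matched characters, T total length of both strings), a ratio of 0.9 requires the shorter string to be at least 9/11 (about 82%) as long as the longer one, which is what the first bound tests.

```python
import difflib


def similar_enough(matcher, threshold=0.9):
    """Return True if matcher.ratio() >= threshold, trying cheap upper bounds first."""
    return (
        matcher.real_quick_ratio() >= threshold  # bound from the two lengths only
        and matcher.quick_ratio() >= threshold   # bound from shared character counts
        and matcher.ratio() >= threshold         # full, expensive comparison
    )


def replace_similars(input_list, threshold=0.9):
    """In place, replace each later string by the first earlier string it resembles."""
    matcher = difflib.SequenceMatcher()
    for i in range(len(input_list)):
        # SequenceMatcher caches information about its second sequence, so the
        # string that stays fixed in the inner loop is set as seq2. Note that this
        # swaps the argument order compared with the code above, and ratio() is
        # not guaranteed to be symmetric.
        matcher.set_seq2(input_list[i])
        for j in range(i + 1, len(input_list)):
            matcher.set_seq1(input_list[j])
            if similar_enough(matcher, threshold):
                input_list[j] = input_list[i]
```

This is still O(n^2) in the number of strings; it only makes the rejected pairs cheaper.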

Regarding your last edit, I would make a few changes (mainly using fuzzywuzzy.process rather than calling fuzzywuzzy.fuzz directly):

import difflib
from fuzzywuzzy import fuzz,process

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():
        # First look for an existing cluster key that is at least 90% similar
        match1 = difflib.get_close_matches(string,clusters.keys(),cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            # Otherwise fall back to the single best partial match among the keys
            bests = process.extractBests(
                    string,set(clusters.keys())-{string},scorer=fuzz.partial_ratio,score_cutoff=80,limit=1
                    )

            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [ string ]
    return clusters

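But I think you could look further into other solutions, such as CountVectorizer combined with whichever distance metric suits it. That is one way to gain speed (because it is vectorized), although the results may not be perfect. Note that CountVectorizer may be a good fit for you precisely because you have already settled on partial_ratio.

For example, something like this: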

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist,squareform
import hdbscan

df = pd.DataFrame(d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
# get_feature_names_out() needs scikit-learn >= 1.0 (older versions: get_feature_names())
transformed = pd.DataFrame(
        transformed.toarray(),columns=cv.get_feature_names_out(),index=df['resolution'])

#keep only the words that occur more than twice overall
transformed = transformed[transformed.columns[transformed.sum()>2]]

#compute the distance matrix (number of words whose counts differ)
dist = pdist(transformed,metric="hamming") * transformed.shape[1]
s = squareform(dist)

clusterer = hdbscan.HDBSCAN(metric='precomputed',min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))

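I think this is still far from perfect (this is my first attempt at text clustering...). You can also give CountVectorizer your own list of stop words, which would be one way to help the algorithm. At the very least, it can help you pre-cluster the dataset before applying the earlier function, for example: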

# pass each group as a dataframe, since cluster_resolution reads df['resolution']
df.groupby('labels').apply(cluster_resolution)

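(That way, if the first clustering is roughly right, each value is only checked against the other values in its own cluster rather than against all values.)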
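To see why pre-clustering helps, the small sketch below counts pairwise comparisons with and without grouping. The labels are made up for illustration; in practice they would come from the `labels` column, and the helper names are hypothetical.

```python
from collections import Counter


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def comparisons_within_groups(labels):
    """Pairwise comparisons when items are only compared inside their own group."""
    return sum(pair_count(size) for size in Counter(labels).values())


# Hypothetical cluster labels for 10 sentences (not taken from the real data).
labels = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]

print(pair_count(len(labels)))            # all-against-all: 45
print(comparisons_within_groups(labels))  # inside each pre-cluster only: 12
```

One caveat: HDBSCAN assigns the label -1 to points it treats as noise, so those rows end up in one leftover group that may still be large.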

Thanks to @anon01 for the distance-matrix computation in this answer; its results seem slightly better than hdbscan's defaults.
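As a rough picture of what that distance measures: `pdist(..., metric="hamming")` returns the fraction of positions at which two vectors differ, so multiplying by the number of columns gives the number of vocabulary words whose counts differ. A minimal standard-library sketch, using a made-up two-sentence corpus and hypothetical helper names:

```python
from collections import Counter


def count_vector(sentence, vocabulary):
    """Bag-of-words counts for one sentence, in vocabulary order."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


def differing_positions(u, v):
    """Hamming distance times the vector length: positions where u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


sentences = ["replaced scanner", "replaced printer cable"]
vocabulary = sorted({word for s in sentences for word in s.split()})
vectors = [count_vector(s, vocabulary) for s in sentences]

print(vocabulary)
print(differing_positions(vectors[0], vectors[1]))
```

Sentences that share most of their words get a small distance regardless of word order, which is the property the clustering relies on.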

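Edit:

Another attempt, which includes:

a change of metric,
an extra step that applies a TF-IDF model,
and an extra step that lemmatizes the words with the nltk package.

So this would be: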

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist,squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet

d = {...}
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    
    tag_dict = {
                "J": wordnet.ADJ,"N": wordnet.NOUN,"V": wordnet.VERB,"R": wordnet.ADV,}

    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence) 
    
    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)   
    
    # Convert the list of (token,tag) to lemmatized tokens
    lems = [
            lemmatizer.lemmatize(token,tag_dict.get(tag[0],wordnet.NOUN) )
            for token,tag
            in tagged
            ]

    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)

corpus = df['lemmatized']
pipe = Pipeline(
        [
                ('cv',CountVectorizer(stop_words="english")),('tfid',TfidfTransformer())
         ]).fit(corpus)

transformed = pipe.transform(corpus)
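# get_feature_names_out() needs scikit-learn >= 1.0 (older versions: get_feature_names())
transformed = pd.DataFrame(
        transformed.toarray(),columns=pipe.named_steps['cv'].get_feature_names_out(),index=df['resolution'])

# named dist so that the data dictionary d is not overwritten
dist = pdist(transformed,metric="cosine") * transformed.shape[1]
s = squareform(dist)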

clusterer = hdbscan.HDBSCAN(metric="precomputed",min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
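For intuition about the TF-IDF and cosine steps, here is a small standard-library sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as the `TfidfTransformer` default, skips stop-word removal and lemmatization, and uses made-up sentences; the helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one {word: weight} dict per sentence, scaled to unit length."""
    tokenised = [s.split() for s in sentences]
    n_docs = len(tokenised)
    # Document frequency: in how many sentences each word appears.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in Counter(tokens).items()
        }
        # Fall back to 1.0 so that an empty sentence does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({word: w / norm for word, w in weights.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity, assuming u and v are already unit length."""
    return 1.0 - sum(weight * v.get(word, 0.0) for word, weight in u.items())


vecs = tfidf_vectors(["replaced scanner", "scanner replaced", "updated pc"])
print(cosine_distance(vecs[0], vecs[1]))  # same words, different order
print(cosine_distance(vecs[0], vecs[2]))  # no words in common
```

Words that appear in many sentences get a lower weight, so frequent terms influence the distance less than rarer, more specific ones.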

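You could also add some domain-specific code, since your example seems to be about very specific maintenance logs.

For example, you could add new features to the transformed dataframe based on a small list of hardware/software terms: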

import numpy as np

#To create a feature about the OS:
cols = ['os','linux','window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

#To create a feature about hardware:
cols = ["laptop","printer","scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

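This step may help to get better results, but it may not be necessary. I am not sure how it compares with FuzzyWuzzy when it comes to matching strings, but I would be interested in your feedback!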
