如何使用fuzzywuzzy根据dataframe1对dataframe2进行排序

如何解决如何使用fuzzywuzzy根据dataframe1对dataframe2进行排序

我知道这是个老问题，事实上我已经看到很多与我的问题相关的链接：

Using fuzzywuzzy to create a column of matched results in the data frame

How to compare a value in one dataframe to a column in another using fuzzywuzzy ratio

What's the best way to use fuzzywuzzy to compare each value of a column with all the values of a separate dataframe's column?

但是我没有得到任何适当的解决方案

下面是我的代码：

g = [{'column1': 'ryzen 5 5600'},{'column1':'ram 8 gb ddr4 3.2ghz'},{'column2':'SSD 
                                     220gb'},{'column3':'windows 10 prof'},{'column2':'ryzen 5 3600'},{'column1':'ram 16 gb ddr4'}]

df1=pd.read_excel('product1.xlsx',header=None,index_col=False)
s = []
for l in df1.values:
    l = ','.join(l)
    s.append(l)

s = ','.join(s)    

MIN_MATCH_score = 30
guessed_word = [d for d in g if fuzz.token_set_ratio(s,list(d.values())[0]) >= 30]

product1 包含：

0   GB ddr4
1   HDD 256GB
2   SSD
3   ryzen 5
4   Win 10 Pro

guessed_word 包含：

#gives good output

[{'column1': 'ryzen 5 5600'},{'column1': 'ram 8 gb ddr4 3.2ghz'},{'column2': 'SSD 220gb'},{'column3': 'windows 10 prof'},{'column2': 'ryzen 5 3600'},{'column1': 'ram 16 gb ddr4'}]

附加到数据框后：

df3 = pd.Dataframe(guessed_word)

df3 包含：

column1                  column2          column3
     
ryzen 5 5600             SSD 220gb        windows 10 prof
ram 8 gb ddr4 3.2ghz     ryzen 5 3600
ram 16 gb ddr4

但我想要以下输出：

#product1              column1                                column2          column3
0   GB ddr4            ram 8 gb ddr4 3.2ghz,ram 16 gb ddr4   NAN               NAN
1   HDD 256GB          NAN                                    NAN               NAN 
2   SSD                NAN                                    SSD 220gb               NAN
3   ryzen 5            ryzen 5 5600                           ryzen 5 3600      NAN
4   Win 10 Pro         NAN                                    NAN        windows 10 prof

是否可以使用 df.sort_values 或其他任何东西进行排序？我试过了，但没有一个工作。

解决方法

代码有点长，但它完全符合您的预期。

import re
import pandas as pd
#from fuzzywuzzy import fuzz,process

class CustomMatcher:
    def add_space_before_numbers(self,text):
        return (re.sub(r'([0-9\.]+)',r' \1',text)).replace('  ',' ')

    def add_space_before_numbers(self,text):
        return re.sub(r'([0-9\.]+)',text)

    def add_space_after_numbers(self,text):
        return re.sub(r'([0-9\.]+)([^0-9\.])',r'\1 \2',text)

    def pad_spaces(self,text):
        result = self.add_space_before_numbers(text)
        result = self.add_space_after_numbers(result)
        return result.replace('  ',' ')
        
    def partial_word_score(self,word1,word2):
        score = 0
        len1 = len(word1)
        len2 = len(word2)
        if len2 > len1:
            temp = word2
            word2 = [*word1]
            word1 = [*temp]
        else:
            word2 = [*word2]
            word1 = [*word1]

        for i,char in enumerate(word2):
            if word1[i] == char:
                score = score + 1
        if min(len1,len2) != 0:
            return (score*100) / min(len1,len2)
        else:
            return 0

    def match(self,comparand,target):
        len_c = len(comparand)
        len_t = len(target)
        comparand_words = self.pad_spaces(comparand.lower()).split(' ')
        target_words = self.pad_spaces(target.lower()).split(' ')

        complete_score = 0
        for t_word in target_words:
            for c_word in comparand_words:
                len1 = len(t_word)
                len2 = len(c_word)
                word_score = self.partial_word_score(t_word,c_word)\
                    * (min(len1,len2) / min(len_c,len_t))
                complete_score = complete_score + word_score
        return complete_score

search_array = [
    {'column1': 'ryzen 5 5600'},{'column1': 'ram 8 gb ddr4 3.2ghz'},{'column2': 'SSD 220gb'},{'column3': 'windows 10 prof'},{'column2': 'ryzen 5 3600'},{'column1': 'ram 16 gb ddr4'}
]

search_dict = {}
for entry in search_array:
    key = [*entry][0]
    value = entry[key]
    if key in [*search_dict]:
        search_dict[key].append(value)
    else:
        search_dict[key] = [value]

filename = 'product1.xlsx'
products_sheet = pd.read_excel(filename,header=None,index_col=False)
#word_set = ','.join([x[0] for x in products_sheet.values.tolist()])
#MIN_MATCH_SCORE = 30
products_list = [x[0] for x in products_sheet.values.tolist()]
# Column #1
result_data = {}
result_data[filename.replace('.xlsx','')] = products_list

# Initialize columns #2-#n and populate it with placeholder values
columns = [*search_dict]
for column in columns:
    result_data[column]=list(range(products_list.__len__()))

for row_no,row in enumerate(products_list):
    for column in columns:
        matched_products_list=[]
        for product in search_dict[column]:
            print(f'Comparing {row} to {product} is:\t',end='')
            cm = CustomMatcher()
            matching_score = cm.match(row,product)
            if matching_score>50:
            #if fuzz.token_set_ratio(row,product)>25:
                print(matching_score,' accepted')
                matched_products_list.append(product)
            else:
                print (matching_score,' rejected')
        if (matched_products_list != []):
            result_data[column][row_no] = matched_products_list
        else:
            result_data[column][row_no] = 'NAN'

result_df = pd.DataFrame(data=result_data)
print(result_df)

注意事项：

我创建了一个 CustomMatcher 而不是使用这个 fuzzywuzzy 的东西，它太疯狂了，无法有一个有意义的阈值水平来过滤。 CustomMatcher 在计算分数时基于单词，但基于比较的字母。在用空格填充数字后，它将数字隔离为要匹配的单词。这 50 多行可以通过函数 CustomMatcher.match(word1,word2) 轻松访问，我使用 matching_score>50 作为您的应用程序匹配的合理敏感度阈值。
您不需要在单个单元格中的条目之间定义连接，相反，我使用了可以轻松访问每个单元的列表。
输出被打包为一个熊猫数据框。

谢谢，

如何使用fuzzywuzzy根据dataframe1对dataframe2进行排序

如何解决如何使用fuzzywuzzy根据dataframe1对dataframe2进行排序

解决方法

相关推荐