Is there a way to modify this code to reduce the runtime?

So I want to modify this code to reduce the runtime of the fuzzywuzzy matching. Currently a dataset of about 800 rows takes roughly an hour, and when I used this on a 4.5K-row dataset it ran for almost 6 hours with still no result, so I had to stop the kernel.
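Presumably the runtime grows quadratically, since every row is scored against every other row; a rough back-of-the-envelope estimate from the one-hour figure for 800 rows (illustrative numbers only):

# Every row is scored against every other row, so the work grows as n^2
rows_baseline, hours_baseline = 800, 1.0
for n in (4_500, 20_000):
    estimate = hours_baseline * (n / rows_baseline) ** 2
    print(f"{n} rows -> roughly {estimate:.0f} hours")   # ~32 h and ~625 h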

I need to use this code on datasets of at least 20K rows. Can anyone suggest any edits to this code to get results faster? Here is the code -

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz,process

df = pd.read_csv(r'path')
df.head()

data = df['Body']
print(data)

clean = []
threshold = 80 
for row in data:
  # score each sentence against each other
  # [('string',score),..]
  scores = process.extract(row,data,scorer=fuzz.token_set_ratio)
  # basic idea is if there is a close second match we want to evaluate 
  # and keep the longer of the two
  if scores[1][1] > threshold:
     clean.append(max([x[0] for x in scores[:2]],key=len))
  else:
     clean.append(scores[0][0])

# remove dupes (list rather than set, since pd.DataFrame does not accept a set)
clean = list(set(clean))

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean,columns=['Body'])

clean_data.to_csv(r'path') 

This is what my data looks like -

https://docs.google.com/spreadsheets/d/1p9RC9HznhdJFH4kFYdE_TgnHdoRf8P6gTEAkB3lQWEE/edit?usp=sharing

So if you notice, rows 14 and 15 (and rows 19 and 20) are partial duplicates of each other. I want the code to identify such sentences and delete the shorter one.
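For example (made-up sentences, not rows from the actual sheet), token_set_ratio scores a sentence that is fully contained in a longer one as a near-perfect match, which is what the threshold check relies on:

from fuzzywuzzy import fuzz

# Hypothetical partial duplicates: every token of the shorter sentence
# also appears in the longer one, so token_set_ratio maxes out
a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy dog and then runs away"
print(fuzz.token_set_ratio(a, b))   # 100 -- above the threshold of 80, so only b is kept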

Update -

I made a small change to the RapidFuzz solution given by @DarrylG, and now the code looks like this -

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

df = pd.read_excel(r'path')

data = df['Body']
print(data)

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data,columns=columns)


clean_rapid = []
threshold = 80 

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query,processed_data,scorer=rapid_token_set_ratio,score_cutoff=threshold)
        if len(scores) > 1 and scores[1][1] > threshold:
            m = max(scores[:2],key = lambda k:len(k[0]))                # Of up to two matches above threshold, takes longest
            clean_rapid.append(m[0])                                    # Saving the longer (processed) string
        else:
            clean_rapid.append(query)

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe(r'path')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

process_rapid_fuzz(data)                                             # Fills the global clean_rapid list
print(f'Elapsed time {time.time() - t0}')

# remove dupes (list rather than set, since pd.DataFrame does not accept a set)
clean_rapid = list(set(clean_rapid))

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean_rapid,columns=['Body'])

#exporting the cleaned data
clean_data.to_excel(r'path')

The problem now is that in the output file all the full stops etc. have been removed. How can I keep them?

Solution

The approach uses RapidFuzz, based on an answer to Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column.

Results

  • OP's FuzzyWuzzy approach: 2565.7 seconds
  • RapidFuzz approach: 649.5 seconds

Thus: a ~4x improvement

  • Note: the test data (~2K records from the OP's Google Sheet Data) was downloaded to a local Excel workbook.

RapidFuzz Implementation

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data,columns=columns)

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    clean_rapid = []
    threshold = 80 
    for query in processed_data:
        scores = process_rapid.extract(query,processed_data,scorer=rapid_token_set_ratio,score_cutoff=threshold)
        
        m = max(scores[:2],key = lambda k:len(k[0]))                # Of up to two matches above threshold,takes longest
        clean_rapid.append(m[-1])                                    # Saving the match index
        
    clean_rapid = sorted(set(clean_rapid))                           # remove duplicate indexes (sorted list for Pandas indexing)

    return data[clean_rapid]                                         # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe('Duplicates1.xlsx')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

Version of the Posted Code Used for Comparison

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz,process
import openpyxl
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data,columns=columns)

def process_fuzzy_wuzzy(data):
    clean = []
    threshold = 80 
   
    for idx,query in enumerate(data):
        # score each sentence against each other
        # [('string',score),..]
        scores = process.extract(query,data,scorer=fuzz.token_set_ratio)
        # basic idea is if there is a close second match we want to evaluate 
        # and keep the longer of the two
        if len(scores) > 1 and scores[1][1] > threshold:    # If second one is close
            m = max(scores[:2],key=lambda k:len(k[0]))
            clean.append(m[-1])
        else:
            clean.append(idx)

    # remove duplicate indexes (sorted list for Pandas indexing)
    clean = sorted(set(clean))
    return data[clean]                                        # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
# Get DataFrame for sheet from Excel
df = excel_sheet_to_dataframe('Duplicates1.xlsx')  

# Will Process data in 'body' column of DataFrame
data = df['Body'].dropna()                                    # Dropping None rows (few None rows at end after Excel import)

# Process Data (Pandas Series)
result_fuzzy_wuzzy = process_fuzzy_wuzzy(data)
print(f'Elapsed time {time.time() - t0}')

This answers the second part of your question. processed_data contains the preprocessed strings, so the queries are already preprocessed. By default, this preprocessing is done by process.extract; DarrylG moved it in front of the loop, so the strings are not preprocessed multiple times. If you do not want the strings to be preprocessed at all, you can iterate over the original data directly. Change:

series = (rapid_utils.default_process(d) for d in data)
processed_data = pd.Series(series)   

for query in processed_data:

to:

for query in data:
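As background for why the full stops disappeared in the first place: default_process lowercases the text and replaces non-alphanumeric characters (punctuation included) with whitespace. A quick check with a made-up sentence:

from rapidfuzz import utils as rapid_utils

# default_process lowercases and replaces non-alphanumeric characters
# (including full stops) with whitespace before trimming the ends
print(rapid_utils.default_process("This is a sentence. It has punctuation!"))
# e.g. -> 'this is a sentence  it has punctuation'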

If you want the original behavior, but want the unprocessed strings in the results, you can use the index of the result string to fetch the unprocessed string:

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query,processed_data,scorer=rapid_token_set_ratio,score_cutoff=threshold,limit=2)
        m = max(scores[:2],key = lambda k:len(k[0]))
        clean_rapid.append(data[m[2]])                              # m[2] is the match's index into the Series, used to fetch the unprocessed string

There are a couple of possible further improvements to the implementation:

  1. You can make sure the current query is not matched with itself by replacing it with None in processed_data, and then use process.extractOne to find the next best match above the threshold. This is guaranteed to be at least as fast as process.extract, and will probably be significantly faster:

     def process_rapid_fuzz(data):
         '''
             Process using rapid fuzz rather than fuzz_wuzzy
         '''
         series = (rapid_utils.default_process(d) for d in data)
         processed_data = pd.Series(series)

         for idx,query in enumerate(processed_data):
             # None is skipped by process.extract/extractOne, so the query will never match itself
             processed_data[idx] = None
             match = process_rapid.extractOne(query,processed_data,scorer=rapid_token_set_ratio,score_cutoff=threshold)
             # compare the length using the original strings
             # (alternatively len(match[0]) > len(query)
             # if you do want to compare the length of the processed version)
             if match and len(data[match[2]]) > len(data[idx]):
                 clean_rapid.append(data[match[2]])
             else:
                 clean_rapid.append(data[idx])

  2. You compare each element of processed_data with every other element of processed_data. This means you always perform both the comparison data[n] <-> data[m] and the comparison data[m] <-> data[n], even though they are guaranteed to give the same result. Performing each comparison only once should save around 50% of the runtime; a sketch of this is given after the list.
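A minimal sketch of the second improvement (process_rapid_fuzz_half is a hypothetical name, not part of the original answer): because token_set_ratio is symmetric, each string is matched only against the strings that come after it in the Series, so every pair is scored exactly once:

import pandas as pd
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils

def process_rapid_fuzz_half(data):
    '''
        Hypothetical sketch: score each pair of strings only once by matching
        every string only against the strings that come after it.
    '''
    threshold = 80
    processed_data = pd.Series(rapid_utils.default_process(d) for d in data)
    keep = set(range(len(processed_data)))              # positions still treated as unique

    for idx, query in enumerate(processed_data):
        if idx not in keep:
            continue                                    # already dropped as a shorter duplicate
        # Only search the tail of the Series: pairs (idx, j) with j > idx
        match = process_rapid.extractOne(query, processed_data[idx + 1:],
                                         scorer=rapid_token_set_ratio,
                                         score_cutoff=threshold)
        if match:
            other = match[2]                            # Series index of the matching row
            # Keep the longer of the two original strings, drop the shorter
            shorter = idx if len(data.iloc[idx]) <= len(data.iloc[other]) else other
            keep.discard(shorter)

    return data.iloc[sorted(keep)]                      # unprocessed strings, original row order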
