Finding duplicates that match specific conditions in Python

Below is the sample data I am working with.

sender  receiver  date      id
salman  akhtar    20161201  1111
akhtar  salman    20161201  1112
nabeel  ahmed     20161201  1113
salman  akhtar    20161201  1114
salman  akhtar    20161202  1115
nabeel  ahmed     20161202  1116
ahmed   nabeel    20161202  1117
nabeel  ahmed     20161202  1118
nabeel  ahmed     20161202  1119

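What I want to achieve is to find duplicate entries, where a duplicate means the same sender and the same receiver on the same date.

To do that, I wrote the following code.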

import pandas as pd
import xlsxwriter  # engine used by pd.ExcelWriter below

print('Script for Finding duplicate entries\n')

path = input('Enter file name: ')
print('Loading file. Please wait...')

xlsx = pd.ExcelFile(path + '.xlsx')

print('File loaded successfully.\n')
sheet = input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet)

# Flag every row whose (sender, receiver, date) combination occurs more than once
df['is_duplicated'] = df.duplicated(['sender', 'receiver', 'date'], keep=False)

df_dup = df.loc[df['is_duplicated']]

print('Found Below Duplicates')
print(df_dup)

# The context manager writes and closes the workbook on exit
with pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter') as writer:
    df_dup.to_excel(writer, sheet_name='Sheet1')

print('File created successfully.')

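Now I want to bring in fuzzywuzzy, because the current code only returns EXACT duplicates, and I want all possible duplicate rows based on the condition above.

Can anyone help?

Solution:

Something like this?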

>>> from fuzzywuzzy import fuzz
>>> fuzz_ratio = 50
>>> df_rem = df[~df.duplicated(['sender', 'receiver', 'date'], keep=False)]
>>> df_possible_dup = pd.merge(df_rem, df, on='date', suffixes=['', '_j'])
>>> mask = df_possible_dup.apply(lambda x: fuzz.ratio(x['sender'], x['sender_j']) >= fuzz_ratio and x['id'] != x['id_j'], axis=1)
>>> df_possible_dup[mask]
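To get a feel for what a threshold such as 50 means, the scoring can be approximated with the standard library alone. The sketch below is not fuzzywuzzy itself: it uses difflib.SequenceMatcher as a stand-in that produces a 0 to 100 score in the same spirit as fuzz.ratio. The helper name similarity and the misspelling "salmaan" are made up for illustration and do not appear in the question's data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings.

    Assumption: this is a stand-in for fuzz.ratio built on difflib,
    not a call into fuzzywuzzy.
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Identical names score 100, unrelated names score low, and a
# near-miss spelling (hypothetical "salmaan") lands in between.
for left, right in [("salman", "salman"),
                    ("salman", "akhtar"),
                    ("salman", "salmaan")]:
    print(left, right, similarity(left, right))
```

With a threshold of 50, the two unrelated names would be rejected while the near-miss spelling would be kept as a possible duplicate. Raising the threshold makes the match stricter.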

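I don't know your exact requirements, but perhaps you want to check that either the sender or the receiver matches exactly while the other one only matches approximately. In that case you can use a custom function: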

def worker(x, fuzz_ratio):
    # Skip pairs where a row has been merged with itself
    if x['id'] == x['id_j']:
        return False

    # Same sender, similar receiver
    if x['sender'] == x['sender_j'] and fuzz.ratio(x['receiver'], x['receiver_j']) > fuzz_ratio:
        return True

    # Same receiver, similar sender
    if x['receiver'] == x['receiver_j'] and fuzz.ratio(x['sender'], x['sender_j']) > fuzz_ratio:
        return True

    return False

>>> df_possible_dup.apply(lambda x: worker(x, fuzz_ratio), axis=1)
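The same rule can be tried end to end without pandas or fuzzywuzzy. The following is only a sketch under a few assumptions: rows are plain dicts rather than a DataFrame, difflib.SequenceMatcher stands in for fuzz.ratio, and row 1120 with the misspelled sender "salmaan" is a hypothetical entry added so that there is something fuzzy to find. The names similarity and is_possible_duplicate are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """0-100 score; a difflib-based stand-in for fuzz.ratio (assumption)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def is_possible_duplicate(row, other, threshold):
    """Same date, one side equal, the other side scoring above threshold."""
    if row["date"] != other["date"] or row["id"] == other["id"]:
        return False
    if (row["sender"] == other["sender"]
            and similarity(row["receiver"], other["receiver"]) > threshold):
        return True
    if (row["receiver"] == other["receiver"]
            and similarity(row["sender"], other["sender"]) > threshold):
        return True
    return False


rows = [
    {"sender": "salman", "receiver": "akhtar", "date": 20161201, "id": 1111},
    {"sender": "akhtar", "receiver": "salman", "date": 20161201, "id": 1112},
    {"sender": "nabeel", "receiver": "ahmed", "date": 20161201, "id": 1113},
    # Hypothetical misspelled entry, not part of the original sample data
    {"sender": "salmaan", "receiver": "akhtar", "date": 20161201, "id": 1120},
    {"sender": "salman", "receiver": "akhtar", "date": 20161202, "id": 1115},
]

# Compare every unordered pair of rows once
pairs = [(a["id"], b["id"])
         for a, b in combinations(rows, 2)
         if is_possible_duplicate(a, b, threshold=50)]
print(pairs)
```

Only rows 1111 and 1120 are paired here: they share a date and a receiver, and their senders differ by one letter. Exact duplicates would be paired as well, since identical strings score 100, so filter them out first if you only want the approximate matches.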
