在多行 DataFrame 上应用自定义函数

如何解决在多行 DataFrame 上应用自定义函数

我目前正在开发一个函数，该函数将根据多种条件（平方米、图像和价格）检测该行是否重复。它工作得很好，直到找到重复项，从 DataFrame 中删除该行，然后我的 for 循环受到干扰。这会产生 IndexError: single positional indexer is out-of-bounds。

def image_duplicate(df):
    # Detecting duplicates based on the publications' images,m2 and price.
    for index1 in df.index:
        for index2 in df.index:
            if index1 == index2:
                continue
            print("index1: {} \t index2: {}".format(index1,index2))

            img1 = Image.open(requests.get(df["img_url"].iloc[index1],stream=True).raw).resize((213,160))
            img2 = Image.open(requests.get(df["img_url"].iloc[index2],160))
            img1 = np.array(img1).astype(float)
            img2 = np.array(img2).astype(float)

            ssim_result = ssim(img1,img2,multichannel=True)

            ssim_result_percentage = (1 + ssim_result) / 2

            if (
                ssim_result_percentage > 0.80
                and df["m2"].iloc[index1] == df["m2"].iloc[index2]
                and df["Price"].iloc[index1] == df["Price"].iloc[index2]
            ):
                df.drop(df.iloc[index2],inplace=True).reindex()


image_duplicate(full_df)

什么是解决这个问题的好方法？

编辑：示例：

预期输出：从 DataFrame 中移除 One bedroom 行 [2]。

解决方法

从您的问题看来（如果我错了，请纠正我）您需要迭代索引（笛卡尔积）并从原始数据框中删除第二个索引（在您的示例中为 index2）。>

我会推荐这样的东西来解决您的问题：

import itertools

def image_duplicate(df):
    # Detecting duplicates based on the publications' images,m2 and price.
    indexes_to_drop = []
    for index1,index2 in itertools.product(df.index,df.index):
        if index1 == index2:
            continue
        print("index1: {} \t index2: {}".format(index1,index2))

        img1 = Image.open(requests.get(df["img_url"].iloc[index1],stream=True).raw).resize((213,160))
        img2 = Image.open(requests.get(df["img_url"].iloc[index2],160))
        img1 = np.array(img1).astype(float)
        img2 = np.array(img2).astype(float)

        ssim_result = ssim(img1,img2,multichannel=True)
        ssim_result_percentage = (1 + ssim_result) / 2

        if (
            ssim_result_percentage > 0.80
            and df["m2"].iloc[index1] == df["m2"].iloc[index2]
            and df["Price"].iloc[index1] == df["Price"].iloc[index2]
        ):
            indexes_to_drop.append(index2)

    indexes_to_drop = list(set(indexes_to_drop))
    return df.drop(indexes_to_remove)


output_df = image_duplicate(full_df)  # `output_df` should contain the expected output

说明：

遍历索引（在这种情况下我更喜欢使用 itertools，但也可以随意使用 for 循环的方法）
创建 indexes_to_drop 列表，而不是在末尾删除，而是将这些索引附加到列表中
获取要删除的唯一索引列表（可能会发生相同的索引将在列表中重复出现） - list(set(indexes_to_drop)) 是如何删除重复项的简单方法（集合不能包含重复项）
立即删除这些索引（不确定您为什么在示例中使用 .reindex）

可能还有其他方法可以改进您的代码，例如不要比较 index2 已经在 indexes_to_drop 列表中的图像（例如检查 if index2 in indexes_to_drop 并在 True 时继续），或者您甚至可以将其转换为可与 {{1} 一起使用的函数} （对 apply 的迭代会在 apply 内部发生），但这不是必需的。

在多行 DataFrame 上应用自定义函数

如何解决在多行 DataFrame 上应用自定义函数

解决方法

相关推荐