Data manipulation in Pandas: create a boolean column based on the values in one column, then fill it with values from another column

OK, I've been trying this for far too long and it's time to ask for help. I have a dataframe that looks like this:

  person  fruit   quantity    all_fruits
0 p1      grapes  2           [grapes,banana]
1 p1      banana  1           [grapes,banana]
2 p2      apple   4           [apple,banana,peach]
3 p2      banana  4           [apple,peach]
4 p2      peach   2           [apple,peach]
5 p3      grapes  1           [grapes]
6 p4      banana  1           [banana]
7 p5      apple   3           [apple,peach]
8 p5      peach   2           [apple,peach]

Then I have a list of "fruits of interest":

fruits_of_interest = ['apple', 'banana']

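What I need to do is:

Create a column for each fruit of interest and record, for each person in column 1 (person), whether she has that fruit.
For each person in column 1, assign under that fruit's column the log(1 + x) of the quantity of that fruit the person has.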

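I'm struggling to get this done! It doesn't help that my actual dataframe is very large, almost 800k rows, and that the "fruits of interest" list contains more than 300 "fruits".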

For the first part, I used this function, and I can get all the columns with booleans telling me whether the fruit is present:

def has_fruit(fruit, row):
    one_string = '\t'.join(row)
    return fruit in one_string

def process_fruits(df,fruits_of_interest):
    for fruit in fruits_of_interest:
        df[fruit] = [has_fruit(fruit,x) for x in df['all_fruits']]
    return df

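The second part, where I need to assign the values, is the part I can't get to work at all! I tried to do everything at once with this other function, but it doesn't quite do what I want: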

def process_fruits2(df,fruits_of_interest):
    for fruit in fruits_of_interest:
        if [has_fruit(fruit,x) for x in df['all_fruits']]:
            df[fruit] = np.log1p(df.loc[df['fruit'] == fruit].quantity)

    return df

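What am I doing wrong, and how can I do this?

Adding the expected output:

It would be a dataframe like this (similar to the answer below, but containing only the fruits in the list fruits_of_interest):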

person  apple     banana                                        
p1      0.000000  0.693147
p2      1.609438  1.609438
p3      0.000000  0.000000
p4      0.000000  0.693147
p5      1.386294  0.000000

Solution
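
As a rough sketch of why process_fruits2 from the question probably misbehaves: the `if [...]` test is applied to a whole list, and a non-empty list is always truthy, so it never filters anything. Separately, assigning the filtered np.log1p(...) Series to df[fruit] aligns on the row index, so every row whose own fruit differs gets NaN, even when the same person has that fruit on another row. The three-row frame below is a hypothetical cut-down of the question's data, used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down of the question's frame (illustration only)
df = pd.DataFrame({
    'person': ['p1', 'p1', 'p3'],
    'fruit': ['grapes', 'banana', 'grapes'],
    'quantity': [2, 1, 1],
})

# A non-empty list is truthy even when every element is False,
# so `if [ ... ]:` in process_fruits2 always takes the branch.
flags = [f == 'apple' for f in df['fruit']]
always_true = bool(flags)

# Assigning a filtered Series aligns on the index: only the matching
# row gets a value, all other rows become NaN.
df['banana'] = np.log1p(df.loc[df['fruit'] == 'banana', 'quantity'])
print(df)
```

Row 0 belongs to p1, who does have a banana, yet it ends up as NaN. This suggests the values need to be aggregated per person rather than assigned row by row.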

Here is one approach. I created a pivot table of persons (rows) by fruits (columns):

from io import StringIO
import numpy as np
import pandas as pd

# create data frame
data = '''person  fruit   quantity
p1      grapes  2
p1      banana  1
p2      apple   4
p2      banana  4
p2      peach   2
p3      grapes  1
p4      banana  1
p5      apple   3
p5      peach   2
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python')

Compute the pivot table and log(1 + x):

# create summary table: person x fruit
df = df.pivot_table(index='person', columns='fruit', values='quantity', aggfunc='sum', fill_value=0)

# compute log(1 + x) of each quantity
print(df, end='\n\n')
print(np.log1p(df))

fruit   apple  banana  grapes  peach
person                              
p1          0       1       2      0
p2          4       4       0      2
p3          0       0       1      0
p4          0       1       0      0
p5          3       0       0      2

fruit      apple    banana    grapes     peach
person                                        
p1      0.000000  0.693147  1.098612  0.000000
p2      1.609438  1.609438  0.000000  1.098612
p3      0.000000  0.000000  0.693147  0.000000
p4      0.000000  0.693147  0.000000  0.000000
p5      1.386294  0.000000  0.000000  1.098612
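
The expected output in the question contains only the fruits in fruits_of_interest, plus a has-fruit indicator. A minimal sketch of one way to get there from the pivot table, assuming the same sample data: reindex the columns after pivoting, so that persons with none of the fruits of interest (p3 here) are kept as rows of zeros. The names counts, has_fruit, log_counts and per_row are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as above, built directly as a DataFrame
df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p2', 'p3', 'p4', 'p5', 'p5'],
    'fruit': ['grapes', 'banana', 'apple', 'banana', 'peach',
              'grapes', 'banana', 'apple', 'peach'],
    'quantity': [2, 1, 4, 4, 2, 1, 1, 3, 2],
})
fruits_of_interest = ['apple', 'banana']

# person x fruit totals, restricted to the fruits of interest;
# a listed fruit that nobody has becomes a column of zeros
counts = (
    df.pivot_table(index='person', columns='fruit', values='quantity',
                   aggfunc='sum', fill_value=0)
      .reindex(columns=fruits_of_interest, fill_value=0)
)

has_fruit = counts > 0          # part 1: boolean indicator per person
log_counts = np.log1p(counts)   # part 2: log(1 + quantity) per person

# Optionally attach the per-person values back onto the original rows
per_row = df.join(log_counts, on='person')

print(log_counts)
```

Filtering the rows with isin before pivoting would also work, but it would drop persons such as p3 who have no fruit of interest, which is why this sketch filters the columns afterwards.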