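How to fix statistical calculations that produce wrong values on a large dataset
I'm using the following code as a small example of a function I intend to run on a large dataset.
For each ID, I incrementally compute statistical features, with elapsed time measured in months.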
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[58685991, '2020-06-01', 2],
     [58685991, '2020-06-01', 1],
     [58685991, '2020-06-01', 0],
     [58685991, '2020-12-05', 7],
     [57839709, '2020-12-01', 5],
     [57839709, '2021-01-08', 3]],
    columns=['ID', 'DATE', 'QTD'])

def monthdelta(a, b):
    # Calendar months between two 'YYYY-MM-DD' strings (the day is ignored)
    a1, a2, a3 = (int(k) for k in a.split('-'))
    b1, b2, b3 = (int(k) for k in b.split('-'))
    return (a1*12 + a2) - (b1*12 + b2)

# Running state, keyed by ID
startdate = {}
sums = {}
sumsqs = {}
num = {}
# One output value per row
stdev = []
means = []
total = []
ind_max = []
ind_min = []
# Running max/min (single variables, reset when a new ID first appears)
ind_maximum = 0
ind_minimum = 0
for row in df.iterrows():  # (index, row) pairs; older pandas: df.T.iteritems()
    id = row[1]['ID']
    if id not in startdate:
        # First row seen for this ID
        num[id] = 1
        startdate[id] = row[1]['DATE']
        sums[id] = row[1]['QTD']
        sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
        means.append( row[1]['QTD'] )
        total.append( row[1]['QTD'] )
        stdev.append( 0 )
        ind_maximum = row[1]['QTD']
        ind_minimum = row[1]['QTD']
        ind_min.append( row[1]['QTD'] )
        ind_max.append( row[1]['QTD'] )
    else:
        num[id] += 1
        sums[id] += row[1]['QTD']
        sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
        # Months elapsed since the first date of this ID, counting the first month
        delta = monthdelta(row[1]['DATE'], startdate[id]) + 1
        means.append( sums[id] / delta )
        total.append( sums[id] )
        if delta == 1:
            stdev.append( 0 )
        else:
            stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))
        if row[1]['QTD'] > ind_maximum:
            ind_max.append( row[1]['QTD'] )
            ind_maximum = row[1]['QTD']
        else:
            ind_max.append( ind_maximum )
        if row[1]['QTD'] < ind_minimum:
            ind_min.append( row[1]['QTD'] )
            ind_minimum = row[1]['QTD']
        else:
            ind_min.append( ind_minimum )
df['MEAN'] = pd.Series(means)
df['STDEV'] = pd.Series(stdev)
df['TOTAL'] = pd.Series(total)
df['MAX'] = pd.Series(ind_max)
df['MIN'] = pd.Series(ind_min)
         ID        DATE  QTD      MEAN     STDEV  TOTAL  MAX  MIN
0  58685991  2020-06-01    2  2.000000  0.000000      2    2    2
1  58685991  2020-06-01    1  3.000000  0.000000      3    2    1
2  58685991  2020-06-01    0  3.000000  0.000000      3    2    0
3  58685991  2020-12-05    7  1.428571  6.301927     10    7    0
4  57839709  2020-12-01    5  5.000000  0.000000      5    5    5
5  57839709  2021-01-08    3  4.000000  1.414214      8    5    3
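The problem is that when I apply this to the large dataset, some IDs get wrong feature values, and I can't figure out why. Some have only one entry and one QTD, yet their mean is above 1.0 and their total is also very high. The other features show the same problem.
I'm not sure whether it's because I build a Series and then assign it as a new column on the DataFrame.
Is there a way to do this by operating on the DataFrame itself with .loc and .iloc? Would that be a safer way to handle the data? I'm not very comfortable with them, so an example would be great.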
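The suspicion about the Series can be checked in isolation. The sketch below assumes, purely for illustration, that the large DataFrame does not have a default 0..n-1 index (for example after sorting or filtering without resetting the index). The small frame and the column names MEAN_SERIES, MEAN_LIST and MEAN_INDEXED are hypothetical. Assigning a Series to a column aligns on index labels, so values can silently land on the wrong rows, while assigning the plain list is positional.

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (e.g. after sort_values)
df = pd.DataFrame({'ID': [1, 1, 2], 'QTD': [2, 1, 5]}, index=[2, 0, 1])
means = [2.0, 3.0, 5.0]  # values computed row by row, in row order

# pd.Series(means) is labelled 0, 1, 2; assignment matches labels, not positions
df['MEAN_SERIES'] = pd.Series(means)

# Assigning the list, or a Series that reuses df.index, keeps row order
df['MEAN_LIST'] = means
df['MEAN_INDEXED'] = pd.Series(means, index=df.index)

print(df)
```

If the real data has a default index, this is not the cause, and the columns built from `pd.Series(...)` will match the lists exactly.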
Solution
Here is the same logic implemented in vectorized form (which is usually more efficient on large datasets):
import numpy as np
import pandas as pd

# convert DATE to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
# calculate min,max,sum (expanding = running value within each ID)
df[['min','max','sum']] = (
    df
    .groupby('ID')['QTD']
    .expanding()
    .agg(['min','max','sum'])
    .reset_index('ID',drop=True))
# calculate delta (months since the first date of each ID, counting the first month)
df['date_first'] = df.groupby('ID')['DATE'].transform('min')
df['delta'] = (
    (df['DATE'].dt.year - df['date_first'].dt.year) * 12 +
    (df['DATE'].dt.month - df['date_first'].dt.month) + 1)
# calculate sum of squares
df['qtd_sq'] = df['QTD']**2
df['sum_sq'] = df.groupby('ID')['qtd_sq'].cumsum()
# calculate standard deviation (0 when delta is 1, as in the original loop)
df['stdev'] = np.where(
    df['delta']==1,
    0,
    np.sqrt((df['delta']*df['sum_sq'] - df['sum']*df['sum']) / df['delta']))
# calculate means
df['means'] = df['sum'] / df['delta']
# drop temp columns
df = df.drop(columns=['delta','qtd_sq','sum_sq','date_first'])
df
Output:
         ID        DATE  QTD  min  max   sum     stdev     means
0  58685991  2020-06-01    2  2.0  2.0   2.0  0.000000  2.000000
1  58685991  2020-06-01    1  1.0  2.0   3.0  0.000000  3.000000
2  58685991  2020-06-01    0  0.0  2.0   3.0  0.000000  3.000000
3  58685991  2020-12-05    7  0.0  7.0  10.0  6.301927  1.428571
4  57839709  2020-12-01    5  3.0  5.0   5.0  0.000000  5.000000
5  57839709  2021-01-08    3  3.0  5.0   8.0  1.414214  4.000000
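One more thing that may be worth checking in the loop from the question: the running maximum and minimum are held in single variables that are reset only when a new ID first appears, so if the rows of different IDs are interleaved in the large dataset, those values can mix across IDs. Grouping by ID avoids this regardless of row order. The sketch below uses a small hypothetical frame with interleaved IDs and the pandas groupby methods cummax, cummin and cumsum.

```python
import pandas as pd

# Hypothetical data in which the rows of two IDs are interleaved
df = pd.DataFrame({
    'ID':  ['a', 'b', 'a', 'b'],
    'QTD': [5, 1, 2, 9],
})

grouped = df.groupby('ID')['QTD']
df['MAX'] = grouped.cummax()    # running maximum within each ID
df['MIN'] = grouped.cummin()    # running minimum within each ID
df['TOTAL'] = grouped.cumsum()  # running sum within each ID

print(df)
```

On this data the loop from the question would record a running maximum of 2 for the third row, because the variable was last reset by ID 'b', whereas the grouped version keeps 5 for ID 'a'.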