How to improve the efficiency of an entropy weight method implementation in Python
Below is my code, but it is very slow on large data: for a DataFrame with 5,000,000 rows and 6 columns it can take more than a day.
How can I optimize it? Thanks a lot.
def ewm(df):
    df = df.apply(lambda x: ((x - np.min(x)) / (np.max(x) - np.min(x))))
    rows, cols = df.shape
    k = 1.0 / math.log(rows)
    lnf = [[None] * cols for i in range(rows)]
    for i in range(0, rows):
        for j in range(0, cols):
            if df.iloc[i][j] == 0:
                lnfij = 0.0
            else:
                p = df.iloc[i][j] / df.iloc[:, j].sum()
                lnfij = math.log(p) * p * (-k)
            lnf[i][j] = lnfij
    lnf = pd.DataFrame(lnf)
    d = 1 - lnf.sum(axis=0)
    w = [[None] * 1 for i in range(cols)]
    for j in range(0, cols):
        wj = d[j] / sum(d)
        w[j] = wj
    w = pd.DataFrame(w)
    w = w.round(5)  #.applymap(lambda x:format(x,'.5f'))
    w.index = df.columns
    w.columns = ['weight']
    return w
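For reference, the function above appears to compute the following quantities. This is a reading of the code, not something the original poster stated, and the symbols (x, x', p, e, d, w, n rows, m columns) are chosen here for illustration:

```latex
\begin{aligned}
x'_{ij} &= \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}, \\
p_{ij}  &= \frac{x'_{ij}}{\sum_{i=1}^{n} x'_{ij}}, \\
e_j     &= -k \sum_{i=1}^{n} p_{ij} \ln p_{ij}, \qquad k = \frac{1}{\ln n}, \qquad 0 \ln 0 := 0, \\
d_j     &= 1 - e_j, \qquad w_j = \frac{d_j}{\sum_{j=1}^{m} d_j}.
\end{aligned}
```

Here `lnf[i][j]` holds the single term -k p_ij ln p_ij, so `lnf.sum(axis=0)` is e_j. Each quantity depends only on whole columns, which is why the cell-by-cell loop is not strictly required.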
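Solution
Use iat instead of iloc when fetching a single value, and if you perform the same iloc lookup twice, store the result in a temporary variable.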
import pandas as pd
import time
import numpy as np
import math
#original method
def ewm(df):
    df = df.apply(lambda x: ((x - np.min(x)) / (np.max(x) - np.min(x))))
    rows, cols = df.shape
    k = 1.0 / math.log(rows)
    lnf = [[None] * cols for i in range(rows)]
    for i in range(0, rows):
        for j in range(0, cols):
            if df.iloc[i][j] == 0:
                lnfij = 0.0
            else:
                p = df.iloc[i][j] / df.iloc[:, j].sum()
                lnfij = math.log(p) * p * (-k)
            lnf[i][j] = lnfij
    lnf = pd.DataFrame(lnf)
    d = 1 - lnf.sum(axis=0)
    w = [[None] * 1 for i in range(cols)]
    for j in range(0, cols):
        wj = d[j] / sum(d)
        w[j] = wj
    w = pd.DataFrame(w)
    w = w.round(5)  #.applymap(lambda x:format(x,'.5f'))
    w.index = df.columns
    w.columns = ['weight']
    return w
#modified method
def ewm1(df):
    df = df.apply(lambda x: ((x - np.min(x)) / (np.max(x) - np.min(x))))
    rows, cols = df.shape
    k = 1.0 / math.log(rows)
    lnf = [[None] * cols for i in range(rows)]
    for i in range(0, rows):
        for j in range(0, cols):
            tmp = df.iat[i, j]  #********************************* modified section
            if tmp == 0:
                lnfij = 0.0
            else:
                p = tmp / df.iloc[:, j].sum()  #************************ end of modified
                lnfij = math.log(p) * p * (-k)
            lnf[i][j] = lnfij
    lnf = pd.DataFrame(lnf)
    d = 1 - lnf.sum(axis=0)
    w = [[None] * 1 for i in range(cols)]
    for j in range(0, cols):
        wj = d[j] / sum(d)
        w[j] = wj
    w = pd.DataFrame(w)
    w = w.round(5)  #.applymap(lambda x:format(x,'.5f'))
    w.index = df.columns
    w.columns = ['weight']
    return w
df = pd.DataFrame(np.random.rand(1000,6))
start = time.time()
ewm(df)
print(time.time()-start)
start1 = time.time()
ewm1(df)
print(time.time()-start1)
The first function took 1.9747240543365479 seconds.
The second took 0.820796012878418 seconds.
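A single time.time() measurement can be noisy, so it may be worth repeating each call a few times and keeping the fastest run. The helper below is a minimal sketch of that idea; `best_time` is a hypothetical name introduced here, not part of the original answer.

```python
import time

def best_time(func, *args, repeat=3):
    """Return the fastest wall-clock time (seconds) over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical usage with the two functions and DataFrame defined above:
# print(best_time(ewm, df), best_time(ewm1, df))
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the comparison.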
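I'm not sure what this method does, but if you can break it into several functions that return numeric values, you could hash (cache) their results and improve it further.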
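One concrete instance of reusing a computed value: in the loops above, df.iloc[:, j].sum() is evaluated again for every cell, although it depends only on the column. The sketch below keeps the explicit loops but computes the column sums once and reads scalars from a plain NumPy array. The name `ewm_hoisted` is hypothetical, and the function assumes every column is numeric and not constant (otherwise the min-max step divides by zero).

```python
import math
import numpy as np
import pandas as pd

def ewm_hoisted(df):
    """Loop-based entropy weights with the loop-invariant work done once."""
    # Min-max normalize each column, as in the original function.
    df = df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
    rows, cols = df.shape
    k = 1.0 / math.log(rows)
    values = df.to_numpy(dtype=float)  # plain array: cheap scalar access
    col_sums = values.sum(axis=0)      # computed once, not once per cell
    entropy = [0.0] * cols             # running sum of -k * p * ln(p) per column
    for i in range(rows):
        for j in range(cols):
            v = values[i, j]
            if v != 0:                 # a zero cell contributes nothing
                p = v / col_sums[j]
                entropy[j] += -k * p * math.log(p)
    d = [1 - e for e in entropy]
    total = sum(d)
    w = pd.DataFrame({'weight': [dj / total for dj in d]}, index=df.columns)
    return w.round(5)
```

This still visits every cell in Python, so it is only a partial fix; it mainly removes the repeated column sums and the per-cell pandas indexing.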
Replacing the Python loops with NumPy's vectorized operations can speed this up considerably.
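import math
import numpy as np
import pandas as pd
def ewm(df):
    df = df.apply(lambda x: ((x - np.min(x)) / (np.max(x) - np.min(x))))
    rows, cols = df.shape
    k = 1.0 / math.log(rows)
    x = df.to_numpy(dtype=float)
    p = x / x.sum(axis=0)
    # take log(p) only where the value is nonzero; `out` supplies 0.0 elsewhere,
    # because `where` without `out` leaves those entries uninitialized
    logp = np.log(p, out=np.zeros_like(p), where=(x != 0))
    lnf = -logp * p * k
    d = 1 - lnf.sum(axis=0)
    w = d / d.sum()
    w = pd.DataFrame(w)
    w = w.round(5)
    w.index = df.columns
    w.columns = ['weight']
    return w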