Multivariate second-order polynomial regression in Python

How to solve multivariate second-order polynomial regression in Python

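I am working on a multivariate regression problem. My dataset has the form X = (nsample, nx) and Y = (nsample, ny). nx and ny may vary from one dataset to another depending on the case being studied, so the code should treat them generically.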

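I want to determine the coefficients of the multivariate polynomial regression that minimize the root-mean-square error. I want to split the problem into ny separate regressions, so for each one my dataset is X = (nsample, nx) and Y = (nsample, 1). For each dependent variable (Uj), the second-order polynomial then has the following form: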
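Reconstructed from the comments in the code that follows (bias, beta_i, beta_ii, beta_ik), so the exact notation is an assumption, the polynomial for one dependent variable would read:

```latex
U_j = \beta_0
    + \sum_{i=1}^{n_x} \beta_i x_i
    + \sum_{i=1}^{n_x} \beta_{ii} x_i^2
    + \sum_{i=1}^{n_x} \sum_{k=i+1}^{n_x} \beta_{ik} x_i x_k
```

This gives 1 + nx + nx(nx+1)/2 coefficients per dependent variable, i.e. 10 for nx = 3.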

I coded the function in Python as:

def func(x,nx,pars0,pars1,pars2):
  y = pars0 #pars0 = bias
  for i in range(nx):
    y = y + pars1[i]*x[i] #pars1 linear coeff (beta_i in the equation)
    for j in range(nx):
        if (j < i ):
            continue
        y = y + pars2[i,j]*x[i]*x[j] 
        #diag pars2 =  coeff of x^2 (beta_ii in the equation)
        #upper triangle pars2 = coeff of x_i*x_k (beta_ik in the equation)
  return y

and the root-mean-square error as:

def resid(nsample,pars2,x,y):
  res=0.0
  for i in range(nsample):
    y_pred = func(nx,x[i])
    res=res+((y_pred - y[i]) ** 2)
  res=res/nsample
  res=res**0.5
  return res

To determine the coefficients I wanted to use scipy.optimize.minimize, but it does not work (example_1, example_2). Any ideas or suggestions? Should I use sklearn?

-> EDIT: toy test data, nx = 3, ny = 1

0.20    -0.02   0.20    1.0229781
0.20    -0.02   0.40    1.0218807
0.20    -0.02   0.60    1.0220439
0.20    -0.02   0.80    1.0227083
0.20    -0.02   1.00    1.0237960
0.20    -0.02   1.20    1.0255770
0.20    -0.02   1.40    1.0284888
0.20    -0.06   0.20    1.0123552
0.24    -0.02   1.40    1.0295350
0.24    -0.06   0.20    1.0125935
0.24    -0.06   0.40    1.0195798
0.24    -0.06   0.60    1.0124632
0.24    -0.06   0.80    1.0131748
0.24    -0.06   1.00    1.0141751
0.24    -0.06   1.20    1.0153533
0.24    -0.06   1.40    1.0170036
0.24    -0.10   0.20    1.0026915
0.24    -0.10   0.40    1.0058125
0.24    -0.10   0.60    1.0055921
0.24    -0.10   0.80    1.0057868
0.24    -0.10   1.00    1.0014004
0.24    -0.10   1.20    1.0026257
0.24    -0.10   1.40    1.0024578
0.30    -0.18   0.60    0.9748765
0.30    -0.18   0.80    0.9753220
0.30    -0.18   1.00    0.9740970
0.30    -0.18   1.20    0.9727272
0.30    -0.18   1.40    0.9732258
0.30    -0.20   0.20    0.9722360
0.30    -0.20   0.40    0.9687567
0.30    -0.20   0.60    0.9676569
0.30    -0.20   0.80    0.9672319
0.30    -0.20   1.00    0.9682354
0.30    -0.20   1.20    0.9674461
0.30    -0.20   1.40    0.9673747
0.36    -0.02   0.20    1.0272033
0.36    -0.02   0.40    1.0265790
0.36    -0.02   0.60    1.0271688
0.36    -0.02   0.80    1.0277286
0.36    -0.02   1.00    1.0285388
0.36    -0.02   1.20    1.0295619
0.36    -0.02   1.40    1.0310734
0.36    -0.06   0.20    1.0159603
0.36    -0.06   0.40    1.0159753
0.36    -0.06   0.60    1.0161890
0.36    -0.06   0.80    1.0153346
0.36    -0.06   1.00    1.0159790
0.36    -0.06   1.20    1.0167520
0.36    -0.06   1.40    1.0176916
0.36    -0.10   0.20    1.0048287
0.36    -0.10   0.40    1.0034699
0.36    -0.10   0.60    1.0032798
0.36    -0.10   0.80    1.0037224
0.36    -0.10   1.00    1.0059301
0.36    -0.10   1.20    1.0047114
0.36    -0.10   1.40    1.0041287
0.36    -0.14   0.20    0.9926268
0.40    -0.08   0.80    1.0089013
0.40    -0.08   1.20    1.0096265
0.40    -0.08   1.40    1.0103305
0.40    -0.10   0.20    1.0045464
0.40    -0.10   0.40    1.0041031
0.40    -0.10   0.60    1.0035650
0.40    -0.10   0.80    1.0034553
0.40    -0.10   1.00    1.0034699
0.40    -0.10   1.20    1.0030276
0.40    -0.10   1.40    1.0035284
0.40    -0.10   1.60    1.0042166
0.40    -0.14   0.20    0.9924336
0.40    -0.14   0.40    0.9914971
0.40    -0.14   0.60    0.9910082
0.40    -0.14   0.80    0.9903772
0.40    -0.14   1.00    0.9900816

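Solution

Minimizing error is a huge, complex problem. As such, a lot of very clever people have come up with a lot of cool solutions. Here are a few:

(Of all of them, I think Bayesian optimization with sklearn might be a good choice for your use case, though I have never used it.)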


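Stochastic methods:

genetic algorithms: format your problem like chromosomes in a genome and "breed" the optimal solution (my personal favorite)

simulated annealing: format your problem like hot metal being annealed, which tries to settle into a stable state while losing heat

random search: better than it sounds. Randomly tests a variety of input values.

grid search: simple to implement, but often less effective than methods that use true randomness (it explores repeatedly along particular axes of interest, a strategy that often wastes computational resources)

Many of these come up in hyperparameter optimization for ML models.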

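More prescriptive methods:

gradient descent: uses the gradient computed from a differentiable function to step toward a local minimum

DeepAR: uses Bayesian optimization, combined with random search, to reduce loss in hyperparameter tuning. I believe this is only available on AWS, but it looks like sklearn has an implementation of Bayesian optimization

scipy.optimize.minimize: I know you are already using this, but there are 15 different algorithms that can be used by changing the method flag.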
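scipy.optimize.minimize optimizes over a single flat 1-D vector, so the bias, linear and quadratic coefficients have to be packed into one array before it can be used. A minimal sketch of that packing, using synthetic data in place of the toy set; the helper names design_matrix, mse and mse_grad are hypothetical and do not come from the question:

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.optimize import minimize


def design_matrix(X):
    """Columns: 1, x_i, then x_i*x_k for all i <= k (hypothetical helper)."""
    n_samples, nx = X.shape
    cols = [np.ones(n_samples)]
    cols += [X[:, i] for i in range(nx)]
    cols += [X[:, i] * X[:, k]
             for i, k in combinations_with_replacement(range(nx), 2)]
    return np.column_stack(cols)


def mse(beta, A, y):
    """Mean squared error; it has the same minimiser as the RMSE."""
    r = A @ beta - y
    return np.mean(r ** 2)


def mse_grad(beta, A, y):
    """Analytic gradient of the mean squared error."""
    return 2.0 * A.T @ (A @ beta - y) / len(y)


# Synthetic stand-in for the real data (an assumption, not the toy set).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
A = design_matrix(X)
true_beta = rng.normal(size=A.shape[1])
y = A @ true_beta + 0.01 * rng.normal(size=len(X))

# One flat vector of 10 coefficients is optimised in a single call.
result = minimize(mse, x0=np.zeros(A.shape[1]), args=(A, y),
                  jac=mse_grad, method="BFGS")
rmse = np.sqrt(result.fun)
print(result.success, rmse)
```

The squared error is minimised instead of its square root because both share the same optimum while the squared form stays smooth near a perfect fit; the RMSE is recovered from result.fun afterwards.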

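The rub

While error minimization is conceptually simple, in practice complex error topologies in high-dimensional spaces are very hard to traverse efficiently. It touches on local and global extrema, the explore/exploit problem, and our mathematical understanding of computational complexity. Good error reduction is usually achieved through a combination of a thorough understanding of the problem and experimentation with multiple algorithms and hyperparameters. In ML this is often called hyperparameter tuning, and it is a sort of "meta" error-reduction step, if you will.

Note: feel free to recommend more optimization methods and I will add them to the list.

I have an example using simulated annealing, as mentioned in the nice list in this thread.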

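First, I need to load the data and define the objective function. I saved your data in data.csv and loaded it with

import numpy as np
import pandas as pd
# The columns are separated by runs of spaces, so split on any whitespace.
data = pd.read_csv("../data.csv", sep=r"\s+", header=None)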

and extracted your values using

X = data[ [0,1,2] ].values
Y = data[ 3 ].values

I defined your polynomial function as

from itertools import combinations

def poly_function(X,beta):
    X_dimension = X.shape[1]

    i,j = zip( *list(combinations( range(X_dimension),2)) )
    X_cross = X[:,i] * X[:,j] 
    X_expanded = np.concatenate([X,X**2,X_cross],axis=1)
    
    assert X_expanded.shape[1] == beta.shape[0],"Expect beta to be of size {}".format(X_expanded.shape[1])
    
    return np.matmul(X_expanded,beta)

For simulated annealing we only need the objective

def obj(beta,X=X,Y=Y):
    
    Y_hat = poly_function(X,beta)
    
    BOOSTER = 10**5
    
    return BOOSTER * np.mean( (Y-Y_hat)**2 )**.5

and some neighbour proposals

def small_delta(beta):
    new_beta = beta.copy()
    
    random_index = np.random.randint(0,new_beta.shape[0])
    
    new_beta[ random_index ] += (np.random.random() - .5) * .01
 
    return new_beta

def large_delta(beta):
    new_beta = beta.copy()
    
    random_index = np.random.randint(0,new_beta.shape[0])
    
    new_beta[ random_index ] += np.random.random() - .5 
 
    return new_beta

a random start

def random_beta():
    return np.random.random(size=9)

and SA with

import frigidum


local_opt = frigidum.sa(random_start=random_beta,neighbours=[small_delta,large_delta],objective_function=obj,T_start=10**2,T_stop=10**-12,repeats=10**3,copy_state=frigidum.annealing.copy)

The RMSE I found on your data is around 0.026254, with beta

array([ 7.73168440e+00,2.93929578e+00,4.10133180e-02,-1.37266444e+01,-3.43978686e+00,-1.12816177e-02,-1.00262307e+01,-3.12327590e-02,9.07369588e-02])

where you need to know that the ordering is (X1,X2,X3,X1**2,X2**2,X3**2,X1*X2,X1*X3,X2*X3)

A longer run with more repeats got me an error of 0.026150 with beta

array([ 7.89212770e+00,3.24138652e+00,1.24436937e-02,-1.41549553e+01,-3.31912739e+00,-5.54411310e-03,-1.08317125e+01,2.09684769e-02,6.84396750e-02])
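Since the model is linear in beta, an annealing result like this can be cross-checked against ordinary least squares, which gives the global minimum of the RMSE directly. A minimal sketch under assumptions: it uses synthetic data instead of data.csv, keeps the same column ordering and the same absence of an intercept as poly_function, and the helper name expand is hypothetical:

```python
import numpy as np
from itertools import combinations


def expand(X):
    """Columns ordered as X, X**2, then pairwise products (hypothetical helper)."""
    i, j = map(list, zip(*combinations(range(X.shape[1]), 2)))
    return np.concatenate([X, X ** 2, X[:, i] * X[:, j]], axis=1)


# Synthetic stand-in for the real data (an assumption).
rng = np.random.default_rng(1)
X = rng.uniform(0.2, 1.4, size=(80, 3))
true_beta = rng.normal(size=9)
Y = expand(X) @ true_beta

# Least squares solves min ||X_expanded @ beta - Y||^2 in closed form.
beta, _, rank, _ = np.linalg.lstsq(expand(X), Y, rcond=None)
rmse = np.sqrt(np.mean((expand(X) @ beta - Y) ** 2))
print(rank, rmse)
```

If the least-squares RMSE on the real data comes out lower than the annealing result, the annealer had not yet reached the optimum.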

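You can try the statsmodels library, together with the explanation at this link, to fit a polynomial model. https://ostwalprasad.github.io/machine-learning/Polynomial-Regression-using-statsmodel.html

After some trial and error, I finally came up with a solution. With a change of variables, the problem can be treated as a linear one. I used scikit-learn to build the model. After some tests on real cases, it works very well.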
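The post does not include the code, so the following is only a guess at what such a change of variables might look like with scikit-learn: PolynomialFeatures generates the squared and cross terms as new variables, and LinearRegression fits them, handling any nx and ny in one call. Synthetic data stands in for the real cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): nx = 3 inputs, ny = 2 outputs.
rng = np.random.default_rng(2)
nsample, nx, ny = 120, 3, 2
X = rng.uniform(-1.0, 1.0, size=(nsample, nx))
Y = np.column_stack([
    1.0 + 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 2],
    -0.5 + X[:, 2] + 3.0 * X[:, 1] * X[:, 2],
])

# degree=2 adds the x_i**2 and x_i*x_k columns; the intercept is left to
# LinearRegression, hence include_bias=False.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, Y)

rmse = np.sqrt(np.mean((model.predict(X) - Y) ** 2, axis=0))  # one per output
coef = model.named_steps["linearregression"].coef_            # shape (ny, 9)
print(rmse, coef.shape)
```

The fitted biases are available as the intercept_ attribute of the same LinearRegression step, one per dependent variable.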