Statsmodels: out-of-sample prediction on new data with transformed features

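I'm curious why I can't reproduce the values that the model predicts.

Consider the following model. I'm trying to understand the relationship between insurance charges, age, and whether the customer smokes.

Notice that the age variable has been preprocessed (mean-centered) inside the formula.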

import numpy as np  # needed because the formula calls np.mean
import pandas as pd
import statsmodels.formula.api as smf

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker',data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0,b2,b1,b3 = params['Intercept'],params['smoker[T.yes]'],params['I(age - np.mean(age))'],params['I(age - np.mean(age)):smoker[T.yes]']
x1 = (insurance['age'] - np.mean(insurance['age']))
# two lines with diff intercept and slopes
y_hat_non = b0 + b1 * x1 
y_hat_smok = (b0  + b2) + (b1 + b3) * x1

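Now, when I create new data and call the predict method, I get values that differ from the ones I compute by hand. Taking index 0 and index 2 as examples, I expected my hand-computed values to be close to the predict output shown below, but they are far off.

Am I missing something about the feature transformation that was applied when the model was fitted?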

new_data = pd.DataFrame({'age': {0: 19,1: 41,2: 43},'smoker': {0: 'yes',1: 'no',2: 'no'}})

idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4

fit1.predict(new_data)
0    27581.276650
1    10168.273779
2    10702.771604

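Solution

I take it you want to center the age variable. `I(age - np.mean(age))` works when fitting, but at prediction time the expression is re-evaluated on the prediction data frame, so age is centered on the mean of the new data (about 34.3 for your three rows) instead of the training mean.

Also, when you multiply by the coefficients by hand, you have to multiply by the centered value (i.e. age - mean(age), using the training mean), not by the raw age.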
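To make the mismatch concrete, here is a small sketch of the two centerings for the three new rows. The training mean used here (39.207025) is an assumption: it is inferred from the centered values printed in this thread (19 - (-20.207025)), not recomputed from the CSV.

```python
# Ages in the new data frame from the question.
new_ages = [19, 41, 43]

# Training-set mean age. Assumption: inferred from the centered values
# shown in this thread, not recomputed from the insurance CSV.
train_mean = 39.207025

# What I(age - np.mean(age)) effectively does at predict time:
# it centers on the mean of the *new* data frame.
new_mean = sum(new_ages) / len(new_ages)
centered_at_predict = [age - new_mean for age in new_ages]

# What the fitted coefficients expect: centering on the training mean.
centered_on_training = [age - train_mean for age in new_ages]

print("mean of new data:      ", round(new_mean, 4))
print("centered at predict:   ", [round(v, 4) for v in centered_at_predict])
print("centered on train mean:", [round(v, 4) for v in centered_on_training])
```

The hand calculation in the question uses a third input, the raw ages 19 and 43, which is why neither set of numbers agrees with the other.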

There is no harm in creating a separate column that holds the centered age:

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_std=False)

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance['age_c'] = sc.fit_transform(insurance[['age']])

model1 = smf.ols('charges~age_c * smoker',data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0,b2,b1,b3 = params['Intercept'],params['smoker[T.yes]'],params['age_c'],params['age_c:smoker[T.yes]']

You can then predict by applying the already fitted scaler to the age column of the new data:

new_data = pd.DataFrame({'age': {0: 19,1: 41,2: 43},'smoker': {0: 'yes',1: 'no',2: 'no'}})

new_data['age_c'] = sc.transform(new_data[['age']])

new_data

   age smoker      age_c
0   19    yes -20.207025
1   41     no   1.792975
2   43     no   3.792975

Check:

idx_0 = (b0+b2) + (b1+b3) * -20.207025
# 26093.64269247414
idx_2 = b0 + b1 * 3.792975
# 9400.282805032146

fit1.predict(new_data)
Out[13]: 
0    26093.642567
1     8865.784870
2     9400.282695
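If you would rather keep the centering inside the formula, one option that may work is patsy's built-in stateful transform `center()`, which is meant to remember the mean seen at fit time and reuse it at predict time. The sketch below is not from the original answer; it uses synthetic data (all values hypothetical) in place of the insurance CSV so that it runs offline, and it assumes the default patsy formula engine.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the insurance data (hypothetical values).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=n).astype(float),
    "smoker": rng.choice(["yes", "no"], size=n),
})
train["charges"] = (
    2000.0
    + 250.0 * train["age"]
    + 20000.0 * (train["smoker"] == "yes")
    + rng.normal(0.0, 1000.0, size=n)
)

# center() is a stateful transform: the mean is memorized during fitting.
fit = smf.ols("charges ~ center(age) * smoker", data=train).fit()

new_data = pd.DataFrame({
    "age": [19.0, 41.0, 43.0],
    "smoker": ["yes", "no", "no"],
})
pred = fit.predict(new_data)

# Manual prediction: center on the TRAINING mean, then apply the coefficients.
p = fit.params
x = new_data["age"] - train["age"].mean()
is_smoker = (new_data["smoker"] == "yes").astype(float)
manual = (
    p["Intercept"]
    + p["smoker[T.yes]"] * is_smoker
    + p["center(age)"] * x
    + p["center(age):smoker[T.yes]"] * x * is_smoker
)

print(pd.DataFrame({"predict": pred, "manual": manual}))
```

If the two columns agree, the formula is reusing the training mean, and no extra `age_c` column or scaler object is needed at prediction time.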