How do I use GridSearchCV with polynomials of different degrees?

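What I want to do is loop over OLS fits with polynomials of different degrees to see which degree performs best at predicting mpg from horsepower (using both LOOCV and KFold). I wrote the code, but I couldn't figure out how to apply GridSearchCV together with PolynomialFeatures at each iteration, so I ended up writing this: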

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut,KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/auto/Auto.csv')[['horsepower','mpg']].dropna()

pows = range(1,11)
first,second,mse = [],[],0     # 'first' is data for the first plot and 'second' is for the second one

for p in pows:
    mse = 0
    for train_index,test_index in LeaveOneOut().split(df):
        x_train,x_test = df.horsepower.iloc[train_index],df.horsepower.iloc[test_index]
        y_train,y_test = df.mpg.iloc[train_index],df.mpg.iloc[test_index]
        polynomial_features = PolynomialFeatures(degree = p)
        x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   # polynomial features for the training data
        ft = LinearRegression().fit(x,y_train)
        x1 = polynomial_features.transform(x_test.values.reshape(-1,1))   # same transform applied to the test data
        mse += mean_squared_error(y_test,ft.predict(x1))
    first.append(mse/len(df))

for p in pows:
    temp = []
    for i in range(9):      # this is to plot a few graphs for comparison
        mse = 0
        for train_index,test_index in KFold(n_splits=10,shuffle=True).split(df):
            x_train,x_test = df.horsepower.iloc[train_index],df.horsepower.iloc[test_index]
            y_train,y_test = df.mpg.iloc[train_index],df.mpg.iloc[test_index]
            polynomial_features = PolynomialFeatures(degree = p)
            x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   # polynomial features for the training data
            ft = LinearRegression().fit(x,y_train)
            x1 = polynomial_features.transform(x_test.values.reshape(-1,1))   # same transform applied to the test data
            mse += mean_squared_error(y_test,ft.predict(x1))
        temp.append(mse/10)
    second.append(temp)


f,pt = plt.subplots(1,2,figsize=(12,5.1))
f.tight_layout(pad=5.0)
pt[0].set_ylim([14,30])
pt[1].set_ylim([14,30])
pt[0].plot(pows,first,color='darkblue',linewidth=1)
pt[0].scatter(pows,first,color='darkblue')
pt[1].plot(pows,second)
pt[0].set_title("LOOCV",fontsize=15)
pt[1].set_title("10-fold CV",fontsize=15)
pt[0].set_xlabel('Degree of polynomial',fontsize=15)
pt[1].set_xlabel('Degree of polynomial',fontsize=15)
pt[0].set_ylabel('Mean Squared Error',fontsize=15)
pt[1].set_ylabel('Mean Squared Error',fontsize=15)
plt.show()

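It produces two plots of mean squared error against polynomial degree, one for LOOCV and one for 10-fold CV.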

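This works fine, and you can run it on your own machine to test it. It does what I want, but it seems like far too much code. What I'm really asking for is advice on how to improve it with GridSearchCV or some other approach. I tried putting PolynomialFeatures in a pipeline with LinearRegression(), but I couldn't change x on the fly. A working example would be much appreciated.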

Solution

Something like this seems to do the trick:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[
    ('poly',PolynomialFeatures(include_bias=False)),
    ('model',LinearRegression()),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={'poly__degree': list(pows)},
    scoring='neg_mean_squared_error',
    cv=LeaveOneOut(),
)

search.fit(df[['horsepower']],df.mpg)

first = -search.cv_results_['mean_test_score']

(The minus sign in the last line is there because the scorer returns negative MSE.)

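The plotting can then be done in more or less the same way. (Here we rely on cv_results_ listing its entries in the same order as pows; you may prefer to plot from the appropriate columns of pd.DataFrame(search.cv_results_).)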
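To avoid depending on the ordering of cv_results_, one option is to read the degree and the score from the same row. The following is a minimal sketch of that idea. It uses a small synthetic data set in place of the Auto data and a shorter degree range so that it runs quickly and offline; both are assumptions made for illustration, and the names `results` and `mse_by_degree` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=60)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 2, size=60)
df = pd.DataFrame({'horsepower': horsepower, 'mpg': mpg})

pows = range(1, 5)  # shorter range than in the question, to keep the run fast

pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
])
search = GridSearchCV(
    estimator=pipe,
    param_grid={'poly__degree': list(pows)},
    scoring='neg_mean_squared_error',
    cv=LeaveOneOut(),
)
search.fit(df[['horsepower']], df.mpg)

# One row per candidate; the degree sits in the 'param_poly__degree' column.
results = pd.DataFrame(search.cv_results_)
results['degree'] = results['param_poly__degree'].astype(int)
results['mse'] = -results['mean_test_score']  # undo the sign of the scorer

# Index by degree so the x and y values for the plot always stay paired.
mse_by_degree = results.set_index('degree')['mse'].sort_index()
```

With this, `mse_by_degree.index` and `mse_by_degree.values` can be passed to `pt[0].plot` in place of `pows` and `first`.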

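You can use RepeatedKFold to emulate the loop over KFold, although then you only get one plot. If you really need separate plots, you still need the outer loop, but a grid search with cv=KFold(...) can replace the inner loop.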
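Both variants might look roughly like the sketch below. As before, the data are synthetic and the degree range is shortened (assumptions made so the example runs quickly and offline), and the helper `cv_mse` and the `random_state` seeds are hypothetical choices, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 2, size=100)
df = pd.DataFrame({'horsepower': horsepower, 'mpg': mpg})

pows = range(1, 5)  # shorter range than in the question, to keep the run fast

pipe = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression()),
])

def cv_mse(cv):
    """Return the mean test MSE for each degree in pows under the splitter cv."""
    search = GridSearchCV(
        estimator=pipe,
        param_grid={'poly__degree': list(pows)},
        scoring='neg_mean_squared_error',
        cv=cv,
    )
    search.fit(df[['horsepower']], df.mpg)
    return -search.cv_results_['mean_test_score']

# Separate curves: keep the outer loop, one shuffled 10-fold split per run.
# Transposed to shape (len(pows), 9), the same layout as 'second' in the question.
second = np.array([
    cv_mse(KFold(n_splits=10, shuffle=True, random_state=i))
    for i in range(9)
]).T

# Single curve: RepeatedKFold averages over all 9 x 10 folds at once.
averaged = cv_mse(RepeatedKFold(n_splits=10, n_repeats=9, random_state=0))
```

Here `pt[1].plot(pows, second)` draws nine curves as in the question, while `pt[1].plot(pows, averaged)` draws the single averaged curve.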