
Error during fitting: Found input variables with inconsistent numbers of samples

How to fix the fitting error "Found input variables with inconsistent numbers of samples"

My code

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Set options to display all the rows and columns in the dataset. If there are more rows, adjust the numbers accordingly.
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Pandas needs the column defined as a date before it is imported, so we name the date column here
# and pass it to parse_dates.
date_col = ['Date']
df = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\Historical Data\Concat_Cleaned.csv',
    parse_dates=date_col, skiprows=0, low_memory=False)

# Converting/defining the columns
# Before you define column types, you need to fill all NaN with a value. We will be reconverting them later
df = df.fillna(101)
# Defining column types
convert_dict = {
    'League_Division': str, 'HomeTeam': str, 'AwayTeam': str,
    'Full_Time_Home_Goals': int, 'Full_Time_Away_Goals': int, 'Full_Time_Result': str,
    'Half_Time_Home_Goals': int, 'Half_Time_Away_Goals': int, 'Half_Time_Result': str,
    'Attendance': int, 'Referee': str, 'Home_Team_Shots': int,
    'Away_Team_Shots': int, 'Home_Team_Shots_on_Target': int, 'Away_Team_Shots_on_Target': int,
    'Home_Team_Hit_Woodwork': int, 'Away_Team_Hit_Woodwork': int, 'Home_Team_Corners': int,
    'Away_Team_Corners': int, 'Home_Team_Fouls': int, 'Away_Team_Fouls': int,
    'Home_Offsides': int, 'Away_Offsides': int, 'Home_Team_Yellow_Cards': int,
    'Away_Team_Yellow_Cards': int, 'Home_Team_Red_Cards': int, 'Away_Team_Red_Cards': int,
    'Home_Team_Bookings_Points': float, 'Away_Team_Bookings_Points': float,
}

df = df.astype(convert_dict)

# Reverting the replace step to get the original NaN values back while keeping the defined dtypes
df = df.replace('101', np.nan, regex=True)
df = df.replace(101, np.nan, regex=True)

# Clean dataset by dropping null rows
data = df.dropna(axis=0)

# Column that you want to predict = y
y = data.Full_Time_Home_Goals

# Columns that are fed into the model to make predictions (the features); they cannot include column y
features = ['HomeTeam', 'AwayTeam', 'Full_Time_Away_Goals', 'Full_Time_Result']
# Create X
X = data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
soccer_model = DecisionTreeRegressor(random_state=1)

# Define and train a OneHotEncoder to transform the categorical features into a numeric array
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_X, train_y)

transformed_train_X = enc.transform(train_X)
transformed_val_X = enc.transform(val_X)

# Fit Model
soccer_model.fit(transformed_train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = soccer_model.predict(transformed_val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE when not specifying max_leaf_nodes: {:,.5f}".format(val_mae))

# Using best value for max_leaf_nodes
data_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
data_model.fit(transformed_train_X, train_y)
val_predictions = data_model.predict(transformed_val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE for best value of max_leaf_nodes: {:,.5f}".format(val_mae))

# Build a Random Forest model and train it on all of X and y.
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor()
# Fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(transformed_train_X, train_y)

# path to file you will use for predictions
date_col_n = ['Date']
test_data = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\EPL_2021_TiMetable.csv',
    parse_dates=date_col_n, low_memory=False)
# columns = ['Home_Team','Away_Team']
test_data['ID'] = np.arange(len(test_data))
# Define the columns we want to use for prediction
columns = ['Home_Team', 'Away_Team']
test_data = test_data[columns]
# Renaming column names to match the training dataset
test_data = test_data.rename({'Home_Team': 'HomeTeam', 'Away_Team': 'AwayTeam'}, axis=1)
# Adding NaN columns to dataset to match the training dataset
test_data['Full_Time_Result'] = np.nan
test_data['Full_Time_Away_Goals'] = np.nan

# Aligning the dataframe to the features defined for the model
test_data_features = test_data[features]
# Filling all NA values as the encoder cannot handle NaN values
df = test_data.fillna(1)
# Exploration
print(df)

# Define Y for fitting
Y = df

# We need to encode and transform the dataset, so we have converted all NaN to 1 and are defining a new
# split; since the val_X names are confusing, we will use an n_ prefix
train_n_X, val_n_X, train_n_y, val_n_y = train_test_split(Y, y, random_state=1)
enc.fit(train_n_X, train_n_y)
transformed_train_n_X = enc.transform(train_n_X)
transformed_val_n_X = enc.transform(val_n_X)
rf_model_on_full_data.fit(transformed_train_n_X, train_n_y)

Error

Traceback (most recent call last):
  File "C:/Users/harsh/PycharmProjects/Learn Machine Learning/Attempt1/Working with soccer data attempt 1", line 137, in <module>
    train_n_X,random_state=1)
  File "C:\Users\harsh\PycharmProjects\Learn Machine Learning\venv\lib\site-packages\sklearn\model_selection\_split.py", line 2127, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\harsh\PycharmProjects\Learn Machine Learning\venv\lib\site-packages\sklearn\utils\validation.py", line 292, in indexable
    check_consistent_length(*result)
  File "C:\Users\harsh\PycharmProjects\Learn Machine Learning\venv\lib\site-packages\sklearn\utils\validation.py", line 256, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [380, 6643]

Files: Concat_Cleaned dataset, test_data

Questions:

  1. Would changing the array sizes help in this case?
  2. If the model being trained can handle NaN values, why can't it handle the NaN values from my test_data?

Solution

The mistake I made here was splitting the new dataset according to the dimensions of the training dataset in train_n_X, val_n_X, train_n_y, val_n_y = train_test_split(Y, y, random_state=1).

The correct split should be based on the new dataset. So your code should be:

train_n_X, val_n_X, train_n_y, val_n_y = train_test_split(Y, ny, random_state=1)

where ny has the same length as the new dataset.

P.S. This code still needs modification, because the dimensions of your training dataset differ from those of the prediction dataset.
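To make that P.S. concrete, one possible way to rework the final block (a sketch, not the author's finished code) is to skip the second train_test_split entirely: the new fixtures are prediction input, not training data, so they only need to be transformed with the encoder that was already fitted on the training features and then passed to the trained Random Forest. The names below (enc, rf_model_on_full_data, test_data_features) come from the code above; the fillna(1) placeholders for the unknown result/goal columns are kept only so the encoder accepts the frame, and handle_unknown='ignore' simply encodes them as all-zero columns, so the feature set itself still needs rethinking.

# No train_test_split on the fixture list: it is prediction data, not training data.
# Reuse the encoder fitted on train_X earlier; do not refit it on the 380 fixtures.
transformed_test_X = enc.transform(test_data_features.fillna(1))

# rf_model_on_full_data was already fitted on the encoded training data above,
# so we only call predict here.
predicted_home_goals = rf_model_on_full_data.predict(transformed_test_X)
print(predicted_home_goals[:10])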
