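Do I have a data leak, and how do I fix it?
I am trying to predict the EUR/USD closing price one day ahead, and I have built a basic model using a pipeline as a starting point. However, the results are too good to be true. I am sure there is a data leak somewhere, but I cannot find it.
Here is the code that builds the pipeline and runs the model: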
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

# Preprocessing steps (StringToFloat, SeriesToSupervised and
# RemoveCurrentFeatures are custom transformers)
estimators = []
estimators.append(("strings_to_floats", StringToFloat(string_features)))
estimators.append(("series_supervised", SeriesToSupervised(n_in)))
estimators.append(("current_feature_remove", RemoveCurrentFeatures(features=current_features_remove)))
# Model pipeline
estimators.append(("SGD", SGDRegressor(max_iter=50000, tol=1e-3)))
model = Pipeline(estimators)
# Evaluate Pipeline
model.fit(train_X_, train_y_)
predictions = model.predict(test_X_)
import pandas as pd

def series_to_supervised(self, data, n_in=5, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    old_names = data.columns
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data, columns=old_names)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('%s(t-%d)' % (old_names[j], i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('%s' % (old_names[j])) for j in range(n_vars)]
        else:
            names += [('%s' % (old_names[j])) for j in range(n_vars)]
    # Remove spaces in names
    for i in range(len(names)):
        names[i] = names[i].strip()
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
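To see what this kind of framing produces, here is a minimal sketch on a made-up five-row frame. The `lag_frame` helper and the toy prices are assumptions for illustration only, not the code above. It suggests that `shift(i)` only looks into the past when the rows are sorted oldest-first; on a newest-first frame the same call pulls in the next day's value.

```python
import pandas as pd


def lag_frame(df, n_in=1):
    """Hypothetical helper: prepend lagged copies of every column, then drop NaN rows."""
    cols = [df.shift(i).add_suffix(f"(t-{i})") for i in range(n_in, 0, -1)]
    cols.append(df)  # current-day columns keep their original names
    return pd.concat(cols, axis=1).dropna()


# Toy prices indexed by day number, sorted oldest-first (made-up values).
prices = pd.DataFrame({"Price": [1.10, 1.11, 1.12, 1.13, 1.14]}, index=[1, 2, 3, 4, 5])

oldest_first = lag_frame(prices, n_in=1)
newest_first = lag_frame(prices.iloc[::-1], n_in=1)

# Oldest-first: the "lag" for day 3 is day 2's price (a genuine past value).
past_value = oldest_first.loc[3, "Price(t-1)"]
# Newest-first: the "lag" for day 3 is day 4's price (a future value, i.e. a leak).
future_value = newest_first.loc[3, "Price(t-1)"]
print(past_value, future_value)
```

If the real data arrives newest-first, the row order at the moment the framing runs is worth checking before anything else.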
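RemoveCurrentFeatures simply iterates over this list: ["Open","High","Low","Change %","Price"] and drops those columns.
The dataset starts with the columns in the list above plus "Date". After data preparation, the DataFrame's columns have the form "Price(t-n_in)", where n_in is the number of days the data is lagged.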
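For reference, a step like the one described might look roughly like the sketch below. This is an assumption about its shape, not the actual class: the toy column values are made up, and a version meant for a scikit-learn `Pipeline` would normally also inherit from `BaseEstimator` and `TransformerMixin`. The point of the sketch is the final check that no same-day column survives the drop.

```python
import pandas as pd


class RemoveCurrentFeatures:
    """Hypothetical sketch of a step that drops same-day (unlagged) columns."""

    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn; present so the object behaves like a transformer.
        return self

    def transform(self, X):
        # DataFrame.drop raises by default if a name is missing,
        # e.g. because of stray whitespace in a column header.
        return X.drop(columns=list(self.features))


# Toy frame in the post-framing layout: lagged columns plus same-day columns.
framed = pd.DataFrame(
    {
        "Price(t-1)": [1.10, 1.11],
        "Open(t-1)": [1.09, 1.10],
        "Open": [1.10, 1.12],
        "Price": [1.11, 1.12],
    }
)
step = RemoveCurrentFeatures(features=["Open", "Price"])
remaining = step.fit(framed).transform(framed)

# Any column without a "(t-" suffix would be a same-day value left in the features.
unlagged = [c for c in remaining.columns if "(t-" not in c]
print(list(remaining.columns), unlagged)
```

Running the same `unlagged` check on the real feature matrix just before it reaches the regressor is a cheap way to rule out one source of leakage.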
Any help would be appreciated. I have been stuck on this for a while, and I am sure something is wrong here.
Edit: here is how I do the train/test split:
# Invert dataframe
data = data.iloc[::-1]
# Split each set into train and test sets
names = data.columns.values
dataFrame_train = pd.DataFrame(data[:int(data.shape[0]*train_test_split)],columns=names)
train_X = dataFrame_train#.iloc[:,0:-1]
train_y = dataFrame_train["Price"]
train_y = train_y.tail(train_y.shape[0] - n_in)
dataFrame_test = pd.DataFrame(data[int(data.shape[0]*train_test_split):],columns=names)
test_X = dataFrame_test#.iloc[:,0:-1]
test_y = dataFrame_test["Price"]
test_y = test_y.tail(test_y.shape[0] - n_in)
dataFrame_test = dataFrame_test.tail(dataFrame_test.shape[0] - n_in)
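One sanity check that may help narrow this down is a naive persistence baseline, which predicts each day's close as the previous day's close. Daily closes often move very little from one day to the next, so a model that is given Price(t-1) can look excellent without any leak. The sketch below uses made-up prices and a hypothetical `persistence_mae` helper, and assumes the series is sorted oldest-first.

```python
import pandas as pd


def persistence_mae(prices: pd.Series) -> float:
    """Mean absolute error of predicting each close with the previous close."""
    previous = prices.shift(1)  # assumes rows are sorted oldest-first
    return float((prices - previous).abs().dropna().mean())


# Made-up closing prices for illustration only.
closes = pd.Series([1.100, 1.102, 1.101, 1.105, 1.103])
baseline = persistence_mae(closes)
print(round(baseline, 5))
```

Computing the same quantity on `test_y` and comparing it with the mean absolute error of `predictions` gives a rough reference point: an error close to the baseline points toward persistence, while an error far below it would make a leak more plausible.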