Do I have data leakage?

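I am trying to predict the EUR/USD closing price one day ahead, and I have built a basic model with a pipeline as a starting point. However, the results are too good to be true. I am sure I have data leakage somewhere, but I cannot find it.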

Here is the code that builds the pipeline and runs the model:

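from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline

# Preprocessing steps (StringToFloat, SeriesToSupervised and
# RemoveCurrentFeatures are my own transformers)
estimators = []
estimators.append(("strings_to_floats", StringToFloat(string_features)))
estimators.append(("series_supervised", SeriesToSupervised(n_in)))
estimators.append(("current_feature_remove", RemoveCurrentFeatures(features=current_features_remove)))

# Model pipeline
estimators.append(("SGD", SGDRegressor(max_iter=50000, tol=1e-3)))
model = Pipeline(estimators)

# Fit the pipeline and predict on the test set
model.fit(train_X_, train_y_)
predictions = model.predict(test_X_)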

Here is the SeriesToSupervised code:

import pandas as pd

def series_to_supervised(self, data, n_in=5, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Observations as a pandas DataFrame (one column per variable).
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    df = pd.DataFrame(data)
    old_names = df.columns
    n_vars = df.shape[1]
    cols, names = [], []
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['%s(t-%d)' % (old_names[j], i) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        # current and future columns all keep the original names
        names += ['%s' % old_names[j] for j in range(n_vars)]
    # remove surrounding whitespace from the names
    names = [name.strip() for name in names]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

RemoveCurrentFeatures simply iterates over this list: ["Open","High","Low","Change %","Price"] and drops those columns.
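For readers following along, here is a minimal sketch of what such a column-dropping step might look like. The function name `remove_current_features`, the constant `CURRENT_FEATURES`, and the sample values are hypothetical and are not the code used in the pipeline above; the real transformer presumably wraps a similar `drop` call in `fit`/`transform` methods.

```python
import pandas as pd

# Same-day (time t) columns that must not reach the model
CURRENT_FEATURES = ["Open", "High", "Low", "Change %", "Price"]

def remove_current_features(df, features=CURRENT_FEATURES):
    """Return a copy of df without the same-day columns."""
    # errors="ignore" skips names that are not present in df
    return df.drop(columns=list(features), errors="ignore")

# Made-up frame with one lagged column and two same-day columns
frame = pd.DataFrame({
    "Price(t-1)": [1.10, 1.11],
    "Open": [1.11, 1.12],
    "Price": [1.11, 1.12],
})
remaining = remove_current_features(frame)
print(list(remaining.columns))
```

If any same-day column survives this step under a slightly different name (for example with stray whitespace), it would be a direct source of leakage, so printing the remaining column names is a cheap check.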

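The dataset starts with the columns in the list above plus "Date". After data preparation, the DataFrame's columns have the form "Price(t-n_in)", where n_in is the number of days the data is lagged.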
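To make that column layout concrete, here is a small sketch with made-up prices and n_in = 2. It builds the lagged columns with `shift` and then checks that every feature in a row comes from an earlier row than the target, which is the property a leak would break. The values and variable names are illustrative only.

```python
import pandas as pd

n_in = 2
prices = pd.DataFrame({"Price": [1.10, 1.12, 1.11, 1.15, 1.14]})

# Lagged features Price(t-2), Price(t-1); rows with NaN lags are dropped
lagged = pd.concat(
    [prices["Price"].shift(i).rename("Price(t-%d)" % i) for i in range(n_in, 0, -1)],
    axis=1,
).dropna()

# Target: the same-day price for the rows that survived
target = prices["Price"].loc[lagged.index]

# Each feature Price(t-k) must equal the raw price k rows earlier
for k in range(1, n_in + 1):
    expected = prices["Price"].shift(k).loc[lagged.index]
    assert lagged["Price(t-%d)" % k].tolist() == expected.tolist()

# No feature column should be identical to the target
leaky = [c for c in lagged.columns if lagged[c].tolist() == target.tolist()]

print(lagged.columns.tolist())
print(len(lagged), len(target), leaky)
```

The same two checks can be run on the real frame after the preprocessing steps, before the regressor sees it.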

Any help would be greatly appreciated. I have been stuck on this for a while, and I am sure something is wrong here.

Edit: here is how I do the train/test split:

# Invert dataframe
data = data.iloc[::-1]

# Split each set into train and test sets
# (train_test_split here is a number, the training fraction,
# not the scikit-learn function of the same name)
names = data.columns.values
split_idx = int(data.shape[0] * train_test_split)

dataFrame_train = pd.DataFrame(data[:split_idx], columns=names)
train_X = dataFrame_train  # .iloc[:, 0:-1]
train_y = dataFrame_train["Price"]
# drop the first n_in targets so they line up with the rows left after lagging
train_y = train_y.tail(train_y.shape[0] - n_in)

dataFrame_test = pd.DataFrame(data[split_idx:], columns=names)
test_X = dataFrame_test  # .iloc[:, 0:-1]
test_y = dataFrame_test["Price"]
test_y = test_y.tail(test_y.shape[0] - n_in)

dataFrame_test = dataFrame_test.tail(dataFrame_test.shape[0] - n_in)
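One quick sanity check on a split like this is to confirm that, after reversing, every training row is older than every test row, and that the trimmed targets have n_in fewer rows than their frames. The sketch below does that on made-up data; `train_fraction` is a hypothetical stand-in for the `train_test_split` value, and the dates and prices are invented.

```python
import pandas as pd

n_in = 2
train_fraction = 0.8  # hypothetical stand-in for train_test_split

# Toy data in newest-first order, as the original file appears to be
data = pd.DataFrame({
    "Date": pd.date_range("2020-01-10", periods=10, freq="D")[::-1],
    "Price": [1.19, 1.18, 1.17, 1.16, 1.15, 1.14, 1.13, 1.12, 1.11, 1.10],
})
data = data.iloc[::-1]  # oldest first

split_idx = int(data.shape[0] * train_fraction)
train, test = data.iloc[:split_idx], data.iloc[split_idx:]

# Trim the first n_in targets of each set, as in the split above
train_y = train["Price"].tail(train.shape[0] - n_in)
test_y = test["Price"].tail(test.shape[0] - n_in)

# The split is chronological only if both of these hold
assert data["Date"].is_monotonic_increasing
assert train["Date"].max() < test["Date"].min()

print(len(train), len(train_y), len(test), len(test_y))
```

If either assertion fails on the real data, test rows are mixed into the training period, which would be one way for future information to reach the model.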
