
Complete the function to fill in missing values

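Hi, I am currently trying to complete this Python 3.x exercise. I need to complete a function that finds the missing values and replaces them. I was thinking of using linear regression to fill in the missing values.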

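SOLVED: the error I was getting when reshaping

Solved. The error was "ValueError: could not convert string to float: '2012/3/13 16:00:00'".

-Problem- I can run the function and it produces output, but I get the same value for every expected output.

If there is a better way to do this, I would appreciate the advice. Thanks in advance.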

I have a screenshot of the problem here

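Problem

You are given a time series of daily readings of the mercury level in a river. In each test case, the day's highest level is missing for certain days. By analyzing the data, try to identify the missing mercury levels for those days. Each row of data contains two tab-separated values: a time stamp and the day's highest reading.

There are exactly 20 rows marked as missing in each input file. The missing values are marked as "Missing_1", "Missing_2", ..., "Missing_20". These missing records have been randomly dispersed in the rows of data.

Constraints

Mercury levels are all

Function Description

Complete the calcMissing function in the editor below. It should print 20 rows, one for each missing value, as floats.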

def calcMissing(readings):
    # Write your code here

if __name__ == '__main__':
    readings_count = int(input().strip())

    readings = []

    for _ in range(readings_count):
        readings_item = input()
        readings.append(readings_item)

    calcMissing(readings)

Here is what I have so far:


#!/bin/python3

import math
import os
import random
import re
import sys


#
# Complete the 'calcMissing' function below.
#
# The function accepts STRING_ARRAY readings as parameter.
#


def calcMissing(readings):
    from datetime import datetime
    from sklearn.linear_model import LogisticRegression,LinearRegression,Ridge,SGDRegressor
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np

    dates = []
    temp_values = []
    for x in readings:
        temp_list = x.split('\t')
        float_days = datetime.strptime(temp_list[0],'%m/%d/%Y %H:%M:%S')
        dates.append(float_days)
        try:
            temp_values.append(float(temp_list[1]))
        except:
            temp_values.append(np.nan)
            pass

    temp_df = pd.Series(temp_values,index=dates)
    temp_df.index.name = 'Date'

    temp_df = temp_df.reset_index(name='Temp')
    missing_temp_dates = temp_df[temp_df['Temp'].isnull()]['Date'].values
    missing_temp_dates = missing_temp_dates.astype('datetime64[D]').astype(int)
    missing_temp_dates = [[x] for x in missing_temp_dates]
    missing_temp_dates = np.asarray(missing_temp_dates)

    temp_df = temp_df.dropna()
    dates,temps = [[x] for x in temp_df['Date'].values],temp_df['Temp'].values


    X,y = np.asarray(dates),np.asarray(temps)

    from sklearn.ensemble import GradientBoostingRegressor
    mdl = GradientBoostingRegressor()
    mdl.fit(X,y)

    y_pred = mdl.predict(missing_temp_dates)
    for pred in y_pred:
        print(pred)        

if __name__ == '__main__':
    readings_count = int(input().strip())

    readings = []

    for _ in range(readings_count):
        readings_item = input()
        readings.append(readings_item)

    calcMissing(readings)

I get:

Compiler Message

Wrong Answer

Your Output (stdout)

27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584

Expected Output

32.69
32.15
32.61
29.3
28.96
28.78
31.05
29.58
29.5
30.9
31.26
31.48
29.74
29.31
29.72
28.88
30.2
27.3
26.7
27.52

Solutions

  1. This code should definitely help:
        
#!/bin/python3
import math
import os
import random
import re
import sys

def calcMissing(readings):

    import datetime as dt

    import pandas as pd
    import numpy as np
    import random
    dates = []
    temp_values = []
    for x in readings:
        temp_list = x.split('\t')
        dates.append(temp_list[0])
        temp_values.append(temp_list[1])
    

    df = pd.DataFrame(list(zip(dates,temp_values)),columns =['X','Y'])
    df['X'] = pd.to_datetime(df['X'])

    missing_indices = df.Y.str.contains('^Missing')
    missing_dates = df.X[missing_indices]
        
    b=df.index[missing_indices]
    df1 = df.drop(b)

    df1['Date']=df1['X'].map(dt.datetime.toordinal)
    df1['Y'] = pd.to_numeric(df1['Y'])

    x = df1[['Date']]
    y = np.asarray(df1['Y'])
    
    from sklearn.ensemble import GradientBoostingRegressor
    lr = GradientBoostingRegressor()
    lr.fit(x,y)

    
    x_test = df.X[b]
    x_test1 = x_test.map(dt.datetime.toordinal)
    y_pred = lr.predict(np.array(x_test1).reshape(-1,1))
    for p in y_pred:
        print(p) 
    
,

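I think the error is in the set_index line, because you end up with one column of values and an index of dates. That is why your DataFrame has only one column.

Try removing set_index and check again.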

,

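Given how noisy the sample graph is, I don't think a linear regression model will work well. Since the missing data are randomly scattered, you can simply interpolate between the values on the day before and the day after each missing value, assuming those values are not missing as well.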

Time series of mercury readings before and after

Consecutive values in the input are not necessarily one day apart, as the following test input shows:

12/13/2012 16:00:00 27.52
12/14/2012 16:00:00 Missing_19
12/17/2012 16:00:00 27.215

So, when interpolating, you can weight the neighboring values by how close their dates are to the missing one.
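A minimal sketch of that date-weighted interpolation, using only the standard library. This is an illustration, not code from the challenge or its editorial: the name interpolate_missing is hypothetical, the fallback to the nearest known reading at either end of the series is my own assumption, and it assumes tab-separated rows in the 'MM/DD/YYYY HH:MM:SS' format with at least one known reading.

```python
from datetime import datetime


def interpolate_missing(readings, fmt="%m/%d/%Y %H:%M:%S"):
    """Estimate every 'Missing_*' row, returned in input order."""
    times, values = [], []
    for line in readings:
        stamp, raw = line.strip().split("\t")
        times.append(datetime.strptime(stamp, fmt))
        # Unknown readings are stored as None so they can be filled later.
        values.append(None if raw.startswith("Missing") else float(raw))

    known = [i for i, v in enumerate(values) if v is not None]
    estimates = []
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            # Weight by elapsed time, so uneven gaps between rows are respected.
            span = (times[hi] - times[lo]).total_seconds()
            w = (times[i] - times[lo]).total_seconds() / span
            estimates.append(values[lo] + w * (values[hi] - values[lo]))
        else:
            # Assumption: at the edges, reuse the nearest known reading.
            nearest = before[-1] if before else after[0]
            estimates.append(values[nearest])
    return estimates


if __name__ == "__main__":
    sample = [
        "12/13/2012 16:00:00\t27.52",
        "12/14/2012 16:00:00\tMissing_19",
        "12/17/2012 16:00:00\t27.215",
    ]
    for estimate in interpolate_missing(sample):
        print(estimate)
```

For the three sample rows, the missing reading lies one day after 27.52 and three days before 27.215, so the weight is 1/4 and the estimate is 27.52 + 0.25 * (27.215 - 27.52) = 27.44375. Inside calcMissing you would print each returned value on its own line.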

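This problem is essentially the same as HackerRank's Missing Stock Prices: the numbers are the same, except that there they are stock prices. You can look at the solutions on the leaderboard or in the editorial.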

,

By request, here is @Vamshi Krishna Gundu's code with comments added:

#!/bin/python3
import math
import os
import random
import re
import sys

def calcMissing(readings):

    import datetime as dt

    import pandas as pd
    import numpy as np
    import random
    dates = []
    temp_values = []
    """
    The steps are:
    * Create a list of dates (dates) and a list of mercury levels (temp_values)
        * By looping through each line (x) of readings
            * Split the string (x) into a list (temp_list) on the tab separator
            * Append temp_list[0] to dates
            * Append temp_list[1] (the mercury level) to temp_values
    * Create a pandas DataFrame (df) with this data
    * Find the indices of df corresponding to missing data -> b
    * Create a pandas DataFrame (df1) with the known data
    * Do not forget to:
        * Convert the date strings in df['X'] to datetimes, so they can later be converted to ordinal numbers
        * Convert the mercury level data from strings to numbers
    * Use GradientBoostingRegressor (machine learning) for the regression
    * Print the missing values
    """
    
    for x in readings:
        #***************SPLIT THE DATA
        temp_list = x.split('\t')
        dates.append(temp_list[0])
        temp_values.append(temp_list[1])
            
    #********************df => DataFrame with the dates ('X') and the mercury levels ('Y') as columns
    df = pd.DataFrame(list(zip(dates,temp_values)),columns =['X','Y'])

    
    #********************CONVERT TIME TO STD FORMAT
    df['X'] = pd.to_datetime(df['X']) 

    
    
    #********************FIND INDICES OF MISSING DATA
    # Find the missing data in the mercury level column (Y)
    missing_indices = df.Y.str.contains('Missing') # Boolean Series; the '^' anchor from the original pattern is not needed here
    
    #********************FIND CORRESPONDING X VALUES (DATES) OF INDICES
    missing_dates = df.X[missing_indices] # Find the dates corresponding to missing_indices

    
    #********************FIND CORRESPONDING INDICES
    b=df.index[missing_indices] # Index labels of the rows with missing data
      
    
    
    #********************CREATE A NEW df1 WITHOUT MISSING DATA FOR TRAINING
    df1 = df.drop(b) # Create a new data frame (df1) without the missing rows
 

    #********************CONVERT x TO ORDINAL NUMBER (date to number)
    df1['Date']=df1['X'].map(dt.datetime.toordinal) # Add a column (Date) to df1 with the dates converted to ordinal numbers 
    
    #********************CONVERT y TO NUMERIC
    df1['Y'] = pd.to_numeric(df1['Y']) # Convert the mercury level column from strings to numbers
        
    #*********************SAVE x and y
    x = df1[['Date']] # Single brackets return a pandas Series; double brackets return a pandas DataFrame.
    # We use double brackets because the regressor expects a two-dimensional feature matrix

    y = np.asarray(df1['Y']) # Convert df1['Y'] to a one-dimensional NumPy array
        
    
    
    #*********************USE (MACHINE LEARNING) GradientBoostingRegressor FOR REGRESSION
    from sklearn.ensemble import GradientBoostingRegressor
    
    #*********************CREATE THE REGRESSOR reg (named reg so it does not shadow the re module)
    reg = GradientBoostingRegressor()
    
    #*********************FIT THE DATA WITH reg
    reg.fit(x,y)  # Fit the known data
    
    
    #*********************ASSIGN x_test1 -> DATE IN ORDINAL NUMBER CORRESPONDING TO MISSING DATA
    x_test = df.X[b] # Dates of the missing readings
    x_test1 = x_test.map(dt.datetime.toordinal)
    
    #*********************PREDICT THE MISSING VALUES BY USING x_test1 IN THE REGRESSION ESTIMATOR reg
    y_pred = reg.predict(np.array(x_test1).reshape(-1,1)) # reshape(-1,1) gives one column and lets NumPy infer the number of rows
    
    #*********************PRINT EACH MISSING VALUE IN A SEPARATE LINE
    for el in y_pred:
        print(el) 
