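Complete a function to fill in missing values

How do I complete a function that fills in missing values?

Hi, I'm currently working on this Python 3.x exercise. I need to complete a function that finds the missing values and replaces them. I was thinking I could use linear regression to fill in the missing values.

Resolved: the error I was getting when reshaping.

Resolved: the error was "ValueError: could not convert string to float: '2012/3/13 16:00:00'".

-Problem- I can run the function and it prints output, but I get the same value for every expected output.

If there is a better way to do this, I would appreciate hearing it. Thanks in advance.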

I have a screenshot of the problem here

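Problem

You are given a time series of daily readings of mercury levels in a river. In each test case, the day's highest level is missing for certain days. By analyzing the data, try to identify the missing mercury levels for those days. Each row of data contains two tab-separated values: a timestamp and the day's highest reading.

There are exactly 20 rows marked as missing in each input file. The missing values are marked as "Missing_1", "Missing_2", ..., "Missing_20". These missing records are randomly dispersed among the rows of data.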
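To make the input format concrete, here is a minimal sketch of how the rows could be split into timestamps and values. The helper name `parse_readings` is hypothetical, and the month/day/year timestamp format is an assumption based on the sample rows quoted later in this thread.

```python
from datetime import datetime


def parse_readings(readings):
    """Turn tab-separated lines into (datetime, float or None) pairs.

    Rows whose value starts with "Missing_" get None as their value.
    Assumes timestamps look like "12/13/2012 16:00:00".
    """
    rows = []
    for line in readings:
        stamp, raw = line.strip().split("\t")
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        value = None if raw.startswith("Missing_") else float(raw)
        rows.append((when, value))
    return rows
```

Keeping the missing rows in place (as None) preserves their position, which matters because the answers must be printed in input order.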

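Constraints

Mercury levels are all

Function Description

Complete the calcMissing function in the editor below. It should print 20 rows as floats, one for each missing value.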

def calcMissing(readings):
    # Write your code here

if __name__ == '__main__':
    readings_count = int(input().strip())

    readings = []

    for _ in range(readings_count):
        readings_item = input()
        readings.append(readings_item)

    calcMissing(readings)

Here is what I have so far:


#!/bin/python3

import math
import os
import random
import re
import sys


#
# Complete the 'calcMissing' function below.
#
# The function accepts STRING_ARRAY readings as parameter.
#


def calcMissing(readings):
    from datetime import datetime
    from sklearn.linear_model import LogisticRegression,LinearRegression,Ridge,SGDRegressor
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np

    dates = []
    temp_values = []
    for x in readings:
        temp_list = x.split('\t')
        float_days = datetime.strptime(temp_list[0],'%m/%d/%Y %H:%M:%S')
        dates.append(float_days)
        try:
            temp_values.append(float(temp_list[1]))
        except:
            temp_values.append(np.nan)
            pass

    temp_df = pd.Series(temp_values,index=dates)
    temp_df.index.name = 'Date'

    temp_df = temp_df.reset_index(name='Temp')
    missing_temp_dates = temp_df[temp_df['Temp'].isnull()]['Date'].values
    missing_temp_dates = missing_temp_dates.astype('datetime64[D]').astype(int)
    missing_temp_dates = [[x] for x in missing_temp_dates]
    missing_temp_dates = np.asarray(missing_temp_dates)

    temp_df = temp_df.dropna()
    dates,temps = [[x] for x in temp_df['Date'].values],temp_df['Temp'].values


    X,y = np.asarray(dates),np.asarray(temps)

    from sklearn.ensemble import GradientBoostingRegressor
    mdl = GradientBoostingRegressor()
    mdl.fit(X,y)

    y_pred = mdl.predict(missing_temp_dates)
    for pred in y_pred:
        print(pred)        

if __name__ == '__main__':
    readings_count = int(input().strip())

    readings = []

    for _ in range(readings_count):
        readings_item = input()
        readings.append(readings_item)

    calcMissing(readings)

I get:

Compiler Message

Wrong Answer

Your Output (stdout)

27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584
27.066148276858584

Expected Output

Solution

This code definitely helps:

        
#!/bin/python3
import math
import os
import random
import re
import sys

def calcMissing(readings):

    import datetime as dt

    import pandas as pd
    import numpy as np
    import random
    dates = []
    temp_values = []
    for x in readings:
        temp_list = x.split('\t')
        dates.append(temp_list[0])
        temp_values.append(temp_list[1])
    

    df = pd.DataFrame(list(zip(dates,temp_values)),columns =['X','Y'])
    df['X'] = pd.to_datetime(df['X'])

    missing_indices = df.Y.str.contains('^Missing')
    missing_dates = df.X[missing_indices]
        
    b=df.index[missing_indices]
    df1 = df.drop(b)

    df1['Date']=df1['X'].map(dt.datetime.toordinal)
    df1['Y'] = pd.to_numeric(df1['Y'])

    x = df1[['Date']]
    y = np.asarray(df1['Y'])
    
    from sklearn.ensemble import GradientBoostingRegressor
    lr = GradientBoostingRegressor()
    lr.fit(x,y)

    
    x_test = df.X[b]
    x_test1 = x_test.map(dt.datetime.toordinal)
    y_pred = lr.predict(np.array(x_test1).reshape(-1,1))
    for p in y_pred:
        print(p)

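I think the error is in the set_index line, because you end up with one column, "value", and an index, "date". That is why your DataFrame has only one column.

Try removing set_index and check again.

Given how noisy the sample plot is, I don't think a linear regression model will work well. Since the missing data is randomly scattered, you can simply interpolate between the values on the day before and the day after the missing value, assuming those values are not missing as well.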
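As a rough sketch of that idea in its simplest form, the function below estimates each gap as the plain mean of the nearest known reading on each side. The name `fill_with_neighbor_mean` is hypothetical, the input is assumed to be a list of floats with None in the missing positions, and at least one known value is assumed to exist.

```python
def fill_with_neighbor_mean(values):
    """Estimate each None entry as the mean of its nearest known neighbors.

    If a gap sits at the start or end of the series, only one neighbor
    exists and its value is used as-is.
    """
    estimates = []
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Scan left, then right, for the closest non-missing reading.
        prev_known = next(
            (values[j] for j in range(i - 1, -1, -1) if values[j] is not None), None
        )
        next_known = next(
            (values[j] for j in range(i + 1, len(values)) if values[j] is not None), None
        )
        neighbors = [x for x in (prev_known, next_known) if x is not None]
        estimates.append(sum(neighbors) / len(neighbors))
    return estimates
```

This treats both neighbors as equally relevant regardless of how far away they are in time.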

Consecutive values in the input are not necessarily one day apart, as the following test input shows:

12/13/2012 16:00:00 27.52
12/14/2012 16:00:00 Missing_19
12/17/2012 16:00:00 27.215

So, when interpolating, you can weight the neighboring values by how close their dates are to the missing one.
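A minimal sketch of such date-weighted interpolation, using only the standard library, might look like the following. The function name `interpolate_missing` is hypothetical; it assumes tab-separated rows, the month/day/year timestamp format shown in the sample above, distinct timestamps, and at least one known reading. Gaps at either end of the series simply copy the nearest known value, which is a simplification.

```python
from datetime import datetime


def interpolate_missing(readings):
    """Return one estimate per Missing_k row, in input order."""
    times, values = [], []
    for line in readings:
        stamp, raw = line.strip().split("\t")
        times.append(datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"))
        values.append(None if raw.startswith("Missing") else float(raw))

    known = [i for i, v in enumerate(values) if v is not None]
    estimates = []
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            # Fraction of the time gap that has elapsed at the missing row.
            span = (times[hi] - times[lo]).total_seconds()
            w = (times[i] - times[lo]).total_seconds() / span
            estimates.append(values[lo] + w * (values[hi] - values[lo]))
        else:
            # No known value on one side: fall back to the nearest one.
            nearest = before[-1] if before else after[0]
            estimates.append(values[nearest])
    return estimates
```

For the three rows above, the missing day lies one day after the first reading and three days before the next, so the weight is 1/4 and the estimate stays closer to 27.52 than a plain average would. Inside calcMissing, each returned estimate would be printed on its own line.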

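This problem is essentially the same as HackerRank's Missing Stock Prices: even the numbers are the same, except that there they are stock prices. You can look at solutions on the leaderboard or in the editorial.

Adding comments to @Vamshi Krishna Gundu's code, as people requested: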

#!/bin/python3
import math
import os
import random
import re
import sys

def calcMissing(readings):

    import datetime as dt

    import pandas as pd
    import numpy as np
    import random
    dates = []
    temp_values = []
    """
    The steps are:
    * Create a list of dates (dates) and a list of mercury levels (temp_values)
        * By looping through each line (x) of readings
            * Split the strings (x) into a list (temp_list) by specifying the tab separator
            * Append the dates with temp_list[0]
            * Append the temp_values (mercury levels) with temp_list[1]
    * Create a Pandas Data Frame (df) with the info
    * Find indices of df corresponding to missing data -> b
    * Create a Pandas Data Frame (df1) with the known info
    * Do not forget to:
        * Convert the string of dates column of (df['X']) into standard date format to later convert to ordinal numbers.
        * Convert the mercury level data to numeric (they may contain a newline character)
    * USE (MACHINE LEARNING) GradientBoostingRegressor FOR REGRESSION
    * Print missing values    
    """
    
    for x in readings:
        #***************SPLIT THE DATA
        temp_list = x.split('\t')
        dates.append(temp_list[0])
        temp_values.append(temp_list[1])
            
    #********************df => DataFrame with dates ('X') and levels of mercury as headers ('Y')
    df = pd.DataFrame(list(zip(dates,temp_values)),columns =['X','Y'])

    
    #********************CONVERT TIME TO STD FORMAT
    df['X'] = pd.to_datetime(df['X']) 

    
    
    #********************FIND INDICES OF MISSING DATA
    # Find the Missing data in the mercury level column (Y)
    missing_indices = df.Y.str.contains('Missing') # Boolean Series; the caret (^) anchor used in the original pattern is not needed here
    
    #********************FIND CORRESPONDING X VALUES (DATES) OF INDICES
    missing_dates = df.X[missing_indices] # Find the dates corresponding to missing_indices

    
    #********************FIND CORRESPONDING INDICES
    b=df.index[missing_indices] # Find index number of missing data
      
    
    
    #********************CREATE A NEW df1 WITHOUT MISSING DATA FOR TESTING
    df1 = df.drop(b) # Create a new data frame (df1) without missing data
 

    #********************CONVERT x TO ORDINAL NUMBER (date to number)
    df1['Date']=df1['X'].map(dt.datetime.toordinal) # Add a column to df1 (Date) with the dates converted to ordinal numbers 
    
     #********************CONVERT y TO NUMERIC NUMBER (delete \n)
    df1['Y'] = pd.to_numeric(df1['Y']) # Convert the mercury level column to numerics (removes the '\n')
        
    #*********************SAVE x and y
    x = df1[['Date']] # The single bracket will output a Pandas Series,while a double bracket will output a ***Pandas DataFrame***.
    # We use double brackets to display with the header

    y = np.asarray(df1['Y']) # Convert df1['Y'] as a *****one dimensional***** array (a list)
        
    
    
    #*********************USE (MACHINE LEARNING) GradientBoostingRegressor FOR REGRESSION
    from sklearn.ensemble import GradientBoostingRegressor
    
    #*********************LOAD THE REGRESSOR re
    re = GradientBoostingRegressor()
    
    #*********************FIT THE DATA WITH re
    re.fit(x,y)  # Fit the known data
    
    
    #*********************ASSIGN x_test1 -> DATE IN ORDINAL NUMBER CORRESPONDING TO MISSING DATA
    x_test = df.X[b] # Test the missing data
    x_test1 = x_test.map(dt.datetime.toordinal)
    
    #*********************PREDICT THE MISSING VALUES BY USING x_test1 IN THE REGRESSION ESTIMATOR re
    y_pred = re.predict(np.array(x_test1).reshape(-1,1)) # Now trying to reshape with (-1,1) . We have provided column as 1 but rows as unknown . 
    
    #*********************PRINT EACH MISSING VALUE IN A SEPARATE LINE
    for el in y_pred:
        print(el)