Some methods do not work when I try to handle missing values in Pandas

I am trying to handle some missing values in a dataset, and I am learning from a tutorial. Below is the code I use to read the data:

import pandas as pd
import numpy as np

questions = pd.read_csv("./archive/questions.csv")

print(questions.head())

My data looks like this (sample values from questions.head() are listed in the edit below).

These are the methods I used to handle the missing values. None of them worked.

questions.replace(to_replace = np.nan,value = -99)
questions = questions.fillna(method ='pad')
questions.interpolate(method ='linear',limit_direction = 'forward')

Then I tried dropping the rows that contain missing values. None of these worked either; all of them return an empty DataFrame.

questions.dropna()
questions.dropna(how = "all")
questions.dropna(axis = 1)

What am I doing wrong?

Edit:

Values from questions.head()

[[1 '2008-07-31T21:26:37Z' nan '2011-03-28T00:53:47Z' 1 nan 0.0]
 [4 '2008-07-31T21:42:52Z' nan nan 458 8.0 13.0]
 [6 '2008-07-31T22:08:08Z' nan nan 207 9.0 5.0]
 [8 '2008-07-31T23:33:19Z' '2013-06-03T04:00:25Z' '2015-02-11T08:26:40Z'
  42 nan 8.0]
 [9 '2008-07-31T23:40:59Z' nan nan 1410 1.0 58.0]]

Values from questions.head() in dictionary form.

{'Id': {0: 1,1: 4,2: 6,3: 8,4: 9},'CreationDate': {0: '2008-07-31T21:26:37Z',1: '2008-07-31T21:42:52Z',2: '2008-07-31T22:08:08Z',3: '2008-07-31T23:33:19Z',4: '2008-07-31T23:40:59Z'},'ClosedDate': {0: nan,1: nan,2: nan,3: '2013-06-03T04:00:25Z',4: nan},'DeletionDate': {0: '2011-03-28T00:53:47Z',1: nan,2: nan,3: '2015-02-11T08:26:40Z',4: nan},'score': {0: 1,1: 458,2: 207,3: 42,4: 1410},'OwnerUserId': {0: nan,1: 8.0,2: 9.0,3: nan,4: 1.0},'AnswerCount': {0: 0.0,1: 13.0,2: 5.0,3: 8.0,4: 58.0}}

Information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17203824 entries, 0 to 17203823
Data columns (total 7 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Id            int64  
 1   CreationDate  object 
 2   ClosedDate    object 
 3   DeletionDate  object 
 4   score         int64  
 5   OwnerUserId   float64
 6   AnswerCount   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 918.8+ MB

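Solution

Could you try specifying axis explicitly and see whether that works? A plain fillna() with a fill value should still work without an axis, but with pad you need it so that pandas knows in which direction to fill the missing values.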

>>> questions.fillna(method='pad',axis=1)
  Id          CreationDate            ClosedDate          DeletionDate Score OwnerUserId AnswerCount
0  1  2008-07-31T21:26:37Z  2008-07-31T21:26:37Z  2011-03-28T00:53:47Z     1           1           0
1  4  2008-07-31T21:42:52Z  2008-07-31T21:42:52Z  2008-07-31T21:42:52Z   458           8          13
2  6  2008-07-31T22:08:08Z  2008-07-31T22:08:08Z  2008-07-31T22:08:08Z   207           9           5
3  8  2008-07-31T23:33:19Z  2013-06-03T04:00:25Z  2015-02-11T08:26:40Z    42          42           8
4  9  2008-07-31T23:40:59Z  2008-07-31T23:40:59Z  2008-07-31T23:40:59Z  1410           1          58
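
To make the effect of axis concrete, here is a minimal sketch that rebuilds the five head() rows by hand and pads in both directions. It assumes a recent pandas and uses DataFrame.ffill, which performs the same forward fill as fillna(method='pad') (newer pandas releases deprecate the method argument). The variable names filled_down and filled_across are made up for illustration.

```python
import numpy as np
import pandas as pd

# The five sample rows from questions.head(), rebuilt by hand.
questions = pd.DataFrame({
    "Id": [1, 4, 6, 8, 9],
    "CreationDate": [
        "2008-07-31T21:26:37Z", "2008-07-31T21:42:52Z", "2008-07-31T22:08:08Z",
        "2008-07-31T23:33:19Z", "2008-07-31T23:40:59Z",
    ],
    "ClosedDate": [np.nan, np.nan, np.nan, "2013-06-03T04:00:25Z", np.nan],
    "DeletionDate": ["2011-03-28T00:53:47Z", np.nan, np.nan,
                     "2015-02-11T08:26:40Z", np.nan],
    "Score": [1, 458, 207, 42, 1410],
    "OwnerUserId": [np.nan, 8.0, 9.0, np.nan, 1.0],
    "AnswerCount": [0.0, 13.0, 5.0, 8.0, 58.0],
})

# Default direction (axis=0): each NaN takes the value from the row above.
# A NaN at the top of a column has nothing above it and stays NaN.
filled_down = questions.ffill()

# axis=1: each NaN takes the value from the column to its left.
filled_across = questions.ffill(axis=1)

print(filled_down.isna().sum())
print(filled_across)
```

Padding down the columns leaves the leading NaN values untouched because nothing precedes them. Padding across copies whatever sits in the column to the left, so OwnerUserId in row 3 ends up holding the Score value 42, as in the output above. Whether that mixing of columns is acceptable depends on the data.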

Simply applying fillna() with a fill value to the whole DataFrame also works.

>>> questions.fillna('-')

   Id          CreationDate            ClosedDate          DeletionDate  Score OwnerUserId  AnswerCount
0   1  2008-07-31T21:26:37Z                     -  2011-03-28T00:53:47Z      1           -          0.0
1   4  2008-07-31T21:42:52Z                     -                     -    458           8         13.0
2   6  2008-07-31T22:08:08Z                     -                     -    207           9          5.0
3   8  2008-07-31T23:33:19Z  2013-06-03T04:00:25Z  2015-02-11T08:26:40Z     42           -          8.0
4   9  2008-07-31T23:40:59Z                     -                     -   1410           1         58.0
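
One thing worth checking, sketched below on a cut-down version of the sample (three of the seven columns, rebuilt by hand): fillna() and dropna() return a new DataFrame and leave the original untouched unless the result is assigned. The placeholder fill values and the names filled, all_complete, not_all_empty and with_owner are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Three columns from the head() sample, rebuilt by hand.
questions = pd.DataFrame({
    "Id": [1, 4, 6, 8, 9],
    "ClosedDate": [np.nan, np.nan, np.nan, "2013-06-03T04:00:25Z", np.nan],
    "OwnerUserId": [np.nan, 8.0, 9.0, np.nan, 1.0],
})

# Calling fillna() without keeping the result changes nothing.
questions.fillna({"OwnerUserId": -99})
print(questions["OwnerUserId"].isna().sum())  # still 2 missing values

# Keep the returned frame; a dict gives each column its own fill value
# (both placeholder values here are illustrative).
filled = questions.fillna({"ClosedDate": "not closed", "OwnerUserId": -99})
print(filled)

# dropna() drops a row if ANY column is NaN; every sample row has one.
all_complete = questions.dropna()
# how="all" only drops rows in which every column is NaN.
not_all_empty = questions.dropna(how="all")
# subset restricts the check to the listed columns.
with_owner = questions.dropna(subset=["OwnerUserId"])
print(len(all_complete), len(not_all_empty), len(with_owner))
```

On these five rows every row has at least one NaN, so a bare dropna() comes back empty, while subset keeps the rows that have an OwnerUserId.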
