How to remove outliers from a Pandas DataFrame with numeric and non-numeric data


I have a DataFrame (cgf) like the one shown below, and I want to remove outliers from the numeric columns only:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Product        180 non-null    object  
 1   Age            180 non-null    int64   
 2   Gender         180 non-null    object  
 3   Education      180 non-null    category
 4   MaritalStatus  180 non-null    object  
 5   Usage          180 non-null    int64   
 6   fitness        180 non-null    category
 7   Income         180 non-null    int64   
 8   Miles          180 non-null    int64   
dtypes: category(2), int64(4), object(3)

I tried several scripts using the z-score and IQR methods, but none of them worked. For example, here is a z-score script that fails:

from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cgf))   # get the z-score of every value with respect to their columns
print(z)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-102-2759aa3fbd60> in <module>
----> 1 z = np.abs(stats.zscore(cgf))   # get the z-score of every value with respect to their columns
      2 print(z)

~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a,axis,ddof,nan_policy)
   2495         sstd = np.nanstd(a=a,axis=axis,ddof=ddof,keepdims=True)
   2496     else:
-> 2497         mns = a.mean(axis=axis,keepdims=True)
   2498         sstd = a.std(axis=axis,keepdims=True)
   2499 

~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a,dtype,out,keepdims)
    160     ret = umr_sum(arr,keepdims)
    161     if isinstance(ret,mu.ndarray):
--> 162         ret = um.true_divide(
    163                 ret,rcount,out=ret,casting='unsafe',subok=False)
    164         if is_float16_result and out is None:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

Here is the IQR approach I tried, which also fails:

np.where((cgf < (Q1 - 1.5 * iqr)) | (cgf > (Q3 + 1.5 * iqr)))

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-bb3dfd2ce6c5> in <module>
----> 1 np.where((cgf < (Q1 - 1.5 * iqr)) | (cgf > (Q3 + 1.5 * iqr)))

~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in f(self,other)
    702 
    703         # See GH#4537 for discussion of scalar op behavior
--> 704         new_data = dispatch_to_series(self,other,op,axis=axis)
    705         return self._construct_result(new_data)
    706 

~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in dispatch_to_series(left,right,func,axis)
    273         #  _frame_arith_method_with_reindex
    274 
--> 275         bm = left._mgr.operate_blockwise(right._mgr,array_op)
    276         return type(left)(bm)
    277 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in operate_blockwise(self,array_op)
    362         Apply array_op blockwise with another (aligned) BlockManager.
    363         """
--> 364         return operate_blockwise(self,array_op)
    365 
    366     def apply(self: T,f,align_keys=None,**kwargs) -> T:

~\anaconda3\lib\site-packages\pandas\core\internals\ops.py in operate_blockwise(left,array_op)
     36             lvals,rvals = _get_same_shape_values(blk,rblk,left_ea,right_ea)
     37 
---> 38             res_values = array_op(lvals,rvals)
     39             if left_ea and not right_ea and hasattr(res_values,"reshape"):
     40                 res_values = res_values.reshape(1,-1)

~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left,op)
    228     if should_extension_dispatch(lvalues,rvalues):
    229         # Call the method on lvalues
--> 230         res_values = op(lvalues,rvalues)
    231 
    232     elif is_scalar(rvalues) and isna(rvalues):

~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self,other)
     63         other = item_from_zerodim(other)
     64 
---> 65         return method(self,other)
     66 
     67     return new_method

~\anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in func(self,other)
     74         if not self.ordered:
     75             if opname in ["__lt__","__gt__","__le__","__ge__"]:
---> 76                 raise TypeError(
     77                     "Unordered Categoricals can only compare equality or not"
     78                 )

TypeError: Unordered Categoricals can only compare equality or not

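How can I resolve these errors? It looks like the mix of categorical and numeric data in my DataFrame is causing the problem, but I'm new to this and don't know how to fix it so that I can remove the outliers.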

Solution

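Both errors come from applying the calculation to the whole DataFrame, including the object and category columns. Restrict the outlier calculation to the numeric columns (for example with cgf.select_dtypes(include="number")), then use the resulting row mask to filter the full DataFrame. For example, if you remove outliers based on the Age column, that change is reflected in the whole DataFrame: every row that contains an outlying Age value is dropped entirely.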
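
Here is a minimal sketch of that idea, not a definitive implementation. The helper names drop_outliers_zscore and drop_outliers_iqr, the thresholds (3 and 1.5), and the small sample frame are assumptions made for illustration; the real cgf has 180 rows and more columns. The z-score is computed directly in pandas with ddof=0, which matches the default of scipy.stats.zscore.

```python
import pandas as pd


def drop_outliers_zscore(df, threshold=3.0):
    """Drop rows whose absolute z-score reaches `threshold` in any numeric column."""
    # Only int/float columns take part; object and category columns are skipped.
    numeric = df.select_dtypes(include="number")
    # Population standard deviation (ddof=0), the scipy.stats.zscore default.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # Caveat: a constant column gives NaN z-scores, and NaN < threshold is False,
    # so such a column would cause every row to be dropped.
    keep = (z.abs() < threshold).all(axis=1)
    return df.loc[keep]


def drop_outliers_iqr(df, k=1.5):
    """Drop rows that fall outside [Q1 - k*IQR, Q3 + k*IQR] in any numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # The per-column bounds (Series) align with the DataFrame's column labels.
    within = (numeric >= q1 - k * iqr) & (numeric <= q3 + k * iqr)
    return df.loc[within.all(axis=1)]


# Hypothetical sample data standing in for cgf; the last row has an extreme Age.
cgf = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A", "B", "C"],
    "Gender": pd.Categorical(["M", "F"] * 6),
    "Age": [22, 23, 24, 25, 26, 27, 28, 24, 25, 26, 23, 95],
    "Income": [30000, 31000, 32000, 33000, 34000, 35000,
               36000, 32000, 33000, 34000, 31000, 33000],
})

print(drop_outliers_zscore(cgf).shape)
print(drop_outliers_iqr(cgf).shape)
```

Both helpers return the original DataFrame with all of its columns, minus the rows flagged as outliers, so the non-numeric columns are kept and never compared or averaged.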

Reference: towardsdatascience

Reference: how-to-remove-outliers