我正在尝试将Pandas数据帧转换为NumPy数组以使用Sklearn创建模型.我会在这里简化问题.
>>> mydf.head(10)
IdVisita
445 latam
446 NaN
447 grados
448 grados
449 eventos
450 eventos
451 Reescribe-medios-clases-online
454 postgrados
455 postgrados
456 postgrados
Name: cat1, dtype: object
>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)
追溯:
ValueError Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
2 mydf.head(10)
3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
996 self
997 """
--> 998 self.fit_transform(X)
999 return self
1000
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
1052 """
1053 return _transform_selected(X, self._fit_transform,
-> 1054 self.categorical_features, copy=True)
1055
1056 def _transform(self, X):
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
870 """
871 if selected == "all":
--> 872 return transform(X)
873
874 X = atleast2d_or_csc(X, copy=copy)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
1001 def _fit_transform(self, X):
1002 """Assumes X contains only categorical features."""
-> 1003 X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
1004 if np.any(X < 0):
1005 raise ValueError("X needs to contain only non-negative integers.")
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
279 array = np.ascontiguousarray(array, dtype=dtype)
280 else:
--> 281 array = np.asarray(array, dtype=dtype)
282 if not allow_nans:
283 _assert_all_finite(array)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: invalid literal for long() with base 10: 'postgrados'
注意IdVisita是这里的索引,数字可能不是全部连续的.
有线索吗?
解决方法:
你在这里的错误是你正在调用文档中的OneHotEncoder
The input to this transformer should be a matrix of integers
但你的df有一个单独的’cat1’列,它是dtype对象,实际上是一个String.
你应该使用LabelEcnoder:
In [13]:
le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
'postgrados'], dtype=object)
注意我必须删除NaN行,因为这将引入一个不能用于排序的混合dtype,例如浮动> str不起作用
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 [email protected] 举报,一经查实,本站将立刻删除。