sklearn / PCA: error when trying to transform high-dimensional data
I get a data error when trying to use PCA to reduce high-dimensional vectors to 2 dimensions.
Here is my input data; each row has 300 dimensions:
vector
0 [0.01053525,-0.007869658,0.0024931028,-0.04...
1 [-0.024436072,-0.016484523,0.03859031,0.000...
2 [0.015011676,-0.020465894,0.004854744,-0.00...
3 [-0.010836455,-0.006562917,0.00265073,0.022...
4 [-0.018123362,-0.026007563,0.04781856,-0.03...
... ...
45124 [-0.016111804,-0.041917775,0.010192914,-0.0...
45125 [0.0311568,-0.013044083,0.030656694,-0.0126...
45126 [-0.021875003,-0.005635035,0.0076896898,-0....
45127 [-0.0062000924,-0.041035958,0.0077403532,0....
45128 [0.007794927,0.0019561667,0.15995999,-0.054...
[45129 rows x 1 columns]
My code:
data = pd.read_parquet('1.parquet',engine='fastparquet')
reduced = pca.fit_transform(data)
Error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number,not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.
Edit:
>>> data.shape
(45129, 1)
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries,0 to 45128
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vector 45129 non-null object
dtypes: object(1)
memory usage: 352.7+ KB
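Solution
Scikit-learn does not know how to handle a column whose cells contain arrays (lists). The DataFrame has shape (45129, 1) with dtype object, so PCA sees a single column of lists rather than 300 numeric columns, which is what triggers "setting an array element with a sequence". You need to expand that column so that each vector element gets its own column. Because every row holds an array of the same length, and there are only about 45,000 rows, this is easy to do. Once the data is expanded, PCA works as expected.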
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({"a": [[0.01,0.02,0.03],[0.04,0.4,0.1]]})
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
0 1 2
0 0.01 0.02 0.03
1 0.04 0.40 0.10
pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01,1.43048962e-17],[-1.93778224e-01,1.43048962e-17]])
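To apply the same idea to data shaped like the one in the question, here is a minimal sketch. It assumes a single object column named vector whose cells are equal-length lists of floats. The DataFrame below is synthetic stand-in data (100 random rows of 300 values), not the contents of the original parquet file, and the length check is an extra safeguard that was not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the parquet data (assumption): one object column
# named "vector", each cell holding a list of 300 floats.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"vector": [rng.normal(size=300).tolist() for _ in range(100)]}
)

# Confirm every row has the same length before stacking; ragged rows
# would reproduce the "setting an array element with a sequence" error.
lengths = data["vector"].map(len)
assert lengths.nunique() == 1, "rows have different vector lengths"

# Stack the per-row lists into one (n_rows, n_dims) float array.
X = np.array(data["vector"].tolist(), dtype=float)

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(X.shape, reduced.shape)  # (100, 300) (100, 2)
```

Passing the stacked NumPy array instead of an expanded DataFrame is only a convenience; pd.DataFrame(data["vector"].tolist()) as shown in the answer should behave the same way for PCA.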