python - Pandas: default column names

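When reading a file with default column names, how should I refer to the columns?
df[1] seems to work almost all the time. However, it complains about types when I write conditions: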

In [60]: cond = ((df[1] != node) & (df[2] != deco))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/ferreirafm/work/colab/SNP/rawdata/<ipython-input-60-513a433bfeb5> in <module>()
----> 1 cond = ((df[1] != node) & (df[2] != deco))

/usr/lib64/python2.7/site-packages/pandas/core/series.pyc in wrapper(self, other)
140             if np.isscalar(res):
141                 raise TypeError('Could not compare %s type with Series'
--> 142                                 % type(other))
143             return Series(na_op(values, other),
144                           index=self.index, name=self.name)

TypeError: Could not compare <type 'str'> type with Series

Default names for the data frame columns suit my application better.

Solution:

It seems you are comparing a series of scalar values with a string:

In [73]: node = 'a'

In [74]: deco = 'b'

In [75]: data = [(10, 'a', 1), (11, 'b', 2), (12, 'c', 3)]

In [76]: df = pd.DataFrame(data)

In [77]: df
Out[77]: 
    0  1  2
0  10  a  1
1  11  b  2
2  12  c  3

In [78]: cond = ((df[1] != node) & (df[2] != deco))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-78-0afad3702859> in <module>()
----> 1 cond = ((df[1] != node) & (df[2] != deco))

/home/.../python2.7/site-packages/pandas/core/series.pyc in wrapper(self, other)
    140             if np.isscalar(res):
    141                 raise TypeError('Could not compare %s type with Series'
--> 142                                 % type(other))
    143             return Series(na_op(values, other),
    144                           index=self.index, name=self.name)

TypeError: Could not compare <type 'str'> type with Series

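Note that pandas can handle strings and numbers in a series, but comparing strings with numbers does not really make sense, so the error message is useful.
However, pandas should perhaps give a more detailed error message.

If your condition on column 2 is a number, it works: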

In [79]: deco = 3

In [80]: cond = ((df[1] != node) & (df[2] != deco))

In [81]: df[cond]
Out[81]: 
    0  1  2
1  11  b  2
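
The same check can be written as a short script. This is only a sketch that reuses the sample data from the session above. Printing the dtypes first makes it visible which column holds strings and which holds integers, so each comparison value can be chosen to match.

```python
import pandas as pd

node = 'a'   # compared with column 1, which holds strings
deco = 3     # compared with column 2, which holds integers

data = [(10, 'a', 1), (11, 'b', 2), (12, 'c', 3)]
df = pd.DataFrame(data)

# Inspect the column types before writing the condition.
print(df.dtypes)

# Each column is compared with a value of its own type.
cond = (df[1] != node) & (df[2] != deco)
result = df[cond]
print(result)
```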

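Some comments:

Perhaps some of your confusion is due to a design decision in pandas:

If you read data from a file with read_csv, the default column names of the resulting data frame are set to X.1 to X.N (X1 to XN for versions >= 0.9), and they are strings.

If you create a data frame from an existing array, list, or similar, the column names default to 0 to N-1 and are integers.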

In [23]: df = pd.read_csv(StringIO(data), header=None)

In [24]: df.columns
Out[24]: Index([X.1, X.2, X.3], dtype=object)

In [25]: df.columns[0]
Out[25]: 'X.1'

In [26]: type(df.columns[0])
Out[26]: str

In [27]: df = pd.DataFrame(randn(2,3))

In [30]: df.columns
Out[30]: Int64Index([0, 1, 2])

In [31]: df.columns[0]
Out[31]: 0

In [32]: type(df.columns[0])
Out[32]: numpy.int64

I opened a ticket to discuss this.

So your
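
If depending on version-specific defaults is a concern, one option is to set the column labels explicitly when reading the file. The sketch below uses the `names` parameter of `read_csv`. The sample data and the labels `id`, `node` and `deco` are made up for illustration and do not come from the question.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data standing in for the real file.
csv_text = "10,a,1\n11,b,2\n12,c,3\n"

# header=None: the first row is data, not a header.
# names=[...]: assign the column labels explicitly instead of relying on defaults.
df = pd.read_csv(StringIO(csv_text), header=None, names=['id', 'node', 'deco'])

print(list(df.columns))
print(df)
```

With explicit labels, a condition such as `df['node'] != node` does not depend on which default naming scheme the installed pandas version applies.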

In [60]: cond = ((df[1] != node) & (df[2] != deco))

should work for a data frame created from an array or similar, provided that df[1] and df[2] have the same types as node and deco.

If you have read the file with read_csv, then

In [60]: cond = ((df['X.2'] != node) & (df['X.3'] != deco))

should work with versions < 0.9, whereas it should be

In [60]: cond = ((df['X2'] != node) & (df['X3'] != deco))

with versions >= 0.9.
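
Another way to sidestep the naming difference is to select columns by position rather than by label. This is a sketch assuming a pandas version that provides `.iloc`. The helper `build_condition` is a hypothetical name introduced here, and the `X1` to `X3` labels are set by hand to mimic the read_csv defaults described above.

```python
import pandas as pd

def build_condition(df, node, deco):
    # Hypothetical helper: selects the second and third columns by position,
    # so the result does not depend on how the columns are labelled.
    return (df.iloc[:, 1] != node) & (df.iloc[:, 2] != deco)

# Frame with integer default labels (0, 1, 2).
df_int = pd.DataFrame([(10, 'a', 1), (11, 'b', 2), (12, 'c', 3)])

# Same data with string labels, set manually to imitate read_csv defaults.
df_str = df_int.copy()
df_str.columns = ['X1', 'X2', 'X3']

cond_int = build_condition(df_int, 'a', 3)
cond_str = build_condition(df_str, 'a', 3)

print(df_int[cond_int])
print(df_str[cond_str])
```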
