Pandas: A Core Toolkit for Data Analysis
Built on NumPy and designed for data analysis.
import numpy as np
import pandas as pd
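Data Structures
Every pandas data structure carries an index.
Series
A Series can be thought of as a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index.
.index returns the index of the Series; when no index is given, it is a RangeIndex object (not a generator).
.values returns the values of the Series as an ndarray.
- Compared with an ndarray, a Series is an array that carries its own index (a one-dimensional array plus the matching labels).
- Indexing and slicing a Series work much like they do on an ndarray.
- Compared with a dict, a Series is an ordered dict: each index label maps to a value the way a key maps to a value.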
ar = np.random.rand(5)
s = pd.Series(ar)
print(ar)
print(s, type(s))
print('--------------')
print(list(s.index), type(s.index)) # index
print(s.values, type(s.values)) # values
[0.32461186 0.63422701 0.51008673 0.16219166 0.40639174]
0 0.324612
1 0.634227
2 0.510087
3 0.162192
4 0.406392
dtype: float64 <class 'pandas.core.series.Series'>
--------------
[0, 1, 2, 3, 4] <class 'pandas.core.indexes.range.RangeIndex'>
[0.32461186 0.63422701 0.51008673 0.16219166 0.40639174] <class 'numpy.ndarray'>
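To make the comparison with dict and ndarray concrete, here is a small sketch. The labels and values are made up for illustration and do not come from the example above.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not from the original text)
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# dict-like behaviour: membership tests and lookups use the labels
has_y = 'y' in s            # 'in' checks the index, not the values
as_dict = s.to_dict()

# ndarray-like behaviour: vectorized arithmetic keeps the labels
doubled = s * 2

# With no explicit index, pandas builds a RangeIndex
default_index = pd.Series([1.5, 2.5]).index

print(has_y, as_dict)
print(doubled)
print(default_index)
```

Note that `'y' in s` tests the labels; to test the values, something like `20 in s.values` would be needed.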
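Creating a Series
dtype sets the data type of the stored values; name gives the Series a name.
# Create from a dict: the keys become the index and the values become the values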
dic = {'a': 1, 'b':2, 'c':3}
s = pd.Series(dic)
print(s, type(s))
a 1
b 2
c 3
dtype: int64 <class 'pandas.core.series.Series'>
# Create from a one-dimensional array
arr = np.random.rand(10)
s = pd.Series(arr, index=list('abcdefghjk'), dtype=str, name='test') # np.str was removed in NumPy 1.24; use the built-in str
print(s, type(s))
a 0.32061579565266873
b 0.6422417901855576
c 0.4752166664686672
d 0.14271215219716993
e 0.3484167803947562
f 0.02810385749477773
g 0.4921923545502085
h 0.3364856517354894
j 0.5452820708357551
k 0.6106163951939324
Name: test, dtype: object <class 'pandas.core.series.Series'>
# Create from a scalar
s = pd.Series(100, index=range(4))
print(s)
0 100
1 100
2 100
3 100
dtype: int64
# .rename() returns a new Series with the new name; the original Series is unchanged
print(s)
s2 = s.rename('hhh')
print(s2)
0 100
1 100
2 100
3 100
dtype: int64
0 100
1 100
2 100
3 100
Name: hhh, dtype: int64
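Series Indexing
# Positional indexing
# Similar to a list but not identical: with the default integer index, s[-1] is a label lookup and raises a KeyError
s = pd.Series(np.random.rand(10))
print(s)
print(s[5], type(s[5])) # returns a numpy.float64 value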
0 0.729838
1 0.095203
2 0.180626
3 0.187282
4 0.390732
5 0.417309
6 0.153421
7 0.209588
8 0.921143
9 0.453336
dtype: float64
0.4173088101449465 <class 'numpy.float64'>
# Label indexing
s = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s['a'], type(s['a']))
print('-------------')
# Select several labels; returns a new Series
print(s[['a', 'b', 'e']])
a 0.374045
b 0.506414
c 0.756893
d 0.348560
e 0.675542
dtype: float64
0.37404462792248505 <class 'numpy.float64'>
-------------
a 0.374045
b 0.506414
e 0.675542
dtype: float64
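The difference between label lookups and positional lookups is easiest to see with the explicit accessors. The sketch below uses made-up values and assumes the default integer index.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])   # default integer index 0, 1, 2

# s[-1] is a label lookup here, and there is no label -1
try:
    s[-1]
    negative_label_found = True
except KeyError:
    negative_label_found = False

last_by_position = s.iloc[-1]     # .iloc is always positional
first_by_label = s.loc[0]         # .loc is always label-based

print(negative_label_found, last_by_position, first_by_label)
```

Using `.iloc` for positions and `.loc` for labels avoids having to remember how plain `s[...]` interprets an integer.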
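Series Slicing
Slicing a Series by integer position is half-open (the end is excluded); slicing by label is closed (both ends are included).
(If the Series has integer labels, s[1:4] is still a positional slice, so the end is excluded.)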
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s1[1:4], '\n', s1[4])
print('------------')
print(s2['a':'c'], '\n', s2['c'])
1 0.137954
2 0.237755
3 0.990893
dtype: float64
0.9702501216319398
------------
a 0.299157
b 0.435536
c 0.236996
dtype: float64
0.23699606339117407
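The two slicing rules can be checked side by side with `.iloc` and `.loc`. This is a minimal sketch with illustrative values.

```python
import pandas as pd

s_int = pd.Series([10, 20, 30, 40, 50])                      # labels 0..4
s_lab = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_position = s_int.iloc[1:4]      # positions 1, 2, 3 (end excluded)
by_int_label = s_int.loc[1:4]      # labels 1, 2, 3, 4 (end included)
by_str_label = s_lab.loc['b':'d']  # labels b, c, d (end included)

print(len(by_position), len(by_int_label), len(by_str_label))
```

On an integer index, `.loc[1:4]` and `.iloc[1:4]` return different numbers of rows, which is a common source of off-by-one errors.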
# Boolean indexing
s = pd.Series(np.random.rand(3))
s[4] = None # add a missing value
print(s)
print('\n', s>0.5) # comparing a Series returns a new Series of bool values
print('\n', s[s>0.5]) # boolean indexing
0 0.733328
1 0.901651
2 0.371391
4 None
dtype: object
0 True
1 True
2 False
4 False
dtype: bool
0 0.733328
1 0.901651
dtype: object
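As a follow-up, here is a hedged sketch of boolean masks on a Series that contains a missing value, including how to combine two conditions. The numbers are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.7, 0.9, 0.3, np.nan])

mask = s > 0.5                     # NaN compares as False
high = s[mask]

# Combine conditions with & and |, wrapping each one in parentheses
mid = s[(s > 0.2) & (s < 0.8)]

present = s[s.notna()]             # drop the missing value explicitly

print(high.tolist(), mid.tolist(), len(present))
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.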
Viewing and Modifying Data
# Viewing data
s = pd.Series(np.random.rand(50))
print(s.head(6)) # .head() shows the first rows
print(s.tail(5)) # .tail() shows the last rows
0 0.314349
1 0.588733
2 0.103402
3 0.343712
4 0.643559
5 0.658695
dtype: float64
45 0.249720
46 0.322811
47 0.378098
48 0.004036
49 0.028688
dtype: float64
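Reindexing with reindex
reindex conforms the current Series to a new index.
It returns a new Series holding the values whose labels appear in the new index; labels with no match are filled with missing values (NaN) by default.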
s = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
s1 = s.reindex(['c', 'd', 'e', 'f', 'g'])
print(s)
print('\n', s1)
s2 = s.reindex(['c', 'd', 'e', 'f', 'g'], fill_value=0) # fill missing values with 0
print('\n', s2)
a 0.998153
b 0.290354
c 0.423444
d 0.717207
e 0.876524
dtype: float64
a 0.998153
b 0.290354
c 0.423444
d 0.717207
e 0.876524
dtype: float64
c 0.423444
d 0.717207
e 0.876524
f NaN
g NaN
dtype: float64
c 0.423444
d 0.717207
e 0.876524
f 0.000000
g 0.000000
dtype: float64
Alignment
Arithmetic between two Series aligns on the index: labels present in both are combined, and labels present in only one produce NaN.
s1 = pd.Series(np.random.rand(3), index=['jack', 'marry', 'tom'])
s2 = pd.Series(np.random.rand(3), index=['wang', 'marry', 'tom'])
print(s1, s2)
print('\n', s1+s2)
jack 0.970609
marry 0.348323
tom 0.965782
dtype: float64 wang 0.598366
marry 0.088275
tom 0.262476
dtype: float64
jack NaN
marry 0.436598
tom 1.228258
wang NaN
dtype: float64
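When NaN for unmatched labels is not wanted, the `add` method with `fill_value` is one option. The sketch below reuses the label names from the example with made-up numbers.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['jack', 'marry', 'tom'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['wang', 'marry', 'tom'])

plain = s1 + s2                    # jack and wang become NaN
filled = s1.add(s2, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; it is used here only to show the mechanism.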
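Deleting and Adding
- Deleting: drop()
inplace defaults to False, so drop() returns a copy with the elements removed; with inplace=True it modifies the original Series.
- Adding
Assign to a new integer key or a new label to add a value directly.
Series.append() returned a new Series and had no inplace parameter; it was removed in pandas 2.0, so use pd.concat() instead.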
s = pd.Series(np.random.rand(5), index=list('abcde'))
print(s,'\n')
s1 = s.drop('b')
print(s1)
print(s)
print('\n')
s.drop(['c', 'e'], inplace=True)
print(s)
a 0.429560
b 0.425173
c 0.102780
d 0.529564
e 0.100613
dtype: float64
a 0.429560
c 0.102780
d 0.529564
e 0.100613
dtype: float64
a 0.429560
b 0.425173
c 0.102780
d 0.529564
e 0.100613
dtype: float64
a 0.429560
b 0.425173
d 0.529564
dtype: float64
# Adding
s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index=list('asdfg'))
print(s1)
print(s2, '\n')
s1[5] = 100 # add by assigning to a new integer key
print(s1)
s2['e'] = 100 # add by assigning to a new label
print(s2, '\n')
s3 = pd.concat([s1, s2]) # returns a new Series (Series.append() was removed in pandas 2.0)
print(s3)
print(s1)
0 0.985752
1 0.476859
2 0.339323
3 0.529883
4 0.026883
dtype: float64
a 0.256522
s 0.828174
d 0.317496
f 0.118743
g 0.631222
dtype: float64
0 0.985752
1 0.476859
2 0.339323
3 0.529883
4 0.026883
5 100.000000
dtype: float64
a 0.256522
s 0.828174
d 0.317496
f 0.118743
g 0.631222
e 100.000000
dtype: float64
0 0.985752
1 0.476859
2 0.339323
3 0.529883
4 0.026883
5 100.000000
a 0.256522
s 0.828174
d 0.317496
f 0.118743
g 0.631222
e 100.000000
dtype: float64
0 0.985752
1 0.476859
2 0.339323
3 0.529883
4 0.026883
5 100.000000
dtype: float64
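In pandas 2.0 and later, Series.append() is no longer available, and pd.concat() covers the same use case. A minimal sketch with made-up values, also showing that drop() leaves its input untouched:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['c', 'd'])

combined = pd.concat([s1, s2])                       # keeps the original labels
renumbered = pd.concat([s1, s2], ignore_index=True)  # new labels 0..3

shorter = combined.drop(['a', 'd'])                  # returns a copy; combined is unchanged

print(combined)
print(renumbered)
print(shorter)
```

`ignore_index=True` is handy when the original labels carry no meaning and duplicates would otherwise appear.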
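DataFrame
A DataFrame is a tabular data structure made up of an ordered collection of columns, and each column can hold a different value type (numeric, string, and so on).
A DataFrame can be thought of as a "labeled two-dimensional array" with an index (row labels) and columns (column labels).
Creating a DataFrame
From a dict of lists
- The keys become the column labels and the lists become the values; all lists must have the same length.
- The columns parameter (a list) sets the column order. A column name that is not in the dict yields a column of NaN.
- The index parameter (a list) sets the row labels. Its length must match the number of rows.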
data = {
'name': ['jack', 'tom', 'harry'],
'age': [4, 5, 6],
'gender': ['m', 'f', 'm']
}
df = pd.DataFrame(data)
print(df)
print(type(df)) # the DataFrame type
print(df.index, type(df.index)) # index
print(df.columns, type(df.columns)) # columns
print(df.values, type(df.values)) # values
name age gender
0 jack 4 m
1 tom 5 f
2 harry 6 m
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
Index(['name', 'age', 'gender'], dtype='object') <class 'pandas.core.indexes.base.Index'>
[['jack' 4 'm']
['tom' 5 'f']
['harry' 6 'm']] <class 'numpy.ndarray'>
df1 = pd.DataFrame(data, columns=['age', 'gender', 'name'], index=['f1', 'f2', 'f3'])
print(df1)
age gender name
f1 4 m jack
f2 5 f tom
f3 6 m harry
From a dict of Series
The keys become the columns and the Series labels become the index.
If the Series have different lengths, the missing positions are filled with NaN.
data1 = {
'one': pd.Series(np.random.rand(2)),
'two': pd.Series(np.random.rand(3))
}
print(data1, '\n')
df1 = pd.DataFrame(data1) # the two Series differ in length, so the gap is filled with NaN
print(df1)
{'one': 0 0.376186
1 0.704161
dtype: float64, 'two': 0 0.989256
1 0.418456
2 0.854511
dtype: float64}
one two
0 0.376186 0.989256
1 0.704161 0.418456
2 NaN 0.854511
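From a two-dimensional array
The result is a DataFrame with the same shape as the array.
If index and columns are omitted, both default to integers; if they are given, their lengths must match the shape of the array.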
ar = np.random.rand(9).reshape(3, 3)
print(ar, '\n')
df1 = pd.DataFrame(ar)
print(df1, '\n')
df2 = pd.DataFrame(ar,
index=['a', 'b', 'c'],
columns=['one', 'two', 'three']) # specify index and columns at creation
print(df2)
[[0.81038884 0.28727062 0.43923942]
[0.66731215 0.27171132 0.34258084]
[0.04433758 0.71291395 0.75949802]]
0 1 2
0 0.810389 0.287271 0.439239
1 0.667312 0.271711 0.342581
2 0.044338 0.712914 0.759498
one two three
a 0.810389 0.287271 0.439239
b 0.667312 0.271711 0.342581
c 0.044338 0.712914 0.759498
From a list of dicts
Each dict is one row: the dict keys become the columns and the dict values become the values.
data = [
{'one':1, 'two': 2},
{'one':3, 'two':5, 'three':8}
]
print(data, '\n')
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index=['a', 'b']) # set the index
df3 = pd.DataFrame(data, columns=['one', 'two']) # choose the columns
print(df1, '\n')
print(df2, '\n')
print(df3)
[{'one': 1, 'two': 2}, {'one': 3, 'two': 5, 'three': 8}]
one three two
0 1 NaN 2
1 3 8.0 5
one three two
a 1 NaN 2
b 3 8.0 5
one two
0 1 2
1 3 5
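From a dict of dicts
This form is related to hierarchical indexing (multi-level labels).
- The outer keys become the columns and the inner keys become the index.
- The columns parameter can add or drop columns; new columns are filled with NaN.
- The index parameter cannot relabel the existing rows: it only selects among the inner keys.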
data = {
'jack': {'math':90, 'eng':86, 'art':78},
'marry': {'math':90, 'eng':87, 'art':70},
'tom': {'math':78, 'eng':83}
}
df1 = pd.DataFrame(data)
print(df1, '\n')
df2 = pd.DataFrame(data, columns=['jack', 'tom', 'bob'])
print(df2, '\n')
# df3 = pd.DataFrame(data, index=['a', 'b', 'c'])
# print(df3)
jack marry tom
art 78 70 NaN
eng 86 87 83.0
math 90 90 78.0
jack tom bob
art 78 NaN NaN
eng 86 83.0 NaN
math 90 78.0 NaN
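The commented-out df3 line hints at what happens when index is passed with a dict of dicts. The sketch below, using a trimmed copy of the data, shows the behavior as I understand it: index selects rows by inner key rather than renaming them.

```python
import pandas as pd

data = {
    'jack': {'math': 90, 'eng': 86, 'art': 78},
    'tom': {'math': 78, 'eng': 83},
}

# index picks rows by inner key; it does not rename them
subset = pd.DataFrame(data, index=['math', 'eng'])

# Labels that match no inner key give rows of NaN
unmatched = pd.DataFrame(data, index=['a', 'b'])

# To relabel rows, build the frame first and then rename
renamed = pd.DataFrame(data).rename(index={'math': 'mathematics'})

print(subset)
print(unmatched)
print(renamed)
```

The label 'mathematics' is a hypothetical replacement name used only for this illustration.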
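DataFrame Indexing
A DataFrame has both a row index and a column index; it can be seen as a dict of Series that share one index.
Selecting rows and columns
- Selecting columns
Use df[] with column labels.
- Selecting rows
Use .loc with index labels.
- Selecting several rows and columns
.loc[] selects by label.
.iloc[] selects by integer position.
df[] selects columns by default. Given a slice (such as df[0:1]) it selects rows instead and returns a DataFrame; a single integer is not accepted as a row position.
df[] cannot select a single row by its label: df['one'] looks for a column named 'one'.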
df = pd.DataFrame(np.random.rand(12).reshape(3, 4)*100,
index=['one', 'two', 'three'],
columns=['a', 'b', 'c', 'd'])
print(df, '\n')
# Select columns
df1 = df['a'] # one column, returns a Series
print(df1, type(df1))
df2 = df[['a', 'b']] # several columns, returns a DataFrame
print(df2, type(df2))
print('--------------\n')
# Select rows
df3 = df.loc['one'] # one row, returns a Series
print(df3, type(df3))
df4 = df.loc[['one', 'two']] # several rows, returns a DataFrame
print(df4, type(df4))
print('---------------\n')
df5 = df[0:1] # a positional slice with df[] selects rows; end excluded; returns a DataFrame
print('\n', df5)
df6 = df.loc['one': 'two'] # a label slice selects rows; both ends included; returns a DataFrame
print(df6, '\n')
df7 = df.iloc[0] # first row via iloc
print(df7)
df8 = df.iloc[0, 1] # value at the first row, second column via iloc
print(df8)
a b c d
one 16.083085 39.622660 26.864423 80.290848
two 67.677411 19.627380 74.511148 42.858250
three 31.898274 7.567095 21.909344 52.840104
one 16.083085
two 67.677411
three 31.898274
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
a b
one 16.083085 39.622660
two 67.677411 19.627380
three 31.898274 7.567095 <class 'pandas.core.frame.DataFrame'>
--------------
a 16.083085
b 39.622660
c 26.864423
d 80.290848
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
a b c d
one 16.083085 39.62266 26.864423 80.290848
two 67.677411 19.62738 74.511148 42.858250 <class 'pandas.core.frame.DataFrame'>
---------------
a b c d
one 16.083085 39.62266 26.864423 80.290848
a b c d
one 16.083085 39.62266 26.864423 80.290848
two 67.677411 19.62738 74.511148 42.858250
a 16.083085
b 39.622660
c 26.864423
d 80.290848
Name: one, dtype: float64
39.62265968392962
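Selecting rows and columns in a single step is done by passing two selectors to .loc or .iloc. A minimal sketch, using np.arange instead of random numbers so the results are predictable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

block_by_label = df.loc[['one', 'three'], ['a', 'd']]   # rows and columns by label
block_by_position = df.iloc[0:2, 1:3]                   # rows 0-1, columns 1-2
single_value = df.loc['two', 'c']                       # one cell

print(block_by_label)
print(block_by_position)
print(single_value)
```

The first selector always refers to rows and the second to columns, for both accessors.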
Boolean indexing
Works the same way as for a Series.
df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
index=['one', 'two', 'three', 'four'],
columns=['a', 'b', 'c', 'd'])
print(df)
print('-------------')
df1 = df<20
print(df1, type(df1)) # a DataFrame of True/False values
print(df[df1]) # keeps values where the condition holds; the rest become NaN
a b c d
one 19.945788 12.426571 84.385131 64.330791
two 16.446707 49.851884 50.606928 53.838039
three 2.429324 47.543116 30.089095 19.411060
four 13.263280 13.640146 92.664063 95.811193
-------------
a b c d
one True True False False
two True False False False
three True False False True
four True True False False <class 'pandas.core.frame.DataFrame'>
a b c d
one 19.945788 12.426571 NaN NaN
two 16.446707 NaN NaN NaN
three 2.429324 NaN NaN 19.41106
four 13.263280 13.640146 NaN NaN
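The example above masks individual cells. A common variation is to filter whole rows using a condition on one column; here is a hedged sketch with made-up numbers.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 25, 15], 'b': [30, 10, 40]},
                  index=['one', 'two', 'three'])

rows = df[df['a'] < 20]                       # keep whole rows where a < 20
both = df[(df['a'] < 20) & (df['b'] > 35)]    # combine conditions with &
cells = df.loc[df['a'] < 20, 'b']             # filter rows, then pick one column

print(rows)
print(both)
print(cells)
```

A boolean Series selects rows, while a boolean DataFrame (as in `df[df < 20]`) masks cell by cell.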
Viewing DataFrame Data
Viewing and transposing
- .head() shows the first rows
- .tail() shows the last rows
- .T transposes the DataFrame
df = pd.DataFrame(np.random.rand(16).reshape(8, 2)*100,
columns=['a', 'b'])
print(df.head(2))
print(df.tail())
print(df.T)
a b
0 27.428915 12.294751
1 81.578430 77.379900
a b
3 8.738676 67.126425
4 73.455421 66.751584
5 5.505302 13.314915
6 85.449624 7.665033
7 62.567230 5.998243
0 1 2 3 4 5 6 \
a 27.428915 81.57843 68.570921 8.738676 73.455421 5.505302 85.449624
b 12.294751 77.37990 83.701388 67.126425 66.751584 13.314915 7.665033
7
a 62.567230
b 5.998243
Adding, Modifying, and Deleting
Assign through an index to add or change values.
df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
index=['one', 'two', 'three', 'four'],
columns=['a', 'b', 'c', 'd'])
print(df)
print('\nAdd a row and a column by assignment')
df['e'] = 10
df.loc[4] = 20
print(df)
print('\nModify values directly after indexing')
df['e'] = 20
df[['a', 'c']] = 100
print(df)
df.iloc[::2] = 101
print(df)
a b c d
one 87.069143 40.209805 56.695802 41.019586
two 29.164346 66.945065 18.920092 75.434930
three 87.804254 18.652358 33.611541 62.496783
four 65.088697 2.681203 52.302716 64.559536
Add a row and a column by assignment
a b c d e
one 87.069143 40.209805 56.695802 41.019586 10
two 29.164346 66.945065 18.920092 75.434930 10
three 87.804254 18.652358 33.611541 62.496783 10
four 65.088697 2.681203 52.302716 64.559536 10
4 20.000000 20.000000 20.000000 20.000000 20
Modify values directly after indexing
a b c d e
one 100 40.209805 100 41.019586 20
two 100 66.945065 100 75.434930 20
three 100 18.652358 100 62.496783 20
four 100 2.681203 100 64.559536 20
4 100 20.000000 100 20.000000 20
a b c d e
one 101 101.000000 101 101.000000 101
two 100 66.945065 100 75.434930 20
three 101 101.000000 101 101.000000 101
four 100 2.681203 100 64.559536 20
4 101 101.000000 101 101.000000 101
# Deleting with del / drop
df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
columns=['a', 'b', 'c', 'd'])
print(df)
del df['a']
print(df)
print('\ndrop removes rows.\nBy default it returns a new DataFrame; set inplace=True to modify the original DataFrame')
print(df.drop(0))
print(df)
print('With inplace=True')
df.drop(0, inplace=True)
print(df)
print('\ndrop removes columns when axis=1')
print(df.drop(['b'], axis=1))
print(df)
a b c d
0 48.667648 89.599096 26.993105 83.242703
1 69.105642 20.474981 96.408243 97.863509
2 9.583234 67.181335 56.180255 21.870587
3 44.272351 32.549130 93.306515 87.357004
b c d
0 89.599096 26.993105 83.242703
1 20.474981 96.408243 97.863509
2 67.181335 56.180255 21.870587
3 32.549130 93.306515 87.357004
drop removes rows.
By default it returns a new DataFrame; set inplace=True to modify the original DataFrame
b c d
1 20.474981 96.408243 97.863509
2 67.181335 56.180255 21.870587
3 32.549130 93.306515 87.357004
b c d
0 89.599096 26.993105 83.242703
1 20.474981 96.408243 97.863509
2 67.181335 56.180255 21.870587
3 32.549130 93.306515 87.357004
With inplace=True
b c d
1 20.474981 96.408243 97.863509
2 67.181335 56.180255 21.870587
3 32.549130 93.306515 87.357004
drop removes columns when axis=1
c d
1 96.408243 97.863509
2 56.180255 21.870587
3 93.306515 87.357004
b c d
1 20.474981 96.408243 97.863509
2 67.181335 56.180255 21.870587
3 32.549130 93.306515 87.357004
Alignment
Arithmetic between DataFrames aligns automatically on both columns and index.
df1 = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(7, 3), columns=['A', 'B', 'C'])
print(df1+df2)
A B C D
0 1.153285 0.708743 1.664708 NaN
1 0.824788 0.529762 0.800729 NaN
2 1.032009 1.116637 0.038324 NaN
3 0.442938 1.965626 1.041656 NaN
4 1.833386 0.865264 0.325347 NaN
5 0.396912 1.022707 1.414600 NaN
6 0.764291 0.686421 1.648528 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
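As with Series, the `add` method with `fill_value` can reduce the number of NaN results when the frames only partly overlap. A small sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20], 'C': [30, 40]})

plain = df1 + df2                      # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)    # a value missing on one side counts as 0

print(plain)
print(filled)
```

A cell that is missing in both frames (row 2 of column C here) still comes out as NaN, because there is nothing to fill from either side.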
Sorting
- Sort by value: sort_values
The ascending parameter chooses ascending or descending order.
- Sort by index: sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4),
columns=['a', 'b', 'c', 'd'])
print(df1)
print(df1.sort_values(['a'], ascending=False)) # descending
print(df1.sort_values(['a'], ascending=True)) # ascending
print('\nSorting by several columns')
print(df1.sort_values(['a', 'c']))
a b c d
0 0.693605 0.309871 0.428986 0.761431
1 0.072342 0.694657 0.177274 0.022500
2 0.259270 0.988083 0.568393 0.361062
3 0.350362 0.063009 0.463876 0.573203
a b c d
0 0.693605 0.309871 0.428986 0.761431
3 0.350362 0.063009 0.463876 0.573203
2 0.259270 0.988083 0.568393 0.361062
1 0.072342 0.694657 0.177274 0.022500
a b c d
1 0.072342 0.694657 0.177274 0.022500
2 0.259270 0.988083 0.568393 0.361062
3 0.350362 0.063009 0.463876 0.573203
0 0.693605 0.309871 0.428986 0.761431
Sorting by several columns
a b c d
a b c d
1 0.072342 0.694657 0.177274 0.022500
2 0.259270 0.988083 0.568393 0.361062
3 0.350362 0.063009 0.463876 0.573203
0 0.693605 0.309871 0.428986 0.761431
# Sorting by index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
index=[5, 4, 3, 2],
columns=['a', 'b', 'c', 'd'])
print(df1)
print(df1.sort_index())
a b c d
5 99.185703 98.500810 19.644985 75.354804
4 3.602962 61.132418 45.643154 19.329648
3 71.545548 37.602546 84.432429 40.740473
2 55.051512 25.530674 74.241117 94.541445
a b c d
2 55.051512 25.530674 74.241117 94.541445
3 71.545548 37.602546 84.432429 40.740473
4 3.602962 61.132418 45.643154 19.329648
5 99.185703 98.500810 19.644985 75.354804
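With random floats the second sort key never breaks a tie, so multi-column sorting is easier to see on data with repeated values. The sketch below uses made-up numbers and also shows a per-column sort direction and sorting the column labels.

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'c': [0.1, 0.4, 0.3, 0.2]},
                  index=[5, 4, 3, 2])

# a ascending, then c descending within equal values of a
mixed = df.sort_values(['a', 'c'], ascending=[True, False])

index_desc = df.sort_index(ascending=False)             # row labels from high to low
columns_desc = df.sort_index(axis=1, ascending=False)   # column labels in reverse order

print(mixed)
print(index_desc)
print(columns_desc)
```

Both methods return new objects by default, so `df` itself keeps its original order.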
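Time Series
The datetime module
- .date()
datetime.date(year, month, day) returns a date.
datetime.date.today() returns the current date.
- .datetime()
datetime.datetime() returns a datetime object (a timestamp).
datetime.datetime.now() returns the current timestamp.
- .timedelta()
datetime.timedelta() represents a duration, such as the difference between two datetimes.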
import datetime
today = datetime.date.today() # today's date
print(today, type(today))
test_date = datetime.date(2018, 1, 1) # build a date with datetime.date
print(test_date, type(test_date))
2019-01-27 <class 'datetime.date'>
2018-01-01 <class 'datetime.date'>
# datetime
now = datetime.datetime.now() # current timestamp (the method is now(), lowercase)
print(now)
t1 = datetime.datetime(2018, 1, 1) # build a timestamp for 2018-01-01
print(t1)
2019-01-27 00:55:57.328016
2018-01-01 00:00:00
# timedelta
t2 = datetime.datetime(2018, 2, 1, 15, 0, 0)
print(t2)
print(t2-t1, type(t2-t1)) # subtracting two datetimes gives a timedelta
t1 = datetime.datetime(2018, 1, 1)
tx = datetime.timedelta(100) # build a timedelta of 100 days
print(tx, type(tx))
print(t1+tx, type(t1+tx)) # add a timedelta to a datetime
2018-02-01 15:00:00
31 days, 15:00:00 <class 'datetime.timedelta'>
100 days, 0:00:00 <class 'datetime.timedelta'>
2018-04-11 00:00:00 <class 'datetime.datetime'>
# Converting strings to dates with dateutil's parser.parse
# parse accepts many date formats
from dateutil.parser import parse
date = '12/21/2017'
date2 = '2001-1-1'
print(parse(date), type(parse(date)))
print(parse(date2), type(parse(date2)))
2017-12-21 00:00:00 <class 'datetime.datetime'>
2001-01-01 00:00:00 <class 'datetime.datetime'>
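parse guesses the layout of the string. When the layout is known in advance, the standard library's strptime is a stricter alternative, and strftime goes the other way. A minimal sketch; the format strings are chosen for this illustration.

```python
from datetime import datetime

# String -> datetime with an explicit format (month/day/year)
parsed = datetime.strptime('12/21/2017', '%m/%d/%Y')

# datetime -> string
iso_text = parsed.strftime('%Y-%m-%d')

print(parsed, iso_text)
```

strptime raises a ValueError if the string does not match the format, which can be preferable to a silent wrong guess.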
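Pandas Timestamps: Timestamp
- pandas.Timestamp()
- pandas.to_datetime()
A single value is converted to a pandas Timestamp; several values are converted to a pandas DatetimeIndex.
If the input contains values that cannot be parsed as dates, set the errors parameter:
1) errors='ignore' returns the original input unparsed when parsing fails (this option is deprecated in recent pandas releases);
2) errors='coerce' replaces unparseable values with NaT and returns a DatetimeIndex.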
date1 = datetime.datetime(2018, 1, 1, 12, 23, 34) # create a datetime
date2 = '2018-1-1' # a string
# Create pandas Timestamps
t1 = pd.Timestamp(date1)
t2 = pd.Timestamp(date2)
print(t1, type(t1))
print(t2, type(t2))
print(pd.Timestamp('2018-1-1 12:23:34'))
2018-01-01 12:23:34 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 12:23:34
# Convert to pandas time types
date1 = datetime.datetime(2018, 1, 1, 12, 23, 34) # create a datetime
date2 = '2018-1-1' # a string
t1 = pd.to_datetime(date1)
t2 = pd.to_datetime(date2)
print(t1, type(t1))
print(t2, type(t2))
# A list returns a DatetimeIndex
list_date = ['2018-01-01', '2018-01-02', '2018-01-03']
t3 = pd.to_datetime(list_date)
print('\n', t3, type(t3))
print('\nWhen the list contains values that are not dates, set errors')
date3 = ['2018-01-01', '2018-01-02', '2018-01-03', 'hello world', '2018-01-04']
t4 = pd.to_datetime(date3, errors='ignore')
print(t4, type(t4))
t5 = pd.to_datetime(date3, errors='coerce')
print(t5, type(t5))
2018-01-01 12:23:34 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
When the list contains values that are not dates, set errors
['2018-01-01' '2018-01-02' '2018-01-03' 'hello world' '2018-01-04'] <class 'numpy.ndarray'>
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', 'NaT', '2018-01-04'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
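After coercing, the NaT entries usually need to be located or removed. The sketch below shows one way to do that, plus an explicit format for a single string; the input values are illustrative.

```python
import pandas as pd

raw = ['2018-01-01', '2018-01-02', 'hello world', '2018-01-04']

parsed = pd.to_datetime(raw, errors='coerce')   # 'hello world' becomes NaT
bad_mask = parsed.isna()                        # True where parsing failed
clean = parsed[~bad_mask]                       # keep only the valid timestamps

# With a known layout, an explicit format avoids guessing (day/month/year here)
explicit = pd.to_datetime('21/12/2017', format='%d/%m/%Y')

print(parsed)
print(clean)
print(explicit)
```

The mask can also be used to report which raw strings failed, for example by pairing it with `raw`.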
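Pandas Timestamp Index: DatetimeIndex
pd.DatetimeIndex() and pd.date_range()
pd.DatetimeIndex() builds a timestamp index directly from a list of str or datetime.datetime values; pd.date_range() generates one from a start, an end, or a number of periods.
A single timestamp is a Timestamp; a collection of timestamps is a DatetimeIndex.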
rng = pd.DatetimeIndex(['1/1/2018', '1/2/2018', '1/3/2018'])
print(rng, type(rng))
print(rng[0], type(rng[0]))
st = pd.Series(np.random.rand(len(rng)), index=rng) # build a Series indexed by the DatetimeIndex
print('\n',st, type(st))
print(st.index)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 0.645372
2018-01-02 0.768804
2018-01-03 0.303102
dtype: float64 <class 'pandas.core.series.Series'>
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)
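For regularly spaced timestamps, pd.date_range() generates the index without listing every date. The sketch below also shows selecting from a time-indexed Series; the dates and values are made up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2018-01-01', periods=5, freq='D')   # five consecutive days
st = pd.Series(np.arange(5), index=rng)

# Date strings work as slice bounds; label slices include both ends
window = st.loc['2018-01-02':'2018-01-04']

# A Timestamp selects a single value
one_day = st.loc[pd.Timestamp('2018-01-03')]

print(rng)
print(window)
print(one_day)
```

Because these are label slices, the end date is included, matching the label-slicing rule described for Series.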
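General Features
Numerical computation and basic statistics
Text data
Merging: merge, join
Concatenating and patching: concat, combine_first
Deduplication and replacement
Grouping data
Group transforms and the general "split-apply-combine" pattern
Pivot tables and cross-tabulations
Reading files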