
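Data Analyst Course 02: Pandas

Pandas: the core toolkit for data analysis

Built on NumPy and designed for data analysis

  • Series + DataFrame
  • Reads data directly and processes it (simple and efficient)
  • Compatible with many databases
  • Supports a wide range of analysis algorithms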
import numpy as np
import pandas as pd

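Data structures

Every pandas data structure carries an index.

Series

A Series can be thought of as a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects, and so on). Its axis labels are collectively called the index.

.index returns the index of the Series; with default labels it is a RangeIndex.
.values returns the values of the Series as an ndarray.

  • Compared with an ndarray, a Series is an array that carries its own index (a one-dimensional array plus matching labels)
  • Indexing and slicing a Series work much like they do on an ndarray
  • Compared with a dict, a Series is an ordered dict whose index-to-value mapping resembles the dict's key-to-value mapping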
ar = np.random.rand(5)
s = pd.Series(ar)
print(ar)
print(s, type(s))
print('--------------')
print(list(s.index), type(s.index)) # index
print(s.values, type(s.values)) # values
[0.32461186 0.63422701 0.51008673 0.16219166 0.40639174]
0    0.324612
1    0.634227
2    0.510087
3    0.162192
4    0.406392
dtype: float64 <class 'pandas.core.series.Series'>
--------------
[0, 1, 2, 3, 4] <class 'pandas.core.indexes.range.RangeIndex'>
[0.32461186 0.63422701 0.51008673 0.16219166 0.40639174] <class 'numpy.ndarray'>
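
The comparison with a dict can be made concrete with a short sketch. The labels and values below are made up for illustration and are not from the course material.

```python
import pandas as pd

# Illustrative data (an assumption, not from the original text)
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Like a dict: membership tests and lookups go through the labels
has_b = 'b' in s            # checks the index, not the values
b_value = s['b']
missing = s.get('z', -1)    # .get() returns a default for a missing label

# The labels travel with the values, so converting back to a dict is direct
as_dict = s.to_dict()
print(has_b, b_value, missing, as_dict)
```

Note that `in` tests the labels rather than the values, which is the same behavior as a dict's keys.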

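Creating a Series

dtype sets the storage data type; name sets the name of the Series.

# Create from a dict: the keys become the index and the values become the values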
dic = {'a': 1, 'b':2, 'c':3}
s = pd.Series(dic)
print(s, type(s))
a    1
b    2
c    3
dtype: int64 <class 'pandas.core.series.Series'>
# Create from a one-dimensional array
arr = np.random.rand(10)
s = pd.Series(arr, index=list('abcdefghjk'), dtype=str, name='test') # np.str was removed from NumPy; use the built-in str
print(s, type(s))
a    0.32061579565266873
b     0.6422417901855576
c     0.4752166664686672
d    0.14271215219716993
e     0.3484167803947562
f    0.02810385749477773
g     0.4921923545502085
h     0.3364856517354894
j     0.5452820708357551
k     0.6106163951939324
Name: test, dtype: object <class 'pandas.core.series.Series'>
# Create from a scalar
s = pd.Series(100, index=range(4))
print(s)
0    100
1    100
2    100
3    100
dtype: int64
# .rename() returns a new Series with the new name; the original Series is unchanged

print(s)
s2 = s.rename('hhh')
print(s2)
0    100
1    100
2    100
3    100
dtype: int64
0    100
1    100
2    100
3    100
Name: hhh, dtype: int64

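Series indexing

# Positional indexing
# Similar to a list but not identical; for example, -1 is not supported with the default integer labels

s = pd.Series(np.random.rand(10))
print(s)
print(s[5], type(s[5])) # returns a float64 value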
0    0.729838
1    0.095203
2    0.180626
3    0.187282
4    0.390732
5    0.417309
6    0.153421
7    0.209588
8    0.921143
9    0.453336
dtype: float64
0.4173088101449465 <class 'numpy.float64'>
# Label indexing

s = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s['a'], type(s['a']))
print('-------------')
# Selecting several labels returns a new Series
print(s[['a', 'b', 'e']])
a    0.374045
b    0.506414
c    0.756893
d    0.348560
e    0.675542
dtype: float64
0.37404462792248505 <class 'numpy.float64'>
-------------
a    0.374045
b    0.506414
e    0.675542
dtype: float64
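Series slicing

Slicing a Series by integer position is half-open (the end is excluded); slicing by label is closed (both ends are included).
(If the labels of the Series are integers, a slice behaves like a positional slice and is half-open.)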

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s1[1:4], '\n', s1[4])
print('------------')
print(s2['a':'c'], '\n', s2['c'])
1    0.137954
2    0.237755
3    0.990893
dtype: float64 
 0.9702501216319398
------------
a    0.299157
b    0.435536
c    0.236996
dtype: float64 
 0.23699606339117407
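
When the labels are themselves integers, the two slicing rules are easy to confuse. A minimal sketch, using made-up data, that states the intent explicitly with .iloc (position) and .loc (label):

```python
import pandas as pd

# Illustrative Series whose integer labels differ from the positions
s = pd.Series([10, 20, 30, 40, 50], index=[5, 6, 7, 8, 9])

by_position = s.iloc[0:2]   # positions 0 and 1; the end is excluded
by_label = s.loc[5:7]       # labels 5, 6 and 7; the end is included

print(by_position.tolist())
print(by_label.tolist())
```

Using .iloc and .loc avoids relying on how plain `s[...]` interprets an integer slice.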
# Boolean indexing

s = pd.Series(np.random.rand(3))
s[4] = None # add a missing value
print(s)
print('\n', s>0.5) # a comparison returns a new Series of bool values
print('\n', s[s>0.5]) # boolean indexing
0    0.733328
1    0.901651
2    0.371391
4        None
dtype: object

 0     True
1     True
2    False
4    False
dtype: bool

 0    0.733328
1    0.901651
dtype: object
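
Boolean masks can also be combined. The sketch below uses made-up values and shows an element-wise AND plus a mask that drops missing values; it is one possible illustration, not part of the original course code.

```python
import numpy as np
import pandas as pd

# Illustrative data containing one missing value
s = pd.Series([0.2, 0.7, np.nan, 0.9, 0.4])

# Element-wise AND; each comparison needs its own parentheses
mask = (s > 0.3) & (s < 0.8)
selected = s[mask]

# notnull() gives a mask that is False for missing values
present = s[s.notnull()]

print(selected)
print(present)
```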

Viewing and modifying data

# Viewing data
s = pd.Series(np.random.rand(50))
print(s.head(6)) # .head() shows the first rows
print(s.tail(5)) # .tail() shows the last rows
0    0.314349
1    0.588733
2    0.103402
3    0.343712
4    0.643559
5    0.658695
dtype: float64
45    0.249720
46    0.322811
47    0.378098
48    0.004036
49    0.028688
dtype: float64
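Reindexing with reindex

reindex conforms the current Series to a new index.
It returns a new Series holding the values whose labels appear in the new index; labels with no match in the original are treated as missing data (NaN).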

s = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
s1 = s.reindex(['c', 'd', 'e', 'f', 'g'])
print(s)
print('\n', s1)
s2 = s.reindex(['c', 'd', 'e', 'f', 'g'], fill_value=0) # fill missing values with 0
print('\n', s2)
a    0.998153
b    0.290354
c    0.423444
d    0.717207
e    0.876524
dtype: float64
a    0.998153
b    0.290354
c    0.423444
d    0.717207
e    0.876524
dtype: float64

 c    0.423444
d    0.717207
e    0.876524
f         NaN
g         NaN
dtype: float64

 c    0.423444
d    0.717207
e    0.876524
f    0.000000
g    0.000000
dtype: float64
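
Besides a constant fill_value, reindex can carry the last known value forward. A small sketch under the assumption of a sorted integer index (the data is made up):

```python
import pandas as pd

# Illustrative Series with gaps in its integer labels
s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

# method='ffill' fills each new label with the previous available value;
# it requires the original index to be sorted
filled = s.reindex(range(6), method='ffill')
print(filled)
```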
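Alignment

Arithmetic between Series aligns on the index: values with matching labels are combined, and a label that appears in only one Series yields NaN.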

s1 = pd.Series(np.random.rand(3), index=['jack', 'marry', 'tom'])
s2 = pd.Series(np.random.rand(3), index=['wang', 'marry', 'tom'])
print(s1, s2)
print('\n', s1+s2)
jack     0.970609
marry    0.348323
tom      0.965782
dtype: float64 wang     0.598366
marry    0.088275
tom      0.262476
dtype: float64

 jack          NaN
marry    0.436598
tom      1.228258
wang          NaN
dtype: float64
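
If NaN for unmatched labels is not wanted, the .add() method accepts a fill_value. A minimal sketch with made-up numbers:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['jack', 'marry', 'tom'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['wang', 'marry', 'tom'])

plain = s1 + s2                      # unmatched labels become NaN
filled = s1.add(s2, fill_value=0)    # a missing side is treated as 0

print(plain)
print(filled)
```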
Deleting and adding
s = pd.Series(np.random.rand(5), index=list('abcde'))
print(s,'\n')
s1 = s.drop('b')
print(s1)
print(s)
print('\n')
s.drop(['c', 'e'], inplace=True)
print(s)
a    0.429560
b    0.425173
c    0.102780
d    0.529564
e    0.100613
dtype: float64 

a    0.429560
c    0.102780
d    0.529564
e    0.100613
dtype: float64
a    0.429560
b    0.425173
c    0.102780
d    0.529564
e    0.100613
dtype: float64


a    0.429560
b    0.425173
d    0.529564
dtype: float64
# Adding

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index=list('asdfg'))
print(s1)
print(s2, '\n')
s1[5] = 100 # add by integer label
print(s1)
s2['e'] = 100 # add by label
print(s2, '\n')

s3 = pd.concat([s1, s2]) # returns a new Series; Series.append() was removed in pandas 2.0
print(s3)
print(s1)
0    0.985752
1    0.476859
2    0.339323
3    0.529883
4    0.026883
dtype: float64
a    0.256522
s    0.828174
d    0.317496
f    0.118743
g    0.631222
dtype: float64 

0      0.985752
1      0.476859
2      0.339323
3      0.529883
4      0.026883
5    100.000000
dtype: float64
a      0.256522
s      0.828174
d      0.317496
f      0.118743
g      0.631222
e    100.000000
dtype: float64 

0      0.985752
1      0.476859
2      0.339323
3      0.529883
4      0.026883
5    100.000000
a      0.256522
s      0.828174
d      0.317496
f      0.118743
g      0.631222
e    100.000000
dtype: float64
0      0.985752
1      0.476859
2      0.339323
3      0.529883
4      0.026883
5    100.000000
dtype: float64
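
When joining Series with pd.concat (the replacement for Series.append in pandas 2.0 and later), duplicate labels are kept unless ignore_index=True is passed. A short sketch with made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])

kept = pd.concat([s1, s2])                             # original labels are kept
renumbered = pd.concat([s1, s2], ignore_index=True)    # labels are renumbered from 0

print(kept.index.tolist())
print(renumbered.index.tolist())
```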

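DataFrame

A DataFrame is a tabular data structure made up of an ordered collection of columns, whose values can be numbers, strings, and so on.
A DataFrame can be thought of as a "labeled two-dimensional array" with an index (row labels) and columns (column labels).

Creating a DataFrame

From a dict of lists
  • Each key becomes a column label and each list becomes that column's values; all lists must have the same length
  • The columns argument sets the column order (as a list). A column name that is not in the dict produces a column of NaN.
  • The index argument sets the row labels (as a list). Its length must match the number of rows.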
data = {
    'name': ['jack', 'tom', 'harry'],
    'age': [4, 5, 6],
    'gender': ['m', 'f', 'm']
}
df = pd.DataFrame(data)
print(df)
print(type(df)) # DataFrame type
print(df.index, type(df.index)) # index
print(df.columns, type(df.columns)) # columns
print(df.values, type(df.values)) # values
    name  age gender
0   jack    4      m
1    tom    5      f
2  harry    6      m
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
Index(['name', 'age', 'gender'], dtype='object') <class 'pandas.core.indexes.base.Index'>
[['jack' 4 'm']
 ['tom' 5 'f']
 ['harry' 6 'm']] <class 'numpy.ndarray'>
df1 = pd.DataFrame(data, columns=['age', 'gender', 'name'], index=['f1', 'f2', 'f3'])
print(df1)
    age gender   name
f1    4      m   jack
f2    5      f    tom
f3    6      m  harry
From a dict of Series

Each key becomes a column of the DataFrame, and the labels of the Series become its index.
If the Series differ in length, the missing positions in the resulting DataFrame are filled with NaN.

data1 = {
    'one': pd.Series(np.random.rand(2)),
    'two': pd.Series(np.random.rand(3))
}
print(data1, '\n')
df1 = pd.DataFrame(data1) # the two Series differ in length, so the gap is filled with NaN
print(df1)
{'one': 0    0.376186
1    0.704161
dtype: float64, 'two': 0    0.989256
1    0.418456
2    0.854511
dtype: float64} 

        one       two
0  0.376186  0.989256
1  0.704161  0.418456
2       NaN  0.854511
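From a two-dimensional array

A two-dimensional array produces a DataFrame of the same shape.
If index and columns are not given, both default to integers; if they are given, their lengths must match the shape of the array.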

ar = np.random.rand(9).reshape(3, 3)
print(ar, '\n')
df1 = pd.DataFrame(ar)
print(df1, '\n')
df2 = pd.DataFrame(ar,
                  index=['a', 'b', 'c'],
                  columns=['one', 'two', 'three']) # set index and columns at creation
print(df2)
[[0.81038884 0.28727062 0.43923942]
 [0.66731215 0.27171132 0.34258084]
 [0.04433758 0.71291395 0.75949802]] 

          0         1         2
0  0.810389  0.287271  0.439239
1  0.667312  0.271711  0.342581
2  0.044338  0.712914  0.759498 

        one       two     three
a  0.810389  0.287271  0.439239
b  0.667312  0.271711  0.342581
c  0.044338  0.712914  0.759498
From a list of dicts

The keys of each dict become the columns of the DataFrame, and the values become its values.

data = [
    {'one':1, 'two': 2},
    {'one':3, 'two':5, 'three':8}
]
print(data, '\n')
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data, index=['a', 'b']) # set the index
df3 = pd.DataFrame(data, columns=['one', 'two']) # set the columns
print(df1, '\n')
print(df2, '\n')
print(df3)
[{'one': 1, 'two': 2}, {'one': 3, 'two': 5, 'three': 8}] 

   one  three  two
0    1    NaN    2
1    3    8.0    5 

   one  three  two
a    1    NaN    2
b    3    8.0    5 

   one  two
0    1    2
1    3    5
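From a dict of dicts

This involves nested labels (hierarchical, multi-level keys).

  • The outer keys become the columns of the DataFrame, and the inner keys become its index
  • The columns argument can select or add columns; new columns are filled with NaN
  • The existing index cannot be replaced with new labels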
data = {
    'jack': {'math':90, 'eng':86, 'art':78},
    'marry': {'math':90, 'eng':87, 'art':70},
    'tom': {'math':78, 'eng':83}
}
df1 = pd.DataFrame(data)
print(df1, '\n')
df2 = pd.DataFrame(data, columns=['jack', 'tom', 'bob'])
print(df2, '\n')
# df3 = pd.DataFrame(data, index=['a', 'b', 'c'])
# print(df3)
      jack  marry   tom
art     78     70   NaN
eng     86     87  83.0
math    90     90  78.0 

      jack   tom  bob
art     78   NaN  NaN
eng     86  83.0  NaN
math    90  78.0  NaN 

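DataFrame indexing

A DataFrame has both a row index and a column index, and can be viewed as a dict of Series that share one index.

Selecting rows and columns
  • Selecting columns
    Select columns by their column labels with df[]
  • Selecting rows
    Select rows by their index labels with .loc
  • Selecting multiple rows and columns
    .loc[] selects by label
    .iloc[] selects by position

df[] selects columns by default. Given an integer slice it selects rows instead; only slices work for this, and the result is a DataFrame.
df[] cannot select a row by its label (df['one'] looks for a column named 'one').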

df = pd.DataFrame(np.random.rand(12).reshape(3, 4)*100,
                 index=['one', 'two', 'three'],
                 columns=['a', 'b', 'c', 'd'])
print(df, '\n')

# Select columns
df1 = df['a'] # one column, returns a Series
print(df1, type(df1))
df2 = df[['a', 'b']] # several columns, returns a DataFrame
print(df2, type(df2))
print('--------------\n')

# Select rows
df3 = df.loc['one'] # one row, returns a Series
print(df3, type(df3))
df4 = df.loc[['one', 'two']] # several rows, returns a DataFrame
print(df4, type(df4))
print('---------------\n')
df5 = df[0:1] # df[] with a positional slice selects rows, end excluded, returns a DataFrame
print('\n', df5)
df6 = df.loc['one': 'two'] # a label slice selects rows, both ends included, returns a DataFrame
print(df6, '\n')

df7 = df.iloc[0] # iloc selects the first row
print(df7)
df8 = df.iloc[0, 1] # iloc selects the value in the first row, second column
print(df8)
               a          b          c          d
one    16.083085  39.622660  26.864423  80.290848
two    67.677411  19.627380  74.511148  42.858250
three  31.898274   7.567095  21.909344  52.840104 

one      16.083085
two      67.677411
three    31.898274
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          b
one    16.083085  39.622660
two    67.677411  19.627380
three  31.898274   7.567095 <class 'pandas.core.frame.DataFrame'>
--------------

a    16.083085
b    39.622660
c    26.864423
d    80.290848
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
             a         b          c          d
one  16.083085  39.62266  26.864423  80.290848
two  67.677411  19.62738  74.511148  42.858250 <class 'pandas.core.frame.DataFrame'>
---------------


              a         b          c          d
one  16.083085  39.62266  26.864423  80.290848
             a         b          c          d
one  16.083085  39.62266  26.864423  80.290848
two  67.677411  19.62738  74.511148  42.858250 

a    16.083085
b    39.622660
c    26.864423
d    80.290848
Name: one, dtype: float64
39.62265968392962
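
Selecting rows and columns together follows the same label/position split. The sketch below uses a small made-up frame with predictable values so the results are easy to check:

```python
import numpy as np
import pandas as pd

# Illustrative frame holding the integers 0..11
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

cell = df.loc['two', 'c']                        # single value by row and column label
block = df.loc[['one', 'three'], ['a', 'd']]     # lists of row and column labels
corner = df.iloc[:2, 1:3]                        # rows 0-1 and columns 1-2 by position

print(cell)
print(block)
print(corner)
```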
Boolean indexing

Works the same way as for a Series.

df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                 index=['one', 'two', 'three', 'four'],
                 columns=['a', 'b', 'c', 'd'])
print(df)
print('-------------')

df1 = df<20
print(df1, type(df1)) # a DataFrame of True/False values
print(df[df1]) # a DataFrame that keeps the values meeting the condition; the rest become NaN
               a          b          c          d
one    19.945788  12.426571  84.385131  64.330791
two    16.446707  49.851884  50.606928  53.838039
three   2.429324  47.543116  30.089095  19.411060
four   13.263280  13.640146  92.664063  95.811193
-------------
          a      b      c      d
one    True   True  False  False
two    True  False  False  False
three  True  False  False   True
four   True   True  False  False <class 'pandas.core.frame.DataFrame'>
               a          b   c         d
one    19.945788  12.426571 NaN       NaN
two    16.446707        NaN NaN       NaN
three   2.429324        NaN NaN  19.41106
four   13.263280  13.640146 NaN       NaN
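
A common variation is to filter whole rows by a condition on one column, rather than masking individual cells. A minimal sketch with made-up values:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'a': [5, 25, 15], 'b': [30, 10, 40]},
                  index=['one', 'two', 'three'])

# A boolean Series built from one column selects entire rows
rows = df[df['a'] < 20]

# Conditions on several columns are combined with & (and) or | (or)
both = df[(df['a'] < 20) & (df['b'] > 35)]

print(rows)
print(both)
```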

Viewing DataFrame data

Viewing and transposing data
  • .head() shows the first rows
  • .tail() shows the last rows
  • .T transposes the DataFrame
df = pd.DataFrame(np.random.rand(16).reshape(8, 2)*100,
                 columns=['a', 'b'])
print(df.head(2))
print(df.tail())
print(df.T)
           a          b
0  27.428915  12.294751
1  81.578430  77.379900
           a          b
3   8.738676  67.126425
4  73.455421  66.751584
5   5.505302  13.314915
6  85.449624   7.665033
7  62.567230   5.998243
           0         1          2          3          4          5          6  \
a  27.428915  81.57843  68.570921   8.738676  73.455421   5.505302  85.449624   
b  12.294751  77.37990  83.701388  67.126425  66.751584  13.314915   7.665033   

           7  
a  62.567230  
b   5.998243  
Adding / modifying / deleting

Assign values through indexing

df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                 index=['one', 'two', 'three', 'four'],
                 columns=['a', 'b', 'c', 'd'])
print(df)

print('\nAdd a new row/column and assign values')
df['e'] = 10
df.loc[4] = 20
print(df)
print('\nModify values in place after indexing')
df['e'] = 20
df[['a', 'c']] = 100
print(df)
df.iloc[::2] = 101
print(df)
               a          b          c          d
one    87.069143  40.209805  56.695802  41.019586
two    29.164346  66.945065  18.920092  75.434930
three  87.804254  18.652358  33.611541  62.496783
four   65.088697   2.681203  52.302716  64.559536

Add a new row/column and assign values
               a          b          c          d   e
one    87.069143  40.209805  56.695802  41.019586  10
two    29.164346  66.945065  18.920092  75.434930  10
three  87.804254  18.652358  33.611541  62.496783  10
four   65.088697   2.681203  52.302716  64.559536  10
4      20.000000  20.000000  20.000000  20.000000  20

Modify values in place after indexing
         a          b    c          d   e
one    100  40.209805  100  41.019586  20
two    100  66.945065  100  75.434930  20
three  100  18.652358  100  62.496783  20
four   100   2.681203  100  64.559536  20
4      100  20.000000  100  20.000000  20
         a           b    c           d    e
one    101  101.000000  101  101.000000  101
two    100   66.945065  100   75.434930   20
three  101  101.000000  101  101.000000  101
four   100    2.681203  100   64.559536   20
4      101  101.000000  101  101.000000  101
# Deleting: del / drop
df = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                 columns=['a', 'b', 'c', 'd'])
print(df)
del df['a']
print(df)

print('\ndrop removes rows.\nIt returns a new DataFrame; set inplace=True to modify the original DataFrame')
print(df.drop(0))
print(df)
print('With inplace=True')
df.drop(0, inplace=True)
print(df)

print('\ndrop removes columns when axis=1')
print(df.drop(['b'], axis=1))
print(df)
           a          b          c          d
0  48.667648  89.599096  26.993105  83.242703
1  69.105642  20.474981  96.408243  97.863509
2   9.583234  67.181335  56.180255  21.870587
3  44.272351  32.549130  93.306515  87.357004
           b          c          d
0  89.599096  26.993105  83.242703
1  20.474981  96.408243  97.863509
2  67.181335  56.180255  21.870587
3  32.549130  93.306515  87.357004

drop removes rows.
It returns a new DataFrame; set inplace=True to modify the original DataFrame
           b          c          d
1  20.474981  96.408243  97.863509
2  67.181335  56.180255  21.870587
3  32.549130  93.306515  87.357004
           b          c          d
0  89.599096  26.993105  83.242703
1  20.474981  96.408243  97.863509
2  67.181335  56.180255  21.870587
3  32.549130  93.306515  87.357004
With inplace=True
           b          c          d
1  20.474981  96.408243  97.863509
2  67.181335  56.180255  21.870587
3  32.549130  93.306515  87.357004

drop removes columns when axis=1
           c          d
1  96.408243  97.863509
2  56.180255  21.870587
3  93.306515  87.357004
           b          c          d
1  20.474981  96.408243  97.863509
2  67.181335  56.180255  21.870587
3  32.549130  93.306515  87.357004
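
As an alternative to the axis argument, drop also accepts the index= and columns= keywords, which make the intent explicit. A short sketch with made-up data:

```python
import pandas as pd

# Illustrative frame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

no_b = df.drop(columns=['b'])     # same effect as df.drop(['b'], axis=1)
no_row0 = df.drop(index=[0])      # same effect as df.drop([0])

print(no_b)
print(no_row0)
print(df.shape)                   # the original frame is unchanged
```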
Alignment

Arithmetic between DataFrames aligns automatically on both columns and index.

df1 = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(7, 3), columns=['A', 'B', 'C'])
print(df1+df2)
          A         B         C   D
0  1.153285  0.708743  1.664708 NaN
1  0.824788  0.529762  0.800729 NaN
2  1.032009  1.116637  0.038324 NaN
3  0.442938  1.965626  1.041656 NaN
4  1.833386  0.865264  0.325347 NaN
5  0.396912  1.022707  1.414600 NaN
6  0.764291  0.686421  1.648528 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
Sorting
  • Sort by value with sort_values
    The ascending argument chooses ascending or descending order
  • Sort by index with sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4),
                  columns=['a', 'b', 'c', 'd'])
print(df1)
print(df1.sort_values(['a'], ascending=False)) # descending
print(df1.sort_values(['a'], ascending=True)) # ascending
print('\nSort by multiple columns')
print(df1.sort_values(['a', 'c']))
          a         b         c         d
0  0.693605  0.309871  0.428986  0.761431
1  0.072342  0.694657  0.177274  0.022500
2  0.259270  0.988083  0.568393  0.361062
3  0.350362  0.063009  0.463876  0.573203
          a         b         c         d
0  0.693605  0.309871  0.428986  0.761431
3  0.350362  0.063009  0.463876  0.573203
2  0.259270  0.988083  0.568393  0.361062
1  0.072342  0.694657  0.177274  0.022500
          a         b         c         d
1  0.072342  0.694657  0.177274  0.022500
2  0.259270  0.988083  0.568393  0.361062
3  0.350362  0.063009  0.463876  0.573203
0  0.693605  0.309871  0.428986  0.761431

Sort by multiple columns
          a         b         c         d
1  0.072342  0.694657  0.177274  0.022500
2  0.259270  0.988083  0.568393  0.361062
3  0.350362  0.063009  0.463876  0.573203
0  0.693605  0.309871  0.428986  0.761431
# Sort by index
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4)*100,
                  index=[5, 4, 3, 2],
                  columns=['a', 'b', 'c', 'd'])
print(df1)
print(df1.sort_index())
           a          b          c          d
5  99.185703  98.500810  19.644985  75.354804
4   3.602962  61.132418  45.643154  19.329648
3  71.545548  37.602546  84.432429  40.740473
2  55.051512  25.530674  74.241117  94.541445
           a          b          c          d
2  55.051512  25.530674  74.241117  94.541445
3  71.545548  37.602546  84.432429  40.740473
4   3.602962  61.132418  45.643154  19.329648
5  99.185703  98.500810  19.644985  75.354804
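
With random floats the second sort key rarely matters, because ties in the first column are unlikely. The sketch below uses made-up data with repeated values in column a, so the effect of a per-column ascending list is visible:

```python
import pandas as pd

# Illustrative frame with ties in column 'a'
df = pd.DataFrame({'a': [2, 1, 2, 1], 'c': [0.3, 0.9, 0.1, 0.5]})

# Sort by 'a' ascending, then break ties by 'c' descending
ordered = df.sort_values(['a', 'c'], ascending=[True, False])

# sort_index also accepts ascending=False
by_index_desc = df.sort_index(ascending=False)

print(ordered)
print(by_index_desc)
```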

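Time series

The datetime module

  • .date()
    datetime.date() takes (year, month, day) and returns a date
    date.today() returns the current date
  • .datetime()
    datetime.datetime() returns a datetime (timestamp)
    datetime.datetime.now() returns the current datetime
  • .timedelta()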
import datetime

today = datetime.date.today() # today's date
print(today, type(today))

test_date = datetime.date(2018, 1, 1) # build a date with datetime.date
print(test_date, type(test_date))
2019-01-27 <class 'datetime.date'>
2018-01-01 <class 'datetime.date'>
# datetime
now = datetime.datetime.now() # current datetime; the method name is lowercase now()
print(now)
t1 = datetime.datetime(2018, 1, 1) # build a datetime for 2018-01-01
print(t1)
2019-01-27 00:55:57.328016
2018-01-01 00:00:00
# timedelta
t2 = datetime.datetime(2018, 2, 1, 15, 00, 00)
print(t2)
print(t2-t1, type(t2-t1)) # the difference between two datetimes is a timedelta

t1 = datetime.datetime(2018, 1, 1)
tx = datetime.timedelta(100) # build a timedelta of 100 days
print(tx, type(tx))
print(t1+tx, type(t1+tx)) # add or subtract time with a timedelta
2018-02-01 15:00:00
31 days, 15:00:00 <class 'datetime.timedelta'>
100 days, 0:00:00 <class 'datetime.timedelta'>
2018-04-11 00:00:00 <class 'datetime.datetime'>
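
A timedelta can also be broken into components or built from keyword arguments. A small sketch reusing the dates from the example above:

```python
import datetime

t1 = datetime.datetime(2018, 1, 1)
t2 = datetime.datetime(2018, 2, 1, 15, 0, 0)
delta = t2 - t1

days = delta.days                            # whole days in the difference
seconds = delta.seconds                      # leftover seconds within the last day
total_hours = delta.total_seconds() / 3600   # the full difference expressed in hours

# Keyword arguments make the units explicit
later = t1 + datetime.timedelta(weeks=1, hours=12)

print(days, seconds, total_hours, later)
```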
# Converting strings to dates with parser.parse
# parse accepts many date and time formats
from dateutil.parser import parse

date = '12/21/2017'
date2 = '2001-1-1'
print(parse(date), type(parse(date)))
print(parse(date2), type(parse(date2)))
2017-12-21 00:00:00 <class 'datetime.datetime'>
2001-01-01 00:00:00 <class 'datetime.datetime'>
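
When the format is known in advance, the standard library alone can do the conversion in both directions. This is an alternative sketch, not the approach used in the course example, which relies on dateutil:

```python
import datetime

# strptime parses a string using an explicit format
parsed = datetime.datetime.strptime('12/21/2017', '%m/%d/%Y')

# strftime formats a datetime back into a string
text = parsed.strftime('%Y-%m-%d')

print(parsed, text)
```

An explicit format avoids ambiguity between month-first and day-first strings.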

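Pandas timestamps: Timestamp

  • pandas.Timestamp()
  • pandas.to_datetime()
    A single value is converted to a pandas Timestamp; multiple values are converted to a pandas DatetimeIndex.
    If a sequence contains values that cannot be parsed as dates, set the errors parameter:
    1) errors='ignore': when parsing fails, return the original input as an ndarray (deprecated since pandas 2.2)
    2) errors='coerce': replace unparseable values with NaT and return a DatetimeIndex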
date1 = datetime.datetime(2018, 1, 1, 12, 23, 34) # create a datetime
date2 = '2018-1-1' # a string

# Create pandas Timestamps
t1 = pd.Timestamp(date1) 
t2 = pd.Timestamp(date2)
print(t1, type(t1))
print(t2, type(t2))
print(pd.Timestamp('2018-1-1 12:23:34'))
2018-01-01 12:23:34 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 12:23:34
# Convert to pandas time objects
date1 = datetime.datetime(2018, 1, 1, 12, 23, 34) # create a datetime
date2 = '2018-1-1' # a string

t1 = pd.to_datetime(date1)
t2 = pd.to_datetime(date2)
print(t1, type(t1))
print(t2, type(t2))

# Passing a list returns a DatetimeIndex
list_date = ['2018-01-01', '2018-01-02', '2018-01-03']
t3 = pd.to_datetime(list_date)
print('\n', t3, type(t3))

print('\nWhen the list contains values that are not dates, use the errors parameter')
date3 = ['2018-01-01', '2018-01-02', '2018-01-03', 'hello world', '2018-01-04']
t4 = pd.to_datetime(date3, errors='ignore') # returns the input unchanged; deprecated since pandas 2.2
print(t4, type(t4))
t5 = pd.to_datetime(date3, errors='coerce') # unparseable values become NaT
print(t5, type(t5))
2018-01-01 12:23:34 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

 DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

When the list contains values that are not dates, use the errors parameter
['2018-01-01' '2018-01-02' '2018-01-03' 'hello world' '2018-01-04'] <class 'numpy.ndarray'>
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', 'NaT', '2018-01-04'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
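
After coercing, the NaT entries can be located and removed with a boolean mask. A minimal sketch, assuming the input strings follow the %Y-%m-%d format (the sample list is made up):

```python
import pandas as pd

raw = ['2018-01-01', '2018-01-02', 'hello world', '2018-01-04']

# An explicit format plus errors='coerce' turns anything unparseable into NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

bad = parsed.isna()      # True where parsing failed
clean = parsed[~bad]     # keep only the valid timestamps

print(bad)
print(clean)
```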

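Pandas timestamp index: DatetimeIndex

pd.DatetimeIndex() and pd.date_range()
pd.DatetimeIndex() builds a timestamp index directly from str or datetime.datetime values; pd.date_range() generates one from a start, an end, or a number of periods.
A single timestamp is a Timestamp; a collection of timestamps is a DatetimeIndex.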

rng = pd.DatetimeIndex(['1/1/2018', '1/2/2018', '1/3/2018'])
print(rng, type(rng))
print(rng[0], type(rng[0]))

st = pd.Series(np.random.rand(len(rng)), index=rng) # build a Series with the DatetimeIndex as its index
print('\n',st, type(st))
print(st.index)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
2018-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

 2018-01-01    0.645372
2018-01-02    0.768804
2018-01-03    0.303102
dtype: float64 <class 'pandas.core.series.Series'>
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)
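
A Series indexed by a DatetimeIndex can be looked up with date strings. The sketch below builds the index with pd.date_range and uses made-up integer values so the results are easy to verify:

```python
import pandas as pd

# Five consecutive days starting on 2018-01-01
rng = pd.date_range('2018-01-01', periods=5, freq='D')
st = pd.Series(range(5), index=rng)

one_day = st.loc['2018-01-03']               # a date string selects by label
span = st.loc['2018-01-02':'2018-01-04']     # label slices include both ends

print(one_day)
print(span)
```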

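General functionality

Numerical computation and basic statistics

Text data

Merging: merge, join

Concatenating and patching: concat, combine_first

Deduplication and replacement

Grouping data

Group transforms and the general "split-apply-combine" pattern

Pivot tables and cross-tabulations

Reading files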
