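Computing the difference between two dates, generating custom time series, and resampling with pandas

Computing the difference between two dates

Suppose we have two dates and want to know how many years, how many months (counting the years as well), and so on lie between them. How do we do that?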

import pandas as pd

s1 = "2019-2-14 11:44:51"
s2 = "2018-5-22 12:11:13"

# Use the pd.Period class. Subtracting two Periods that share a freq
# returns an offset whose .n is the number of freq units between them.
# Note: newer pandas releases (2.2+) deprecate the "H", "T" and "S" aliases
# in favor of "h", "min" and "s".

# Y: total difference in years
print(
    (pd.Period(s1, freq="Y") - pd.Period(s2, freq="Y")).n
)  # 1

# M: total difference in months; the years are counted too, and the same holds below
print(
    (pd.Period(s1, freq="M") - pd.Period(s2, freq="M")).n
)  # 9

# W: total difference in weeks
print(
    (pd.Period(s1, freq="W") - pd.Period(s2, freq="W")).n
)  # 38

# D: total difference in days
print(
    (pd.Period(s1, freq="D") - pd.Period(s2, freq="D")).n
)  # 268

# H: total difference in hours
print(
    (pd.Period(s1, freq="H") - pd.Period(s2, freq="H")).n
)  # 6431

# T: total difference in minutes
print(
    (pd.Period(s1, freq="T") - pd.Period(s2, freq="T")).n
)  # 385893

# S: total difference in seconds
print(
    (pd.Period(s1, freq="S") - pd.Period(s2, freq="S")).n
)  # 23153618

# ms: total difference in milliseconds
print(
    (pd.Period(s1, freq="ms") - pd.Period(s2, freq="ms")).n
)  # 23153618000

# us: total difference in microseconds
print(
    (pd.Period(s1, freq="us") - pd.Period(s2, freq="us")).n
)  # 23153618000000

# ns: total difference in nanoseconds
print(
    (pd.Period(s1, freq="ns") - pd.Period(s2, freq="ns")).n
)  # 23153618000000000
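
One point worth illustrating: Period subtraction compares calendar units, so it is not the same as elapsed time. The sketch below contrasts it with plain Timestamp subtraction, which yields a Timedelta. The variable names (`elapsed`, `period_days`, `year_gap`) and the second pair of dates are made up for illustration.

```python
import pandas as pd

s1 = "2019-2-14 11:44:51"
s2 = "2018-5-22 12:11:13"

# Elapsed time: subtracting two Timestamps yields a Timedelta
elapsed = pd.Timestamp(s1) - pd.Timestamp(s2)
print(elapsed)                  # 267 days 23:33:38
print(elapsed.days)             # 267 (whole days actually elapsed)
print(elapsed.total_seconds())  # 23153618.0

# Calendar difference: Period subtraction compares the calendar days themselves
period_days = (pd.Period(s1, freq="D") - pd.Period(s2, freq="D")).n
print(period_days)              # 268

# Two dates one day apart still differ by one "year" as Periods
year_gap = (pd.Period("2019-01-01", freq="Y") - pd.Period("2018-12-31", freq="Y")).n
print(year_gap)                 # 1
```

If you need "how much time has passed", the Timedelta is usually the better fit; if you need "how many calendar months or years apart", the Period difference is.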

Generating a time series

How do we use pandas to generate exactly the time series we want?

import pandas as pd


print(
    pd.date_range("2010-1-1", "2018-1-1", freq="Y")
)
"""
DatetimeIndex(['2010-12-31', '2011-12-31', '2012-12-31', '2013-12-31',
               '2014-12-31', '2015-12-31', '2016-12-31', '2017-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
"""

print(
    pd.date_range("2010-1-1", "2010-6-12", freq="M")
)
"""
DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
               '2010-05-31'],
              dtype='datetime64[ns]', freq='M')
"""

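# Pass a start and an end, plus a freq (the same aliases that pd.Period accepts).
# The result contains every point on the freq grid between start and end;
# "Y" and "M" are anchored to year end and month end, which is why the dates above are 12-31 and month-end days.
# "Y" steps one year at a time; "2Y" steps two years, and "3Y", "4Y" and so on work too.
# Note: newer pandas releases (2.2+) deprecate "Y" and "M" here in favor of "YE" and "ME".


# If freq is omitted, it defaults to "D" (daily).
# If freq is omitted but periods is given, the range from start to end is divided into evenly spaced points.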
print(
    pd.date_range("2010-1-1", "2010-6-12", periods=8)
)
"""
DatetimeIndex([          '2010-01-01 00:00:00',
               '2010-01-24 03:25:42.857142857',
               '2010-02-16 06:51:25.714285714',
               '2010-03-11 10:17:08.571428572',
               '2010-04-03 13:42:51.428571429',
               '2010-04-26 17:08:34.285714286',
               '2010-05-19 20:34:17.142857144',
                         '2010-06-12 00:00:00'],
              dtype='datetime64[ns]', freq=None)
"""

# If only the start is given, with no end,
# then both periods and freq are required. freq can be "T" or a multiple such as "3T" or "4T"; the same goes for "Y" and "M".
print(
    pd.date_range("2010-1-1", freq="3T", periods=8)
)
"""
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 00:03:00',
               '2010-01-01 00:06:00', '2010-01-01 00:09:00',
               '2010-01-01 00:12:00', '2010-01-01 00:15:00',
               '2010-01-01 00:18:00', '2010-01-01 00:21:00'],
              dtype='datetime64[ns]', freq='3T')
"""
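
As a small supplement to the date_range calls above, here is a sketch with a few other frequency aliases that behave the same way in old and new pandas versions: "MS" (month start), a multiple such as "2D", and "B" (business days). The variable names are only illustrative.

```python
import pandas as pd

# Month starts between the two dates; 2010-01-01 is itself on the grid
month_starts = pd.date_range("2010-1-1", "2010-6-12", freq="MS")
print(list(month_starts.strftime("%Y-%m-%d")))
# ['2010-01-01', '2010-02-01', '2010-03-01', '2010-04-01', '2010-05-01', '2010-06-01']

# Start plus periods plus a multiple of a base frequency
every_other_day = pd.date_range("2010-1-1", periods=4, freq="2D")
print(list(every_other_day.strftime("%Y-%m-%d")))
# ['2010-01-01', '2010-01-03', '2010-01-05', '2010-01-07']

# Business days skip the weekend (2010-01-01 was a Friday)
business_days = pd.date_range("2010-1-1", periods=3, freq="B")
print(list(business_days.strftime("%Y-%m-%d")))
# ['2010-01-01', '2010-01-04', '2010-01-05']
```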

Most people probably know the date_range function above already; now let's look at a more surprising one.

import pandas as pd


print(
    pd.to_datetime([1, 2, 5, 7],
                   origin=pd.Timestamp("2010-1-1"),
                   unit="W")
)  # DatetimeIndex(['2010-01-07', '2010-01-14', '2010-02-04', '2010-02-18'], dtype='datetime64[ns]', freq=None)

print(
    pd.to_datetime([1, 2, 5, 7],
                   origin=pd.Timestamp("2010-1-1"),
                   unit="D")
)  # DatetimeIndex(['2010-01-02', '2010-01-03', '2010-01-06', '2010-01-08'], dtype='datetime64[ns]', freq=None)

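"""
The first argument of pd.to_datetime is usually a date-formatted string, but it can also be an integer.
Integers are added to origin (the Unix epoch by default); whether an integer counts weeks, days, hours and so on is set by unit.
Note that the "W" result above starts at 2010-01-07 rather than 2010-01-08: the origin appears to be truncated
to a whole number of units since the Unix epoch before the integers are added.
The possible values for unit are:

Y: years (not recommended)
M: months (not recommended)
W: weeks
D: days
h: hours
m: minutes
s: seconds
ms: milliseconds
us: microseconds
ns: nanoseconds
"""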

print(
    pd.to_datetime([1, 2, 5, 7],
                   origin=pd.Timestamp("2010-1-1"),
                   unit="m")
)
"""
DatetimeIndex(['2010-01-01 00:01:00', '2010-01-01 00:02:00',
               '2010-01-01 00:05:00', '2010-01-01 00:07:00'],
              dtype='datetime64[ns]', freq=None)

"""
# This way we control exactly how much time is added.
# Since origin defaults to the Unix epoch, unit="s" also converts Unix timestamps into datetimes.
import time
print(
    pd.to_datetime([time.time(), time.time() - (1 << 12)], unit="s")
)  # DatetimeIndex(['2020-05-05 13:48:32.933386326', '2020-05-05 12:40:16.933386326'], dtype='datetime64[ns]', freq=None)


# A quick word on pd.Timestamp: it is similar to Python's datetime
print(pd.Timestamp(2018, 1, 1))  # 2018-01-01 00:00:00
print(pd.Timestamp(2018, 1, 1, 11, 21, 33))  # 2018-01-01 11:21:33

# There is also pd.Timedelta, which is similar to Python's timedelta.
# Like Python's timedelta, it accepts keyword arguments such as weeks, hours and days.
print(pd.Timedelta(hours=5))  # 0 days 05:00:00
print(pd.Timedelta(weeks=5))  # 35 days 00:00:00
print(pd.Timedelta(5, unit="h"))  # 0 days 05:00:00
# unit works here as well, but "Y" and "M" (years and months) are again not recommended and are being removed;
# recent pandas versions raise a ValueError for the call below
print(pd.Timedelta(5, unit="M"))  # 152 days 04:25:30 (older pandas only)
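
Timestamp and Timedelta can also be combined directly, which gives another way to "add N units to a starting date". The sketch below is an alternative to the to_datetime(origin=..., unit=...) calls above, not what the original code does; `origin`, `steps`, `by_day` and `by_week` are illustrative names.

```python
import pandas as pd

origin = pd.Timestamp("2010-1-1")
steps = [1, 2, 5, 7]

# Timestamp + TimedeltaIndex gives a DatetimeIndex
by_day = origin + pd.to_timedelta(steps, unit="D")
print(list(by_day.strftime("%Y-%m-%d")))
# ['2010-01-02', '2010-01-03', '2010-01-06', '2010-01-08']

# Whole weeks counted exactly from the origin
by_week = pd.DatetimeIndex([origin + pd.Timedelta(weeks=n) for n in steps])
print(list(by_week.strftime("%Y-%m-%d")))
# ['2010-01-08', '2010-01-15', '2010-02-05', '2010-02-19']
```

Because the offsets are added to the origin itself, the weekly result lands exactly 7, 14, 35 and 49 days after 2010-01-01.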

Converting Timestamps

Dates in pandas are represented as Timestamp or DatetimeIndex. Put simply, a single date is a Timestamp, and several Timestamps collected together form a DatetimeIndex.

import pandas as pd

# to_datetime on a list returns a DatetimeIndex
# pd.Series([date, date]) gives a Series instead, which does not expose the DatetimeIndex methods directly
s = pd.to_datetime(["2019-1-1", "2018-1-1"])
print(s)  # DatetimeIndex(['2019-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
print(type(s[0]))  # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
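
For the Series case mentioned in the comment above, the datetime attributes are reached through the `.dt` accessor. A minimal sketch, with `ser` as an illustrative name:

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2019-1-1", "2018-1-1"]))

# The .dt accessor exposes datetime attributes and methods on a Series
print(ser.dt.year.tolist())        # [2019, 2018]
print(ser.dt.day_name().tolist())  # ['Tuesday', 'Monday']

# Wrapping the Series in pd.DatetimeIndex gets the index type back
idx = pd.DatetimeIndex(ser)
print(type(idx).__name__)          # DatetimeIndex
```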

Both can be converted to other types.

import pandas as pd

s = ["2018年12月3号", "2019年1月11号", "2018年10月31号"]
dt = pd.to_datetime(s, format="%Y年%m月%d号")
print(dt)  # DatetimeIndex(['2018-12-03', '2019-01-11', '2018-10-31'], dtype='datetime64[ns]', freq=None)

# Convert to Period. pd.Period accepts only a single value,
# but to_period converts many dates at once
print(dt.to_period(freq="M"))  # PeriodIndex(['2018-12', '2019-01', '2018-10'], dtype='period[M]', freq='M')

s2 = ["2018年1月3号", "2019年11月11号", "2018年5月31号"]
dt2 = pd.to_datetime(s2, format="%Y年%m月%d号")
print(
    [x.n for x in dt.to_period(freq="M") - dt2.to_period(freq="M")]
)  # [11, -10, 5]

# Convert to a NumPy array
print(dt.to_numpy())
"""
['2018-12-03T00:00:00.000000000' '2019-01-11T00:00:00.000000000'
 '2018-10-31T00:00:00.000000000']
"""

# Convert to Python datetime objects
print(dt.to_pydatetime())
"""
[datetime.datetime(2018, 12, 3, 0, 0) datetime.datetime(2019, 1, 11, 0, 0)
 datetime.datetime(2018, 10, 31, 0, 0)]
"""
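
Two more conversions that may be useful alongside the ones above: computing the month difference with plain integer arithmetic on the `year` and `month` attributes (an alternative to the list comprehension over Period differences), and formatting back to strings with `strftime`. This sketch uses ISO-formatted input strings for the same dates; `months` and `labels` are illustrative names.

```python
import pandas as pd

dt = pd.to_datetime(["2018-12-03", "2019-01-11", "2018-10-31"])
dt2 = pd.to_datetime(["2018-01-03", "2019-11-11", "2018-05-31"])

# Month difference from the year and month attributes, element by element
months = (dt.year - dt2.year) * 12 + (dt.month - dt2.month)
print(months.tolist())  # [11, -10, 5]

# Format the dates back into strings
labels = dt.strftime("%Y/%m/%d")
print(labels.tolist())  # ['2018/12/03', '2019/01/11', '2018/10/31']
```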

resample

resample regroups time-indexed data into bins of a new frequency. Let's look at how it is used.

import pandas as pd

df = pd.DataFrame(
    {"dt": pd.date_range("1/1/2020", periods=9, freq="T"),
     "value": range(1, 10)}
)

print(df)
"""
                   dt  value
0 2020-01-01 00:00:00      1
1 2020-01-01 00:01:00      2
2 2020-01-01 00:02:00      3
3 2020-01-01 00:03:00      4
4 2020-01-01 00:04:00      5
5 2020-01-01 00:05:00      6
6 2020-01-01 00:06:00      7
7 2020-01-01 00:07:00      8
8 2020-01-01 00:08:00      9
"""

print(df.set_index("dt").resample("3T"))
# DatetimeIndexResampler [freq=<3 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]

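# resample("3T") groups the rows into 3-minute bins; an aggregation such as sum() is then applied to each bin
# resample needs a datetime-like index on the DataFrame (or a datetime column passed via on=)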
print(df.set_index("dt").resample("3T").sum())
"""
                     value
dt                        
2020-01-01 00:00:00      6
2020-01-01 00:03:00     15
2020-01-01 00:06:00     24
"""
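
The resampler printed above reports closed=left and label=left, and sum() is only one possible aggregation. The sketch below, under the assumption that the same nine-row frame is used, shows several aggregations at once and what changes when the bins are closed and labelled on the right. It uses the "min" alias, which is accepted by both old and new pandas versions, and passes the column via `on=` instead of calling set_index; `stats` and `right` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"dt": pd.date_range("1/1/2020", periods=9, freq="min"),
     "value": range(1, 10)}
)

# Several aggregations per 3-minute bin, resampling on the "dt" column
stats = df.resample("3min", on="dt")["value"].agg(["sum", "mean"])
print(stats["sum"].tolist())   # [6, 15, 24]
print(stats["mean"].tolist())  # [2.0, 5.0, 8.0]

# Bins closed on the right: 00:00 forms its own bin, then (00:00, 00:03], and so on
right = df.resample("3min", on="dt", closed="right", label="right")["value"].sum()
print(right.tolist())          # [1, 9, 18, 17]
```

With the default left-closed bins, each label is the start of its 3-minute window; with closed="right" and label="right", each label is the end of its window, so the first row falls into a bin of its own.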
