python – Pandas: keep only the first row of data in each 60-second bin

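What is the best way in pandas to keep only the first row of each 60-second bin of data? That is, for every row that occurs at increasing time t, I want to drop all rows that occur up to 60 seconds after it.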

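I know I can use some combination of groupby().first(), but the code examples I've seen (e.g. using pandas.Grouper(freq='60s')) throw away the original datetimes and label each bin by its 60-second offset from midnight instead of keeping my original datetimes.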

For example, the following:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
2  2016-05-11 13:02:05.760903860     0.155666
3  2016-05-11 13:02:18.325613076     0.157788
4  2016-05-11 13:02:18.486519052     0.157788
5  2016-05-11 13:02:20.243748548     0.157788
6  2016-05-11 13:02:20.533101692     0.157788
7  2016-05-11 13:02:20.646061652     0.157788
8  2016-05-11 13:02:21.121409820     0.157788
9  2016-05-11 13:04:24.660609068     0.211649
10 2016-05-11 13:04:24.660845612     0.211649
11 2016-05-11 13:04:24.660957596     0.211649
12 2016-05-11 13:04:24.661378132     0.211649
13 2016-05-11 13:04:24.661450628     0.211649
14 2016-05-11 13:04:24.661607044     0.211649

should become this:

                            time        value
0  2016-05-11 13:00:10.841015028     0.215978
1  2016-05-11 13:02:05.760595780     0.155666
9  2016-05-11 13:04:24.660609068     0.211649
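
For contrast, here is a minimal sketch of the Grouper approach mentioned above, applied to four of the sample rows. The frame name sample and the dropna() step are choices made for this illustration. The result is labeled by bin edges such as 13:02:00 rather than by the original timestamps, and because the bins are fixed to the clock, two rows a few seconds apart can still land in different bins.

```python
import pandas as pd

# Four rows taken from the sample data in the question.
times = pd.to_datetime([
    "2016-05-11 13:00:10.841015028",
    "2016-05-11 13:02:05.760595780",
    "2016-05-11 13:02:18.325613076",
    "2016-05-11 13:04:24.660609068",
])
sample = pd.DataFrame({"time": times,
                       "value": [0.215978, 0.155666, 0.157788, 0.211649]})

# Fixed 60-second bins aligned to the clock; first() takes the first row per bin.
binned = sample.groupby(pd.Grouper(key="time", freq="60s")).first()

# Bins that contain no rows come back as NaN, so drop them.
binned = binned.dropna()

# The index now holds the bin edges, not the original timestamps.
print(binned)
```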

Solution:

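import numpy as np
import pandas as pd

def td60(ta):
    # ta: sorted array of datetime64 values
    d = np.timedelta64(60, 's')   # window length, with an explicit unit
    tp = ta + d                   # end of the window opened by each row
    j = 0                         # position of the most recently kept row
    yield j
    for i, tx in enumerate(ta):
        # keep row i only if it falls after the current window
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    # positional selection keeps the original index, column names and dtypes
    return df.iloc[slc]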

Usage example

pir(df)

Timing setup with 500,000 rows

pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()

df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))

Timing
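
Outside Jupyter, one way to take the measurement is the standard-library timeit module. The sketch below restates the functions so it runs on its own and uses a smaller frame (10,000 rows, an arbitrary size chosen here) to keep the run short. The names pool, picked and small are illustrative, and no particular timing figure is implied.

```python
import timeit

import numpy as np
import pandas as pd

def td60(ta):
    d = np.timedelta64(60, 's')   # 60-second window
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df["time"].values))
    return df.iloc[slc]

# Smaller version of the timing setup: random sorted timestamps on a 5-second grid.
rng = np.random.default_rng(0)
pool = pd.date_range("2016-09-08", periods=20_000, freq="5s")
picked = np.sort(rng.choice(pool.values, 10_000, replace=False))
small = pd.DataFrame({"time": picked, "value": rng.random(10_000)})

# Best of three single runs, in seconds.
seconds = min(timeit.repeat(lambda: pir(small), number=1, repeat=3))
print(f"best of 3: {seconds:.4f} s for {len(small)} rows")
```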

Cythonize
In Jupyter

%load_ext Cython

%%cython
import numpy as np
import pandas as pd

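def td60(ta):
    # ta: sorted array of datetime64 values
    d = np.timedelta64(60, 's')   # window length, with an explicit unit
    tp = ta + d                   # end of the window opened by each row
    j = 0                         # position of the most recently kept row
    yield j
    for i, tx in enumerate(ta):
        # keep row i only if it falls after the current window
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    # positional selection keeps the original index, column names and dtypes
    return df.iloc[slc]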

After Cythonizing
Not much difference
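
A likely reason is that the cell adds no type declarations, so the compiled loop still compares NumPy datetime scalars one at a time as Python objects. As a sketch of one alternative in plain Python (not a measured result), the timestamps can be converted to integer nanoseconds first so the loop only handles Python ints. The function name keep_positions and the window_ns parameter are made up for this illustration.

```python
import numpy as np

def keep_positions(times, window_ns=60 * 10**9):
    """Return positions of rows that start a new 60-second window.

    times must be sorted; window_ns is the window length in nanoseconds.
    """
    # Convert to integer nanoseconds, then to a list of Python ints.
    ns = np.asarray(times, dtype="datetime64[ns]").astype("int64").tolist()
    kept = []
    limit = None  # end of the window opened by the last kept row
    for i, t in enumerate(ns):
        if limit is None or t > limit:
            kept.append(i)
            limit = t + window_ns
    return kept

stamps = np.array([
    "2016-05-11T13:00:10.841015028",
    "2016-05-11T13:02:05.760595780",
    "2016-05-11T13:02:18.325613076",
    "2016-05-11T13:04:24.660609068",
], dtype="datetime64[ns]")

positions = keep_positions(stamps)
print(positions)
```

The positions can then be passed to df.iloc in the same way as the list built from the generator.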

Reference setup for the OP's example

from io import StringIO  # Python 3; the StringIO module only exists in Python 2
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""

df = pd.read_csv(StringIO(text), parse_dates=[0])
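
As a check, the generator approach from the solution can be run on this frame. The functions are restated here so the snippet runs on its own. Selecting with df.iloc rather than rebuilding the frame from df.values is a choice made for this sketch so that the column names are kept.

```python
from io import StringIO

import numpy as np
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""

df = pd.read_csv(StringIO(text), parse_dates=[0])

def td60(ta):
    d = np.timedelta64(60, 's')   # 60-second window
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df["time"].values))
    return df.iloc[slc]

kept = pir(df)
print(kept.index.tolist())  # [0, 1, 9]
```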
