algorithm – 压缩unix时间戳,精确到微秒

我的文件由一系列具有微秒精度的实时unix时间戳组成,即时间戳永远不会减少.所有需要编码/解码的时间戳都来自同一天.文件中的示例条目可能类似于1364281200.078739,对应于自epoch以来的1364281200078739 usecs.数据间隔不均匀且有界.

我需要实现大约10位/时间戳的压缩.目前,我能够通过计算连续时间戳之间的差异来压缩到31位/时间戳的平均值.我怎样才能进一步提高？

编辑：

我们计算压缩程度为(编码文件的大小,以字节为单位)/(时间戳数)* 8.我把时间戳分成两部分’.’之后.整数部分非常恒定,两个整数部分时间戳之间的最大差异为32,因此我使用0-8位对其进行编码.精度部分非常随机,所以我忽略了前导位并使用0-21位写入文件(最大值可以是999999).但我的编码文件的大小为4007674字节,因此压缩为71.05位/ TS.我也写”.以及两个时间戳之间的空格以便稍后解码.如何改进编码文件的大小？

以下是部分数据集的链接 –
http://pastebin.com/QBs9Bqv0

以下是差分时间戳值的链接,以微秒为单位 –
http://pastebin.com/3QJk1NDV
最大差异b / w时间戳是 – 32594136微秒.

解决方法

如果取每个时间戳和前一个时间戳之间的间隔并以微秒(即整数)表示,则样本文件中每位深度的值分布为：

所以52.285％的值是0或1,只有少数其他值低于64(2~6位),27.59％的值是7~12位,有一个相当均匀的分布约2.1每位的百分比高达20位,仅高于20位的3％,最大为25位.
查看数据,很明显有很多连续6个零的序列.

这些观察结果给了我每个值使用可变位大小的想法,如下所示：

00 0xxxxx                             0 (xxxxx is the number of consecutive zeros)
00 1xxxxx                             1 (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                    2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx           15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  23-30 bit values

快速测试表明,这导致每个时间戳的压缩率为13.78位,这不是您想要的10位,但对于简单的方案来说并不是一个糟糕的开端.

在对样本数据进行更多分析之后,我观察到有很多连续0和1的短序列,如0 1 0,所以我用这个替换了1字节的方案：

00xxxxxx    00     = identifies a one-byte value
            xxxxxx = index in the sequence table

序列表：

index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq     index ~ seq
  0      0       2     00       6     000     14    0000     30   00000     62   000000
  1      1       3     01       7     001     15    0001     31   00001     63   000001
                 4     10       8     010     16    0010     32   00010    
                 5     11         ...            ...           ...
                               11     101     27    1101     59   11101    
                               12     110     28    1110     60   11110    
                               13     111     29    1111     61   11111

对于具有451,210个时间戳的示例文件,这会将编码文件大小降低到676,418字节,或每个时间戳11.99位.

测试上述方法显示,在较大的间隔之间存在98,578个单零和31,271个单个零.所以我尝试使用每个较大间隔的1位来存储它是否后跟零,这将编码大小减少到592,315字节.当我使用2位来存储更大的间隔是否跟随0,1或00(最常见的序列)时,编码的大小减少到564,034字节,或每个时间戳10.0004位.
然后我改为使用以下大间隔存储单个0和1而不是前一个(纯粹出于代码简单性的原因)并且发现这导致文件大小为563.884字节,或每个时间戳9.997722位！

所以完整的方法是：

Store the first timestamp (8 bytes),then store the intervals as either: 

00 iiiiii                             sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                     2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx           13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx  21-28 bit values (1,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1),a one (if XX=2) or two zeros (if XX=3)
xxx... = 12,20 or 28 bit value

编码器示例：

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

void write_timestamp(ofstream& ofile,uint64_t timestamp) {    // big-endian
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--,timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes,8);
}

int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin",ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp;

    if (ifile >> seconds) {
        timestamp = seconds * 1000000;
        write_timestamp(ofile,timestamp);
    }

    while (! ifile.eof()) {
        uint8_t bytesize = 0,len = 0,seq = 0,bytes[4];
        uint32_t interval;

        while (bytesize == 0 && ifile >> seconds) {
            interval = seconds * 1000000 - timestamp;
            timestamp += interval;

            if (interval < 2) {
                seq <<= 1; seq |= interval;
                if (++len == 5 && seq > 0 || len == 6) bytesize = 1;
            } else {
                while (interval >> ++bytesize * 8 + 4);
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;
                }
                bytes[0] |= (bytesize++ << 6);
            }
        }
        if (len) {
            if (bytesize > 1 && (len == 1 || len == 2 && seq == 0)) {
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                seq += (1 << len) - 2;
                ofile.write((char*) &seq,1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes,bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}

解码器示例：

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

uint64_t read_timestamp(ifstream& ifile) {    // big-endian
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte,1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

uint8_t read_interval(ifstream& ifile,uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes,1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i,1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile,uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

uint8_t write_sequence(ofstream& ofile,uint8_t seq,uint64_t timestamp) {
    uint8_t interval = 0,len = 1,offset = 1;
    while (seq >= (offset <<= 1)) {
        seq -= offset;
        ++len;
    }
    while (len--) {
        interval += (seq >> len) & 1;
        write_seconds(ofile,timestamp + interval);
    }
    return interval;
}

int main() {
    ifstream ifile ("timestamps.bin",ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt",ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile,timestamp);

    while (! ifile.eof()) {
        uint8_t bytes[4],bytesize = read_interval(ifile,bytes);
        uint32_t interval;

        if (bytesize == 1) {
            timestamp += write_sequence(ofile,bytes[0],timestamp);
        }
        else if (bytesize > 1) {
            seq = (bytes[0] >> 4) & 3;
            if (seq) timestamp += write_sequence(ofile,seq - 1,timestamp);
            interval = bytes[0] & 15;
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];
            }
            timestamp += interval;
            write_seconds(ofile,timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}

由于我正在使用的MinGW / gcc 4.8.1编译器中的long double output bug,我不得不使用此解决方法:(这对其他编译器来说不是必需的)

void write_seconds(ofstream& ofile,uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}

未来读者需要注意：此方法基于对示例数据文件的分析;如果您的数据不同,它将不会给出相同的压缩率.

algorithm – 压缩unix时间戳,精确到微秒

解决方法

相关推荐