微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

Impala Last_Value() 未按预期给出结果

如何解决Impala Last_Value() 未按预期给出结果

我在 Impala 中有一个表,其中有 Unix-Time 形式的时间信息(频率为 1 毫秒)和有关三个变量的信息,如下所示:

O(N^4 log(N))

我想重新采样数据并获取新时间窗口的最后一个值。例如,如果我想重新采样为 10 秒频率,则输出应该是 10 秒窗口的最后一个值,如下所示:

ts          Val1        Val2        Val3        
1.60669E+12 7541.76     0.55964607  267.1613        
1.60669E+12 7543.04     0.5607262   267.27805       
1.60669E+12 7543.04     0.5607241   267.22308       
1.60669E+12 7543.6797   0.56109643  267.25974       
1.60669E+12 7543.6797   0.56107396  267.30624       
1.60669E+12 7543.6797   0.56170875  267.2643    

为了得到这个结果,我正在运行以下查询

ts                      val1_Last       Val2_Last       Val3_Last   
2020-11-29 22:30:00     7541.76         0.55964607      267.1613
2020-11-29 22:30:10     7542.3994       0.5613486       267.31238
2020-11-29 22:30:20     7542.3994       0.5601791       267.22842
2020-11-29 22:30:30     7544.32         0.56069416      267.20248

我在一些论坛上读到 select distinct * from ( select ts,last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3 from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts,Val1 as Val1,Val2 as Val2,Val3 as Val3 FROM Sensor_Data.Table where unit='Unit1' and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt order by ts 有时会导致问题,所以我尝试使用 LAST_VALUE()FirsT_VALUE 来达到同样的目的。查询如下:

ORDER BY DESC

在这两种情况下,我都没有得到预期的结果。重采样时间 select distinct * from ( select ts,first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1,first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3 from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts,val2 as Val2,Val3 as Val3 FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09' and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt order by ts 按预期出现(窗口为 10 秒),但我在 0-9 秒、10-19 秒之间获得了 tsVal1Val2随机值,... 窗户。

这个查询在逻辑上看起来不错,我没有发现任何问题。任何人都可以解释为什么我没有使用这个查询得到正确的答案。

谢谢!!!

解决方法

问题是这一行:

last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,

您正在按同一列 ts 进行分区和排序——因此没有排序(或者更具体地说,按在整个分区中保持不变的值排序会导致任意排序)。您需要保留 原始 ts 才能完成这项工作,并将其用于订购:

select ts,last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3 
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,t.*
      FROM Sensor_Data.Table t
      WHERE unit = 'Unit1' AND
            cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
     ) t

顺便说一句,last_value() 的问题在于,当您忽略窗口框架(窗口函数规范的 rowsrange 部分)时,它会出现意外行为。

问题在于默认规范是 range between unbounded preceding and current row,这意味着 last_value() 只是选取当前行中的值。

另一方面,first_value() 在默认框架下工作正常。但是,如果您包含显式框架,则两者是等效的。

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。