How to fix Impala Last_Value() not giving the expected result
I have a table in Impala that holds time information as Unix time (at a 1 ms frequency) and three variables, as shown below:
ts           Val1       Val2        Val3
1.60669E+12  7541.76    0.55964607  267.1613
1.60669E+12  7543.04    0.5607262   267.27805
1.60669E+12  7543.04    0.5607241   267.22308
1.60669E+12  7543.6797  0.56109643  267.25974
1.60669E+12  7543.6797  0.56107396  267.30624
1.60669E+12  7543.6797  0.56170875  267.2643
I want to resample the data and take the last value in each new time window. For example, when resampling to a 10-second frequency, the output should be the last value of each 10-second window, like this:
ts                   val1_Last  Val2_Last   Val3_Last
2020-11-29 22:30:00  7541.76    0.55964607  267.1613
2020-11-29 22:30:10  7542.3994  0.5613486   267.31238
2020-11-29 22:30:20  7542.3994  0.5601791   267.22842
2020-11-29 22:30:30  7544.32    0.56069416  267.20248
To get this result, I am running the following query:
select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts,Val1 as Val1,Val2 as Val2,Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
I have read on some forums that LAST_VALUE() sometimes causes problems, so I tried to achieve the same thing with FIRST_VALUE and ORDER BY DESC. The query is as follows:
select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1,
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts,val2 as Val2,Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
But in both cases I do not get the expected result. The resampled time ts comes out as expected (10-second windows), but I get random values of Val1, Val2 and Val3 from within the 0-9 s, 10-19 s, ... windows.
The query looks logically fine to me and I cannot find anything wrong with it. Can anyone explain why I am not getting the correct answer with this query?
Thanks!
Solution
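The problem is this line:
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
You are partitioning and ordering by the same column, ts, where ts is already the value rounded down to 10 seconds. So there is effectively no ordering: sorting by a value that is constant throughout the partition leaves the rows in arbitrary order, and the "last" row is whichever one happens to come last. To make this work, keep the original ts and use it for ordering, while partitioning by the rounded value: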
-- ts_10 is the 10-second bucket; the raw millisecond ts is used only for ordering
select distinct ts_10 as ts,
last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,t.*
FROM Sensor_Data.Table t
WHERE unit = 'Unit1' AND
cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
) t
order by ts
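The idea of partitioning by the rounded bucket while ordering by the raw timestamp can be tried on a small scale. The following is a minimal sketch, not Impala itself: it assumes Python's built-in sqlite3 module backed by SQLite 3.25 or newer (needed for window functions), and the table sensor, the column val1_last and the function resample_last are hypothetical names made up for the illustration. The bucket is computed with integer division instead of the timestamp casts used above.

```python
import sqlite3


def resample_last():
    """Return (bucket_start_seconds, last val1) for each 10-second window."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensor (ts INTEGER, val1 REAL)")
    # Millisecond timestamps, deliberately inserted out of order.
    rows = [
        (1606689009999, 3.0),  # latest row of the first window
        (1606689000000, 1.0),
        (1606689019000, 5.0),  # latest row of the second window
        (1606689004000, 2.0),
        (1606689010000, 4.0),
    ]
    conn.executemany("INSERT INTO sensor VALUES (?, ?)", rows)
    query = """
        SELECT DISTINCT
               ts_10,
               last_value(val1) OVER (
                   PARTITION BY ts_10      -- group rows by 10-second bucket
                   ORDER BY ts             -- order by the raw timestamp
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS val1_last
        FROM (
            -- Integer division: milliseconds -> start of 10-second bucket in seconds
            SELECT (ts / 10000) * 10 AS ts_10, ts, val1
            FROM sensor
        ) AS t
        ORDER BY ts_10
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for bucket, last in resample_last():
        print(bucket, last)
```

Each bucket comes back with the value from its latest raw timestamp, regardless of the order in which the rows were inserted.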
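As an aside, the trouble with last_value() is that it behaves unexpectedly when you leave out the window frame (the rows or range part of the window function specification).
When an order by is present, the default frame is range between unbounded preceding and current row, which means that last_value() just picks up the value in the current row (or in one of its peers under the ordering).
first_value(), on the other hand, works fine with the default frame. If you include an explicit frame covering the whole partition, last_value() with ascending order and first_value() with descending order are equivalent.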
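The effect of the default frame can be seen on three rows. This is again only a sketch under the same assumptions as before (Python's sqlite3 with SQLite 3.25 or newer standing in for Impala, which uses the same default frame when an order by is present); the table t and the function compare_frames are hypothetical names for the illustration.

```python
import sqlite3


def compare_frames():
    """Compare last_value/first_value under default and explicit frames."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (ts INTEGER, val INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

    def column(window_expr):
        # Evaluate one window expression and return its values ordered by ts.
        sql = f"SELECT {window_expr} FROM t ORDER BY ts"
        return [row[0] for row in conn.execute(sql)]

    results = {
        # Default frame: the frame ends at the current row.
        "last_default": column("last_value(val) OVER (ORDER BY ts)"),
        # Explicit frame spanning the whole partition.
        "last_explicit": column(
            "last_value(val) OVER (ORDER BY ts "
            "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)"
        ),
        # first_value with descending order works even with the default frame.
        "first_desc_default": column("first_value(val) OVER (ORDER BY ts DESC)"),
    }
    conn.close()
    return results


if __name__ == "__main__":
    for name, values in compare_frames().items():
        print(name, values)
```

With the default frame, last_value() simply echoes each row's own value, while the explicit frame and the descending first_value() both return the value of the latest row for every row.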