How to filter a Hive map column in Spark SQL with a combined key/value AND condition
I have a Hive table with a map-type column named key that stores values as key/value pairs. I need to write a filter condition that combines two of the map's entries.
Sample dataset:
+---------------+--------------+----------------------+
| column_value  | metric_name  | key                  |
+---------------+--------------+----------------------+
| A37B          | Mean         | {0:"202006",1:"1"}   |
| ACCOUNT_ID    | Mean         | {0:"202006",1:"2"}   |
| ANB_200       | Mean         | {0:"202006",1:"3"}   |
| ANB_201       | Mean         | {0:"202006",1:"4"}   |
| AS82_RE       | Mean         | {0:"202006",1:"5"}   |
| ATTR001       | Mean         | {0:"202007",1:"2"}   |
| ATTR001_RE    | Mean         | {0:"202007",1:"3"}   |
| ATTR002       | Mean         | {0:"202007",1:"4"}   |
| ATTR002_RE    | Mean         | {0:"202007",1:"5"}   |
| ATTR003       | Mean         | {0:"202008",1:"3"}   |
| ATTR004       | Mean         | {0:"202008",1:"4"}   |
| ATTR005       | Mean         | {0:"202008",1:"5"}   |
| ATTR006       | Mean         | {0:"202009",1:"4"}   |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+
I need to write a Spark SQL query that filters on the key column with a NOT IN condition in which both map entries must match.
Expected query (the NOT IN part is pseudocode):
select * from table where key[0] between 202006 and 202009 and key NOT IN (0:"202009",1:"5")
Solution
Convert the NOT IN arguments into maps using the map() function:
select * from your_data
where key[0] between 202006 and 202009
and key NOT IN ( map(0,"202009",1,"5") ); --can be many map() comma separated
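Outside Hive, the semantics of that NOT IN comparison can be sketched in plain Python (hypothetical data, not Hive itself): a row is excluded only when its whole map equals one of the listed maps, i.e. both entries match.

```python
# Sketch of NOT IN on maps (plain Python, not Hive): a row is dropped
# only when its entire key map equals one of the maps in the list.
rows = [
    {"column_value": "ATTR006", "key": {0: "202009", 1: "4"}},
    {"column_value": "ATTR006", "key": {0: "202009", 1: "5"}},
]
excluded = [{0: "202009", 1: "5"}]  # the NOT IN list; can hold many maps

kept = [
    r for r in rows
    if 202006 <= int(r["key"][0]) <= 202009 and r["key"] not in excluded
]
print([r["key"] for r in kept])  # only the {0:"202009",1:"4"} row survives
```

The row whose map matches on key 0 but not key 1 is kept, which is exactly the "both keys must match" behaviour the question asks for.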
Another approach: check the code below.
Using Spark Scala
scala> df.show(false)
+------------+-----------+------------------+
|column_value|metric_name|key               |
+------------+-----------+------------------+
|A37B        |Mean       |{0:"202006",1:"1"}|
|ACCOUNT_ID  |Mean       |{0:"202006",1:"2"}|
|ANB_200     |Mean       |{0:"202006",1:"3"}|
|ANB_201     |Mean       |{0:"202006",1:"4"}|
|AS82_RE     |Mean       |{0:"202006",1:"5"}|
|ATTR001     |Mean       |{0:"202007",1:"2"}|
|ATTR001_RE  |Mean       |{0:"202007",1:"3"}|
|ATTR002     |Mean       |{0:"202007",1:"4"}|
|ATTR002_RE  |Mean       |{0:"202007",1:"5"}|
|ATTR003     |Mean       |{0:"202008",1:"3"}|
|ATTR004     |Mean       |{0:"202008",1:"4"}|
|ATTR005     |Mean       |{0:"202008",1:"5"}|
|ATTR006     |Mean       |{0:"202009",1:"4"}|
|ATTR006     |Mean       |{0:"202009",1:"5"}|
+------------+-----------+------------------+
Create a schema that matches the key column values.
scala> import org.apache.spark.sql.types._
scala> val schema = DataType
.fromJson("""{"type":"struct","fields":[{"name":"0","type":"string","nullable":true,"metadata":{}},{"name":"1","type":"string","nullable":true,"metadata":{}}]}""")
.asInstanceOf[StructType]
Printed schema of the key column:
scala> schema.printTreeString
root
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
Apply the schema JSON to the DataFrame's key column.
scala> :paste
// Convert key column values to valid json & then apply schema json.
df
.withColumn("key_new",from_json(
regexp_replace(
regexp_replace(
$"key","0:","\"0\":"
),"1:","\"1\":"
),schema
)
)
.filter(
$"key_new.0".between(202006,202009) &&
!($"key_new.0" === 202009 && $"key_new.1" === 5)
).show(false)
Final output
+------------+-----------+------------------+-----------+
|column_value|metric_name|key               |key_new    |
+------------+-----------+------------------+-----------+
|A37B        |Mean       |{0:"202006",1:"1"}|[202006,1] |
|ACCOUNT_ID  |Mean       |{0:"202006",1:"2"}|[202006,2] |
|ANB_200     |Mean       |{0:"202006",1:"3"}|[202006,3] |
|ANB_201     |Mean       |{0:"202006",1:"4"}|[202006,4] |
|AS82_RE     |Mean       |{0:"202006",1:"5"}|[202006,5] |
|ATTR001     |Mean       |{0:"202007",1:"2"}|[202007,2] |
|ATTR001_RE  |Mean       |{0:"202007",1:"3"}|[202007,3] |
|ATTR002     |Mean       |{0:"202007",1:"4"}|[202007,4] |
|ATTR002_RE  |Mean       |{0:"202007",1:"5"}|[202007,5] |
|ATTR003     |Mean       |{0:"202008",1:"3"}|[202008,3] |
|ATTR004     |Mean       |{0:"202008",1:"4"}|[202008,4] |
|ATTR005     |Mean       |{0:"202008",1:"5"}|[202008,5] |
|ATTR006     |Mean       |{0:"202009",1:"4"}|[202009,4] |
+------------+-----------+------------------+-----------+
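The repair-and-filter idea used above can be sketched in plain Python (an illustration of the logic, not Spark): quote the bare numeric keys so the string parses as JSON, then apply the range and exclusion conditions.

```python
import json
import re

def parse_key(raw: str) -> dict:
    # Quote the bare numeric keys (0: -> "0":), mirroring the two
    # regexp_replace calls in the Spark code, then parse as JSON.
    return json.loads(re.sub(r'(\d+):', r'"\1":', raw))

def keep(raw: str) -> bool:
    k = parse_key(raw)
    in_range = 202006 <= int(k["0"]) <= 202009
    excluded = k["0"] == "202009" and k["1"] == "5"
    return in_range and not excluded

keys = ['{0:"202006",1:"1"}', '{0:"202009",1:"4"}', '{0:"202009",1:"5"}']
# the {0:"202009",1:"5"} value is filtered out, the other two pass
print([k for k in keys if keep(k)])
```

This is the same two-step shape as the Scala version: repair the string into valid JSON first, then filter on the extracted fields.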
Using Spark SQL
scala> spark.sql("select * from data").show(false)
+------------+-----------+------------------+
|column_value|metric_name|key               |
+------------+-----------+------------------+
|A37B        |Mean       |{0:"202006",1:"5"}|
+------------+-----------+------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.sql("""
WITH table_data AS (
  SELECT
    column_value,
    metric_name,
    key,
    get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.0') AS k,
    get_json_object(replace(replace(key,'0:','\"0\":'),'1:','\"1\":'),'$.1') AS v
  FROM data
)
SELECT column_value, metric_name, key, k, v
FROM table_data
WHERE (k BETWEEN 202006 AND 202009) AND
      !(k = 202009 AND v = 5)
""").show(false)
// Exiting paste mode, now interpreting.
+------------+-----------+------------------+------+---+
|column_value|metric_name|key               |k     |v  |
+------------+-----------+------------------+------+---+
|A37B        |Mean       |{0:"202006",1:"1"}|202006|1  |
|ACCOUNT_ID  |Mean       |{0:"202006",1:"2"}|202006|2  |
|ANB_200     |Mean       |{0:"202006",1:"3"}|202006|3  |
|ANB_201     |Mean       |{0:"202006",1:"4"}|202006|4  |
|AS82_RE     |Mean       |{0:"202006",1:"5"}|202006|5  |
|ATTR001     |Mean       |{0:"202007",1:"2"}|202007|2  |
|ATTR001_RE  |Mean       |{0:"202007",1:"3"}|202007|3  |
|ATTR002     |Mean       |{0:"202007",1:"4"}|202007|4  |
|ATTR002_RE  |Mean       |{0:"202007",1:"5"}|202007|5  |
|ATTR003     |Mean       |{0:"202008",1:"3"}|202008|3  |
|ATTR004     |Mean       |{0:"202008",1:"4"}|202008|4  |
|ATTR005     |Mean       |{0:"202008",1:"5"}|202008|5  |
|ATTR006     |Mean       |{0:"202009",1:"4"}|202009|4  |
+------------+-----------+------------------+------+---+
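What the replace()/get_json_object combination does to a single value can again be sketched in plain Python (an illustration, not Spark SQL):

```python
import json

# The two replace() calls turn the bare keys into quoted JSON keys;
# get_json_object('$.0') / ('$.1') then read the two fields.
raw = '{0:"202006",1:"5"}'
fixed = raw.replace('0:', '"0":').replace('1:', '"1":')
doc = json.loads(fixed)
k, v = doc["0"], doc["1"]
print(k, v)  # 202006 5
```

Once k and v exist as ordinary columns, the WHERE clause is a plain range check plus a negated two-part equality, which is why the NOT IN problem disappears.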
scala>