Scala DataFrame — how to print only the rows with the maximum values
I have the following DataFrame df:
+----------+--------+---------+
| ID| text | count|
+----------+--------+---------+
| 3| word| 316|
| 3| work| 385|
| 3| want| 205|
| 3| cat| 251|
| 1| office| 343|
| 1| sky| 643|
| 1| going| 126|
| 2| home| 124|
| 2| school| 23|
| 2| sleep| 103|
//and so on
Now, for each ID, I want to show only the rows with the top 2 counts, and drop/hide the rest:
+----------+--------+---------+
| ID| text | count|
+----------+--------+---------+
| 3| word| 316|
| 3| work| 385|
| 1| office| 343|
| 1| sky| 643|
| 2| home| 124|
| 2| sleep| 103|
//and so on
What is the most efficient way to achieve this?
Solution
Use a window function in Spark: partitionBy on ID and orderBy on count, descending.
Example — check the following code.
val df = Seq((3,"word",316),(3,"work",385),(3,"want",205),(3,"cat",251),(1,"office",343),(1,"sky",643),(1,"going",126),(2,"home",124),(2,"school",23),(2,"sleep",103)).toDF("ID","text","count")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val w = Window.partitionBy(col("ID")).orderBy(desc("count"))
df.withColumn("rn", row_number().over(w)).filter(col("rn") <= 2).drop("rn").show()
//+---+------+-----+
//| ID| text|count|
//+---+------+-----+
//| 1| sky| 643|
//| 1|office| 343|
//| 3| work| 385|
//| 3| word| 316|
//| 2| home| 124|
//| 2| sleep| 103|
//+---+------+-----+
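For intuition, the same top-N-per-group operation can be sketched on plain Scala collections, with no Spark required: groupBy plays the role of partitionBy, and sorting by descending count plays the role of orderBy(desc("count")) plus the row_number filter. This is only an illustration of the logic, not how Spark executes it:

```scala
// Plain-Scala sketch of "top 2 counts per ID" — same logic the window
// function expresses, applied to the question's sample rows.
val rows = Seq(
  (3, "word", 316), (3, "work", 385), (3, "want", 205), (3, "cat", 251),
  (1, "office", 343), (1, "sky", 643), (1, "going", 126),
  (2, "home", 124), (2, "school", 23), (2, "sleep", 103)
)

val top2 = rows
  .groupBy(_._1)                     // like partitionBy("ID")
  .values
  .flatMap(_.sortBy(-_._3).take(2))  // like orderBy(desc("count")), keep rn <= 2
  .toSeq

top2.foreach(println)
```

On a real DataFrame you would still use the window-function version, since it runs distributed and avoids collecting data to the driver.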
scala> df.show(false)
+---+------+-----+
|ID |text |count|
+---+------+-----+
|3 |word |316 |
|3 |work |385 |
|3 |want |205 |
|3 |cat |251 |
|1 |office|343 |
|1 |sky |643 |
|1 |going |126 |
|2 |home |124 |
|2 |school|23 |
|2 |sleep |103 |
+---+------+-----+
scala> import org.apache.spark.sql.expressions._
scala> import org.apache.spark.sql.functions.row_number
scala> val windowSpec = Window.partitionBy($"ID").orderBy($"count".desc)
scala> df.withColumn("rn", row_number().over(windowSpec)).filter($"rn" <= 2).drop("rn").show(false)
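One caveat neither answer covers: row_number breaks ties arbitrarily, so if two texts in the same ID share a count, only one of them survives the rn <= 2 filter; using rank().over(w) instead (also in org.apache.spark.sql.functions) keeps every tied row. The data below is hypothetical, and the ranking is simulated in plain Scala just to show the difference:

```scala
// row_number vs rank on tied counts (plain-Scala illustration, made-up data).
// Three texts tie at count 300; sortBy is stable, so order is a, b, c, d.
val tied = Seq(("a", 300), ("b", 300), ("c", 300), ("d", 100)).sortBy(-_._2)

// row_number semantics: strictly increasing positions 1, 2, 3, 4
val rowNumber = tied.zipWithIndex.map { case ((t, c), i) => (t, c, i + 1) }

// rank semantics: equal counts share a rank; the next rank skips ahead
val rank = tied.map { case (t, c) => (t, c, tied.count(_._2 > c) + 1) }
```

Filtering both results with `<= 2` shows the difference: row_number keeps only two of the three tied rows, while rank keeps all three. Pick the function that matches the tie behaviour you want.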