Adding the total of each group as a new row in a DataFrame in PySpark


Following up on my previous question (Here), I am trying to compute and add a total row (the sum of usage) for each brand, parent, and week_num.

Here is a dummy sample:

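from pyspark.sql import functions as f

# Assumes an active SparkSession named `spark`.
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)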

This snippet adds a total row in the channel column for each brand:

# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel",f.lit("Total"))
)

# create a temp variable to sort
totals = totals.withColumn("sort_id",f.lit(2))
df0 = df0.withColumn("sort_id",f.lit(1))

# Union the dataframes, sort so each total follows its rows, then drop the temp column
df1 = (
    df0.unionByName(totals)
    .sort(["week_num", "parent", "brand", "sort_id"])
    .drop("sort_id")
)

df1.show()

Result:

+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c27| c27app| 1100|
|       5|     C|  c27|  Total| 1100|
|       5|     C|  c28| c26app| 1200|
|       5|     C|  c28|  Total| 1200|
+--------+------+-----+-------+-----+

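That works for the channel column. To get something like the following, I simply repeat the first process (groupBy + sum) and then union the results back: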

+--------+------+-----+-------+-----+ 
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|

In two steps:

# add brand total row
df2 = (
    df0.groupBy(["week_num","parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand",f.lit("Total"))
    .withColumn("channel",f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])

# add weeknum total row
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent",f.lit("Total"))
    .withColumn("brand",f.lit(""))
    .withColumn("channel",f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])

Result:

+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     A|Total|       | 7500|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|Total|       |13200|
|       4| Total|     |       |20700|
|       5|     C|Total|       | 4958|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c27|  Total| 1100|
+--------+------+-----+-------+-----+

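First question: is there another, more efficient way to do this without the repetition? Second, if I want the total to always appear at the top of each group, regardless of the alphabetical order of the parent/brand/channel names, how should I sort? Something like this (this is dummy data, but I hope it is clear enough):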

+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2| Total|     |       | 6000|
|       2|     A|Total|       | 6000|
|       2|     A|   A2|  Total| 6000|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       4| Total|     |       |20700|
|       4|     A|Total|       | 7500|
|       4|     B|Total|       |13200|
|       4|     A|   A1|  Total| 5500| 
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   AD|  Total| 2000|
|       4|     A|   AD|  ADapp| 2000|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|  B26| B26app| 5600|

Solution

I think you just need the rollup method.

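from pyspark.sql import functions as F

# `df` is the sample DataFrame from the question (named df0 there).
agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)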

agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
|    null|  null| null|   null|31658| 15|
|       2|  null| null|   null| 6000|  7|
|       2|     A| null|   null| 6000|  3|
|       2|     A|   A2|   null| 6000|  1|
|       2|     A|   A2|   A2TV| 3500|  0|
|       2|     A|   A2|  A2web| 2500|  0|
|       4|  null| null|   null|20700|  7|
|       4|     A| null|   null| 7500|  3|
|       4|     A|   A1|   null| 5500|  1|
|       4|     A|   A1|  A2app| 5500|  0|
|       4|     A|   AD|   null| 2000|  1|
|       4|     A|   AD|  ADapp| 2000|  0|
|       4|     B| null|   null|13200|  3|
|       4|     B|  B25|   null| 7600|  1|
|       4|     B|  B25| B25app| 7600|  0|
|       4|     B|  B26|   null| 5600|  1|
|       4|     B|  B26| B26app| 5600|  0|
|       5|  null| null|   null| 4958|  7|
|       5|     C| null|   null| 4958|  3|
|       5|     C|  c25|   null| 2658|  1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
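
To make the table above easier to read, here is a minimal pure-Python sketch of what the rollup computes. It is not Spark code: `emulate_rollup` and `N_KEYS` are hypothetical names used only for illustration. It sums usage for every prefix of the key columns and tags each result with a grouping ID that has one bit set per column aggregated away, which is how the values 0, 1, 3, 7 and 15 in the lvl column arise.

```python
from collections import defaultdict

# Sample rows from the question: (week_num, parent, brand, channel, usage).
rows = [
    (2, "A", "A2", "A2web", 2500),
    (2, "A", "A2", "A2TV", 3500),
    (4, "A", "A1", "A2app", 5500),
    (4, "A", "AD", "ADapp", 2000),
    (4, "B", "B25", "B25app", 7600),
    (4, "B", "B26", "B26app", 5600),
    (5, "C", "c25", "c25app", 2658),
    (5, "C", "c27", "c27app", 1100),
    (5, "C", "c28", "c26app", 1200),
]
N_KEYS = 4  # week_num, parent, brand, channel


def emulate_rollup(rows):
    """Sum usage for every prefix of the key columns, tagged with a grouping ID."""
    totals = defaultdict(int)
    for *keys, usage in rows:
        for kept in range(N_KEYS, -1, -1):
            # Keep the first `kept` keys; the remaining ones are aggregated away (None).
            key = tuple(keys[:kept]) + (None,) * (N_KEYS - kept)
            # One bit per aggregated column; the rightmost column is the lowest bit.
            grouping_id = (1 << (N_KEYS - kept)) - 1
            totals[(key, grouping_id)] += usage
    return totals


totals = emulate_rollup(rows)
print(totals[((None, None, None, None), 15)])  # 31658 (grand total)
print(totals[((4, "B", None, None), 3)])       # 13200 (week 4, parent B)
print(sorted({gid for _, gid in totals}))      # [0, 1, 3, 7, 15]
```

The printed totals match the corresponding rows of the rollup output, so the lvl column can be read as "how many trailing columns were collapsed into this total".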

The rest is purely cosmetic. Doing it in Spark is probably not a good idea; it is better done in whatever presentation tool you use afterwards.

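from pyspark.sql import Window

# Remap the grouping IDs (0, 1, 3, 7, 15) to consecutive levels 1..5
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))

TOTAL = "Total"
# Label each total row in the first rolled-up column and blank out the remaining nulls
agg_df = (
    agg_df.withColumn(
        "parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
    )
    .withColumn(
        "brand",
        F.when(F.col("lvl") == 3, TOTAL).otherwise(
            F.coalesce(F.col("brand"), F.lit(""))
        ),
    )
    .withColumn(
        "channel",
        F.when(F.col("lvl") == 2, TOTAL).otherwise(
            F.coalesce(F.col("channel"), F.lit(""))
        ),
    )
)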

agg_df.where(F.col("lvl") != 5).orderBy(
    "week_num",F.col("lvl").desc(),"channel"
).drop("lvl").show(500)

+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2| Total|     |       | 6000|
|       2|     A|Total|       | 6000|
|       2|     A|   A2|  Total| 6000|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       4| Total|     |       |20700|
|       4|     A|Total|       | 7500|
|       4|     B|Total|       |13200|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  Total| 2000|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26|  Total| 5600|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B26| B26app| 5600|
|       5| Total|     |       | 4958|
|       5|     C|Total|       | 4958|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c27|  Total| 1100|
|       5|     C|  c28|  Total| 1200|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c27| c27app| 1100|
|       5|     C|  c28| c26app| 1200|
+--------+------+-----+-------+-----+
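
The level numbers 4, 3 and 2 used above, and the ordering of the final table, can be checked with a small pure-Python sketch. This is only an illustration under stated assumptions, not Spark code: the `dense_rank` dictionary and the sample `rows` are hypothetical names built here by hand. It shows how a dense rank turns the grouping IDs into consecutive levels, and how sorting by week, then level descending, then channel puts each total above the rows it summarizes.

```python
# Grouping IDs produced by the four-column rollup, from detail rows to grand total.
grouping_ids = [0, 1, 3, 7, 15]

# Dense rank over ascending values: consecutive ranks starting at 1.
dense_rank = {
    gid: rank for rank, gid in enumerate(sorted(set(grouping_ids)), start=1)
}
print(dense_rank)  # {0: 1, 1: 2, 3: 3, 7: 4, 15: 5}

# A few week-4 rows as (week_num, parent, brand, channel, usage, grouping_id).
rows = [
    (4, "A", "A1", "A2app", 5500, 0),
    (4, "A", "A1", "Total", 5500, 1),
    (4, "A", "Total", "", 7500, 3),
    (4, "Total", "", "", 20700, 7),
]

# Same idea as ordering by week_num, level descending, then channel.
ordered = sorted(rows, key=lambda r: (r[0], -dense_rank[r[5]], r[3]))
for row in ordered:
    print(row[:5])
```

Because the level is the second sort key, names such as "Total" never have to compete alphabetically with parent, brand or channel values.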
