How to insert a custom function in a for loop in PySpark?
I'm facing a Spark challenge in Azure Databricks. I have a dataset:
+------------------+----------+-------------------+---------------+
| OpptyHeaderID| OpptyID| Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00| 4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00| 4975.000000|
+------------------+----------+-------------------+---------------+
Now I need to append rows to this DataFrame using a loop. I want to replicate the following logic (pseudocode) in PySpark:
Result = ()
for i in (1:12)
{
select a.OpptyHeaderID,a.OpptyID,dateadd(MONTH,i,a.Date) as Date,BaseAmountMonth
from FinalOut
Result = Result.Append()
print(i)
}
The date in each appended row must be one month after the previous row (rolling 12 months). The result should look like this:
+------------------+----------+-------------------+---------------+
| OpptyHeaderID| OpptyID| Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00| 4375.800000|
|0067000000i6ONPAA2|OP-0164615|2014-08-27 00:00:00| 4375.800000|
|0067000000i6ONPAA2|OP-0164615|2014-09-27 00:00:00| 4375.800000|
.
.
.
|0067000000i6ONPAA2|OP-0164615|2015-06-27 00:00:00| 4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00| 4975.000000|
|0065w0000215k5kAAA|OP-0218055|2021-01-23 00:00:00| 4975.000000|
|0065w0000215k5kAAA|OP-0218055|2021-02-23 00:00:00| 4975.000000|
.
.
.
|0065w0000215k5kAAA|OP-0218055|2021-11-23 00:00:00| 4975.000000|
+------------------+----------+-------------------+---------------+
[Edit 1]
How can I make the interval length dynamic, based on another field?
+------------------+----------+-------------------+---------------+--------+
| OpptyHeaderID| OpptyID| Date|BaseAmountMonth|Interval|
+------------------+----------+-------------------+---------------+--------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00| 4375.800000| 12|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00| 4975.000000| 7|
+------------------+----------+-------------------+---------------+--------+
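Solution
You don't need a loop. Build a sequence of timestamps for each row, running from Date to Date plus (Interval - 1) months in one-month steps, and explode it so that every element becomes its own row. The column name Interval is wrapped in backticks because interval is also a SQL keyword.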
import pyspark.sql.functions as F

df2 = df.withColumn(
    'Date',
    F.expr("""
        explode(
            sequence(
                timestamp(Date),
                add_months(timestamp(Date), `Interval` - 1),
                interval 1 month
            )
        )
    """)
)

df2.show(99)
+------------------+----------+-------------------+---------------+--------+
| OpptyHeaderID| OpptyID| Date|BaseAmountMonth|Interval|
+------------------+----------+-------------------+---------------+--------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2014-08-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2014-09-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2014-10-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2014-11-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2014-12-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-01-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-02-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-03-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-04-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-05-27 00:00:00| 4375.800000| 12|
|0067000000i6ONPAA2|OP-0164615|2015-06-27 00:00:00| 4375.800000| 12|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-01-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-02-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-03-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-04-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-05-23 00:00:00| 4975.000000| 7|
|0065w0000215k5kAAA|OP-0218055|2021-06-23 00:00:00| 4975.000000| 7|
+------------------+----------+-------------------+---------------+--------+