PySpark transpose

How to transpose in PySpark

I have data in the following format, with 38 measures, each broken out into one column per month, as shown below.

+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
| Cust_No | Measure1_month1 | Measure1_month2 | .... | Measure1_month72 | Measure2_month_1 | Measure2_month_2 | ...     | Measure2_month72 | ... Measure38_month1 ...
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
|       1 |              10 |              20 | ….   |              500 |               40 |               50 | …       |                  |
|       2 |              20 |              40 | ….   |              800 |               70 |              150 | …       |                  |
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+

I want to achieve the following format using PySpark.

+---------+-------+----------+----------+
| CustNum | Month | Measure1 | Measure2 | ... Measure38
+---------+-------+----------+----------+
|       1 |     1 |       10 |       30 |
|       1 |     2 |       20 |       40 |
|       1 |     3 |       30 |       80 |
|       1 |     4 |       70 |       90 |
|       1 |     5 |       40 |      100 |
|       . |     . |        . |        . |
|       . |     . |        . |        . |
|       1 |    72 |      700 |       50 |
+---------+-------+----------+----------+

And so on for each customer number.

Can you help me?

Thanks

Solution

IIUC, you need a wide-to-long transformation, which can be achieved with stack in PySpark. There are two steps: first generate the clause for the stack operation (this can be done in better ways, but this is the simplest example), then actually apply the stack operation.

I created a sample DataFrame with 5 months of data:

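# Assumes an active SparkSession named spark; row values are illustrative because the original rows were truncated.
df = spark.createDataFrame([(1,10,20,30,40,50,10,20,30,40,50),(2,10,20,30,40,50,10,20,30,40,50)],['cust','Measure1_month1','Measure1_month2','Measure1_month3','Measure1_month4','Measure1_month5','Measure2_month1','Measure2_month2','Measure2_month3','Measure2_month4','Measure2_month5'])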
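
A minimal sketch of the clause-generation and stack steps, assuming every wide column is named exactly <measure>_month<n> (no extra underscore before the month number) and that all measure columns share one data type. The helper names build_stack_expr and wide_to_long are hypothetical, not part of PySpark. stack is Spark SQL's built-in generator function, and selectExpr is the standard DataFrame method.

```python
def build_stack_expr(measures, months):
    """Build a Spark SQL stack() clause that emits one row per month.

    measures: measure-name prefixes, e.g. ["Measure1", "Measure2"]
    months:   month numbers that appear in the column names, e.g. range(1, 6)
    """
    months = list(months)
    parts = []
    for m in months:
        # One output row: the month literal, then one value per measure.
        cols = ", ".join(f"`{name}_month{m}`" for name in measures)
        parts.append(f"{m}, {cols}")
    out_cols = ", ".join(["Month"] + list(measures))
    return f"stack({len(months)}, {', '.join(parts)}) as ({out_cols})"


def wide_to_long(df, id_col, measures, months):
    """Keep id_col and unpivot the month columns into rows via stack()."""
    return df.selectExpr(id_col, build_stack_expr(measures, months))
```

For the data in the question, the call would look something like wide_to_long(df, "Cust_No", [f"Measure{i}" for i in range(1, 39)], range(1, 73)), provided the real column names follow the assumed pattern. If some columns are named like Measure2_month_1, rename them first or adjust the pattern inside build_stack_expr.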
