如何解决堆放,堆放,融化,旋转,转置?将多列转换为行PySpark或Pandas的简单方法是什么?
我的工作环境主要使用PySpark,但是进行一些谷歌搜索,在PySpark中转置非常复杂。我想将其保留在PySpark中,但是如果在Pandas中进行操作要容易得多,我将把Spark数据框转换为Pandas数据框。我认为,在性能是一个问题的地方,数据集并不是很大。
我想将具有多列的数据框转换为行:
输入:
import pandas as pd
df = pd.DataFrame({'Record': {0: 1,1: 2,2: 3},'Hospital': {0: 'Red Cross',1: 'Alberta Hospital',2: 'General Hospital'},'Hospital Address': {0: '1234 Street 429',1: '553 Alberta Road 441',2: '994 Random Street 923'},'Medicine_1': {0: 'Effective',1: 'Effecive',2: 'normal'},'Medicine_2': {0: 'Effective',1: 'normal',2: 'Effective'},'Medicine_3': {0: 'normal','Medicine_4': {0: 'Effective',1: 'Effective',2: 'Effective'}})
Record Hospital Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4
1 Red Cross 1234 Street 429 Effective Effective normal Effective
2 Alberta Hospital 553 Alberta Road 441 Effecive normal normal Effective
3 General Hospital 994 Random Street 923 normal Effective normal Effective
输出:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
在查看PySpark示例时,情况很复杂:PySpark Dataframe melt columns into rows
再看看熊猫的例子,它看起来要容易得多。但是有很多不同的Stack Overflow答案,有人说使用 枢轴旋转,融化,堆叠,取消堆叠 ,还有更多的结果使人迷惑。
因此,如果有人在PySpark中有简便的方法可以做到这一点,那么我将不胜枚举。如果没有,我会很乐意接受熊猫的回答。
非常感谢您的帮助!
解决方法
这是使用stack
的熊猫
df_final = (df.set_index(['Record','Hospital','Hospital Address'])
.stack(dropna=False)
.rename('Value')
.reset_index()
.rename({'level_3': 'Name'},axis=1)
.assign(Record=lambda x: x.index+1))
Out[120]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
,
您还可以使用.melt
并指定id_vars
。其他所有事项都将被考虑为value_vars
。您拥有的value_vars
列数将数据框中的行数乘以该数,将来自四列的所有列信息堆叠为一列,并将id_var
列复制为所需的列数格式:
数据帧设置:
import pandas as pd
df = pd.DataFrame({'Record': {0: 1,1: 2,2: 3},'Hospital': {0: 'Red Cross',1: 'Alberta Hospital',2: 'General Hospital'},'Hospital Address': {0: '1234 Street 429',1: '553 Alberta Road 441',2: '994 Random Street 923'},'Medicine_1': {0: 'Effective',1: 'Effecive',2: 'Normal'},'Medicine_2': {0: 'Effective',1: 'Normal',2: 'Effective'},'Medicine_3': {0: 'Normal','Medicine_4': {0: 'Effective',1: 'Effective',2: 'Effective'}})
代码:
df = (df.melt(id_vars=['Record','Hospital Address'],var_name='Name',value_name='Value')
.sort_values('Record')
.reset_index(drop=True))
df['Record'] = df.index+1
df
Out[1]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
,
使用stack
使用pyspark也非常简单/容易。
# create sample data
import pandas as pd
from pyspark.sql.functions import expr
panda_df = pd.DataFrame({'Record': {0: 1,2: 'Effective'}})
df = spark.createDataFrame(panda_df)
# calculate
df.select("Hospital","Hospital Address",expr("stack(4,'Medicine_1',Medicine_1,'Medicine_2',Medicine_2,\
'Medicine_3',Medicine_3,'Medicine_4',Medicine_4) as (MedicinName,Effectiveness)")
).where("Effectiveness is not null").show()
在列很多的情况下动态查询生成:
这里的主要思想是动态创建堆栈(x,a,b,c)。我们可以利用python字符串格式来制作动态sring。
index_cols= ["Hospital","Hospital Address"]
drop_cols = ['Record']
# Select all columns which needs to be pivoted down
pivot_cols = [c for c in df.columns if c not in index_cols+drop_cols ]
# Create a dynamic stackexpr in this case we are generating stack(4,'{0}',{0},'{1}',{1}...)
# " '{0}',{1}".format('Medicine1','Medicine2') = "'Medicine1',Medicine1,'Medicine2',Medicine2"
# which is similiar to what we have previously
stackexpr = "stack("+str(len(pivot_cols))+","+",".join(["'{"+str(i)+"}',{"+str(i)+"}" for i in range(len(pivot_cols))]) +")"
df.selectExpr(*index_cols,stackexpr.format(*pivot_cols) ).show()
输出:
+----------------+--------------------+-----------+-------------+
| Hospital| Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
| Red Cross| 1234 Street 429| Medicine_1| Effective|
| Red Cross| 1234 Street 429| Medicine_2| Effective|
| Red Cross| 1234 Street 429| Medicine_3| Normal|
| Red Cross| 1234 Street 429| Medicine_4| Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1| Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4| Effective|
|General Hospital|994 Random Street...| Medicine_1| Normal|
|General Hospital|994 Random Street...| Medicine_2| Effective|
|General Hospital|994 Random Street...| Medicine_3| Normal|
|General Hospital|994 Random Street...| Medicine_4| Effective|
+----------------+--------------------+-----------+-------------+
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。