
Combining Spark DataFrames with a different number of columns

In this question I asked how to combine PySpark DataFrames that have a different number of columns. The answer given required every DataFrame to have the same set of columns before they could be combined:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder\
    .appName("DynamicFrame")\
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C4", "C5"))

dataframes = [df01, df02, df03]

# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)

# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key; the dataframe is the value
    dfs[new_name] = d
    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted

# Now put it all together with a loop (union)
result = dfs['df0']            # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())  # List of all the dataframe keys in the dictionary
dfs_to_add.remove('df0')       # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()

Is there any way to combine PySpark DataFrames without having to make sure all the DataFrames have the same number of columns? I ask because combining 100 DataFrames takes about 2 days, and the process using the code above times out.

Solution

df = df1.unionByName(df2, allowMissingColumns=True)
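To extend this to a whole list of frames, here is a minimal sketch using functools.reduce, assuming Spark 3.1 or later (allowMissingColumns was added in 3.1); the example DataFrames mirror the ones above. Note that, unlike the lit(0) approach, columns missing from one side are filled with null:

from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("UnionByName")\
    .getOrCreate()

# Example frames with overlapping but unequal column sets
df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C4", "C5"))

dataframes = [df01, df02, df03]

# Fold the whole list into one DataFrame; columns missing from either
# side are added automatically and filled with null (not 0)
result = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes,
)
result.show()

Because unionByName resolves columns by name rather than by position, the frames do not need matching column order either, which removes the manual withColumn/select alignment step entirely.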
