
Hudi PySpark Application Example

How to set up a Hudi PySpark example application

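I need help setting up Hudi with PySpark in PyCharm or any other IDE. I want to develop a sample Hudi-PySpark application as a standalone program rather than run it in the PySpark shell.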

Solution

Here is an example of how to set up a Hudi + PySpark application in PyCharm.

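Step 1: Create a project (choose the option New environment using -> Virtualenv).

Step 2: Create a module containing the script below (it begins with the pyspark imports).

Step 3: Install pyspark (pip install pyspark). The script pulls in spark-avro_2.12:3.0.2, so the installed PySpark version should match that Spark release, for example pip install pyspark==3.0.2.

Step 4: Right-click the module and run it.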
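The version pairing mentioned in Step 3 can be kept in one place with a small helper. This is only a sketch: the function names hudi_spark_packages and same_minor_version are hypothetical, and the coordinate pattern simply mirrors the spark.jars.packages string used in the script (other Hudi releases may use differently named bundles).

```python
def hudi_spark_packages(hudi_version, spark_version, scala_version="2.12"):
    """Build a spark.jars.packages value in the same shape the script uses.

    Hypothetical helper: it only reproduces the pattern from the script
    (hudi-spark-bundle plus spark-avro), not every Hudi artifact name.
    """
    return ",".join(
        [
            f"org.apache.hudi:hudi-spark-bundle_{scala_version}:{hudi_version}",
            f"org.apache.spark:spark-avro_{scala_version}:{spark_version}",
        ]
    )


def same_minor_version(version_a, version_b):
    """Return True when two versions share major and minor numbers."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


# Versions taken from the script in this answer.
packages = hudi_spark_packages("0.7.0", "3.0.2")
print(packages)

# Optional sanity check against whatever PySpark is installed, if any.
try:
    import pyspark

    installed = pyspark.__version__
except ImportError:
    installed = None

if installed is not None and not same_minor_version(installed, "3.0.2"):
    print(f"Warning: installed PySpark {installed} does not match spark-avro 3.0.2")
```

The resulting string could be passed as the value of "spark.jars.packages" when building the SparkSession.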

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Hudi needs Kryo serialization; spark.jars.packages makes Spark download
# the Hudi bundle and spark-avro when the session starts.
spark = (
    SparkSession.builder.appName("Hudi_Data_Processing_Framework")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,"
        "org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .getOrCreate()
)

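# Sample rows: (id, creation_date, last_update_time).
# creation_date for rows 101-103 and 105 is assumed to repeat the date of
# the preceding row, so that every tuple fits the 3-column schema.
input_df = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ("id", "creation_date", "last_update_time"),
)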

hudi_options = {
    # ---------------DATA SOURCE WRITE CONFIGS---------------#
    "hoodie.table.name": "hudi_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.consistency.check.enabled": True,
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": 60000,
    "hoodie.index.bloom.fpp": 0.000000001,
    "hoodie.cleaner.commits.retained": 2,
}

# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# READ
output_df = spark.read.format("org.apache.hudi").load(
    "/tmp/hudi_test/*/*"
)

output_df.show()
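
The write in the script does not set hoodie.datasource.write.operation, so it relies on Hudi's default, which to my knowledge is upsert. If you would rather state the operation explicitly, a small helper along these lines may help. It is a sketch only: with_write_operation is a hypothetical name, and the base dictionary is a trimmed copy of the options from the script.

```python
def with_write_operation(base_options, operation):
    """Return a copy of base_options with the Hudi write operation set.

    Hypothetical helper; the option key is Hudi's
    hoodie.datasource.write.operation setting.
    """
    allowed = {"insert", "upsert", "bulk_insert", "delete"}
    if operation not in allowed:
        raise ValueError(f"unsupported write operation: {operation!r}")
    merged = dict(base_options)  # copy so the caller's dict is untouched
    merged["hoodie.datasource.write.operation"] = operation
    return merged


# Trimmed copy of the options used in the script.
base = {
    "hoodie.table.name": "hudi_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
}

upsert_options = with_write_operation(base, "upsert")
print(upsert_options["hoodie.datasource.write.operation"])
```

In the script, the result would be passed as .options(**upsert_options) when writing a DataFrame of changed rows to /tmp/hudi_test with mode("append"). A row whose id already exists should then replace the stored row, with last_update_time (the precombine field) deciding which version wins among duplicates in the incoming batch.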
