
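Apache Hudi - 2 - Basic Functions & Feature Practice

Preface

This article tests some of the features described on the official Hudi website. All of the test data was generated directly by the following code.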

from faker import Faker


def fake_data(faker: Faker, row_num: int):
    """Write row_num fake student rows to a CSV file."""
    file_name = f'/Users/gavin/Desktop/tmp/student_{row_num}_rows.csv'
    with open(file=file_name, mode='w', encoding='utf-8') as file:
        # Header row; the column name "adress" is kept as-is because the later tests use it
        file.write("id,name,age,adress,partition_path\n")
        for i in range(row_num):
            # Use the Faker instance passed in, not the module-level global
            file.write(
                f'{faker.iana_id()},{faker.name()},{faker.random_int(min=15, max=25)},{faker.address()},{faker.day_of_week()}\n')


if __name__ == '__main__':
    my_faker = Faker(locale='zh_CN')
    fake_data(my_faker, 100000)

Sample test data:

id,name,age,adress,partition_path
7548525,谭娜,15,黑龙江省广州市白云姚路w座 391301,Sunday
5615440,金亮,19,陕西省巢湖县西峰张街N座 711897,Tuesday
3887721,刘倩,21,贵州省敏县清浦深圳路A座 116469,Thursday

Command for starting pyspark with Hudi loaded:

pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

Exploring Hudi Basics

File Layouts

Copy-on-Write

Besides the parquet files, a Hudi table has a .hoodie folder under the table's root directory that stores the table's metadata. For example:

gavin@GavindeMacBook-Pro hudi_tables % tree -a student_for_pre_validate
student_for_pre_validate
├── .hoodie # Hudi table metadata, including commit info, marker info, etc.
│   ├── .20220317111613163.commit.requested.crc
│   ├── .20220317111613163.inflight.crc
│   ├── .aux
│   │   └── .bootstrap # holds files produced by the bootstrap operation, which converts an existing table into a Hudi table; bootstrap was not run here, so the directories below are empty
│   │       ├── .fileids
│   │       └── .partitions
│   ├── .hoodie.properties.crc
│   ├── .temp
│   │   └── 20220317111613163
│   │       ├── .MARKERS.type.crc
│   │       ├── .MARKERS0.crc
│   │       ├── MARKERS.type
│   │       └── MARKERS0
│   ├── 20220317111613163.commit.requested
│   ├── 20220317111613163.inflight
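│   ├── archived # directory for archived instants. As writes to the table continue, the number of instants on the timeline keeps growing; to ease the load on timeline operations, instants are archived at commit time and removed from the timeline. The instant count here has not yet reached the default of 30, so no archive files have been produced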
│   └── hoodie.properties 
├── Friday # partition data
│   ├── ..hoodie_partition_metadata.crc
│   ├── .65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet.crc
│   ├── .hoodie_partition_metadata
│   └── 65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet
└── Wednesday # partition data
    ├── ..hoodie_partition_metadata.crc
    ├── .4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet.crc
    ├── .hoodie_partition_metadata
    └── 4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet

10 directories, 18 files
gavin@GavindeMacBook-Pro hudi_tables % 
Merge-on-Read

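For reference, see the article "Apache Hudi 从入门到放弃(2) —— MOR表的文件结构分析" (Apache Hudi from Getting Started to Giving Up (2): File Structure Analysis of MOR Tables).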

Information in the commit file

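Conclusions:

  • Every parquet file is assigned a fileId when it is created. The fileId is the prefix of the parquet file name and is also recorded in the commit file. Later modifications produce a file with the same fileId prefix; only the trailing part of the name, which ends with the commit timestamp, changes.
  • For each commit, the commit file records per-fileId statistics such as "numWrites", "numDeletes", "numUpdateWrites" and "numInserts", along with the file size and other basic information.
  • The commit file records the mapping between each fileId and its actual file.
  • The commit file records the table's schema.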
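
The naming rule in the first bullet can be made concrete with a small parser. This is only a minimal sketch: it assumes base files are named fileId_writeToken_instantTime.parquet, which is the pattern visible in the listings in this article, and the names BaseFileName and parse_base_file_name are illustrative helpers, not part of Hudi.

```python
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str       # stable across versions of the same file group
    write_token: str   # middle segment, e.g. "0-29-41"
    instant_time: str  # commit timestamp that produced this version


def parse_base_file_name(name: str) -> BaseFileName:
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into its parts."""
    stem = name.rsplit("/", 1)[-1]  # drop a leading partition path, if any
    if not stem.endswith(".parquet"):
        raise ValueError(f"not a parquet base file: {name}")
    stem = stem[: -len(".parquet")]
    file_id, write_token, instant_time = stem.split("_")
    return BaseFileName(file_id, write_token, instant_time)


parsed = parse_base_file_name(
    "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet"
)
print(parsed)
```

Two files that share file_id but differ in instant_time would then be two versions of the same file group.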

Data walkthrough

vi 20220316171316850.commit:

{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0",
      "path" : "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 461,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 461,
      "totalWriteBytes" : 451097,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 451097,
      "mineventTime" : null,
      "maxEventTime" : null
    },
			···
			···
			···
			{
      "fileId" : "1efa72c3-a714-46e2-bb91-5019fa6e7ede-0",
      "path" : "Saturday/1efa72c3-a714-46e2-bb91-5019fa6e7ede-0_224-53-265_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 210,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 210,
      "totalWriteBytes" : 443162,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 443162,
      "mineventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
    "111e0979-9006-441d-9af2-ac9656be4500-0" : "Sunday/111e0979-9006-441d-9af2-ac9656be4500-0_120-47-161_20220316171316850.parquet",
      ...
      ...
      ...
		"a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0" : "Monday/a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0_33-47-74_20220316171316850.parquet",
    "9fa086af-e28e-4a3f-9a31-06b658ad514b-0" : "Thursday/9fa086af-e28e-4a3f-9a31-06b658ad514b-0_15-41-56_20220316171316850.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 36958,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ]
}

After running an upsert:

vi 20220316171648081.commit

{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "5540e2fd-bc18-42db-a831-f72a6d7eb603-0",
      "path" : "Thursday/5540e2fd-bc18-42db-a831-f72a6d7eb603-0_0-29-492_20220316171648081.parquet",
      "prevCommit" : "20220316171316850",
      "numWrites" : 459,
      "numDeletes" : 0,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 450943,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 450943,
      "mineventTime" : null,
      "maxEventTime" : null
    },
		···
		···
		···
		{
      "fileId" : "d02425d8-0216-4a3b-9810-b613d80cd60f-0",
      "path" : "Saturday/d02425d8-0216-4a3b-9810-b613d80cd60f-0_433-53-925_20220316171648081.parquet",
      "prevCommit" : "null",
      "numWrites" : 84,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 84,
      "totalWriteBytes" : 439040,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 439040,
      "mineventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
    "0d288e1e-f593-4782-95de-0583c4cd286b-0" : "Saturday/0d288e1e-f593-4782-95de-0583c4cd286b-0_415-53-907_20220316171648081.parquet",
    ···
    ···
    ···
    "cc32046a-55b1-4b2b-be93-3225e42154b7-0" : "Saturday/cc32046a-55b1-4b2b-be93-3225e42154b7-0_211-53-703_20220316171648081.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ],
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 31116,
  "totalUpsertTime" : 37426,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  }
}
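
Reading these files by eye gets tedious, so a short script can total the per-partition counters instead. The following is a sketch under stated assumptions: the embedded JSON is an abbreviated stand-in that reuses two entries from the first commit shown above, and summarize_commit is a hypothetical helper name.

```python
import json

# Abbreviated commit metadata; values are copied from two entries of the first commit above.
COMMIT_JSON = """
{
  "partitionToWriteStats": {
    "Thursday": [{"fileId": "9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0",
                  "numWrites": 461, "numInserts": 461,
                  "numUpdateWrites": 0, "numDeletes": 0}],
    "Saturday": [{"fileId": "1efa72c3-a714-46e2-bb91-5019fa6e7ede-0",
                  "numWrites": 210, "numInserts": 210,
                  "numUpdateWrites": 0, "numDeletes": 0}]
  },
  "operationType": "UPSERT"
}
"""

COUNTERS = ("numWrites", "numInserts", "numUpdateWrites", "numDeletes")


def summarize_commit(text):
    """Return {partition: {"files": n, counter: total, ...}} for one commit file."""
    meta = json.loads(text)
    summary = {}
    for partition, stats in meta["partitionToWriteStats"].items():
        totals = {"files": len(stats)}
        for counter in COUNTERS:
            totals[counter] = sum(stat[counter] for stat in stats)
        summary[partition] = totals
    return summary


summary = summarize_commit(COMMIT_JSON)
for partition, totals in sorted(summary.items()):
    print(partition, totals)
```

To run it against a real table, the text could instead be read from a completed commit file under .hoodie, such as 20220316171316850.commit.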

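How data files change on upsert

Conclusion: after an upsert, a new version of the data file is added. The new version contains both the existing data and the new data; the previous version of the file is left unchanged.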
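
The copy-on-write behaviour described in this conclusion can be illustrated with plain dictionaries. This is a toy model, not Hudi code: each "file" is a dict keyed by record key, copy_on_write_merge is a hypothetical helper, and the two records are taken from the Thursday partition of the test data in this section.

```python
def copy_on_write_merge(old_file, incoming):
    """Build the next file version: every old record, overlaid with incoming ones."""
    new_file = dict(old_file)   # copy, so the old version is never mutated
    new_file.update(incoming)   # same key -> replaced, new key -> appended
    return new_file


# Version written by the first commit (one Thursday record)
old_version = {"1759335": {"name": "杨波", "age": "16"}}
# Records arriving in the upsert that land in the same partition
incoming = {"3887721": {"name": "刘倩", "age": "21"}}

new_version = copy_on_write_merge(old_version, incoming)
print(sorted(old_version))  # the old version still holds only its own record
print(sorted(new_version))  # the new version holds old and new records
```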

Test code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)

The existing data is as follows:

gavin@GavindeMacBook-Pro tmp % cat student_5_rows.csv 
id,name,age,adress,partition_path
4169306,邵晶,17,广西壮族自治区楠县花溪张路H座 932045,Saturday
4345298,陈海燕,15,内蒙古自治区郑州市浔阳石家庄街d座 725757,Wednesday
1759335,杨波,16,贵州省上海县平山程街s座 255034,Thursday
3141294,毛秀兰,17,浙江省海燕县东城石家庄街O座 459489,Saturday
2580276,王凤兰,22,宁夏回族自治区兴安盟县永川唐路A座 437666,Wednesday

The data to upsert is as follows:

gavin@GavindeMacBook-Pro tmp % cat student_3_rows.csv
id,name,age,adress,partition_path
7548525,谭娜,15,黑龙江省广州市白云姚路w座 391301,Sunday
5615440,金亮,19,陕西省巢湖县西峰张街N座 711897,Tuesday
3887721,刘倩,21,贵州省敏县清浦深圳路A座 116469,Thursday

After the upsert, new data is added to the "Thursday" partition.

# Before the upsert
gavin@GavindeMacBook-Pro Thursday % ll
total 856
-rw-r--r--  1 gavin  wheel  435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
gavin@GavindeMacBook-Pro Thursday % pwd
/tmp/hudi_base_path/Thursday
# After the upsert, a new parquet file appears in the partition
gavin@GavindeMacBook-Pro Thursday % ll
total 1712
-rw-r--r--  1 gavin  wheel  435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
-rw-r--r--  1 gavin  wheel  435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday % 

Inspecting the actual data in the parquet files

# File from before the upsert
>>> spark.read.parquet('/tmp/hudi_base_path/Thursday/53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     id|name|age|                        adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|  20220316111454901|20220316111454901...|           1759335|              Thursday|53188680-ecdf-4b0...|1759335|杨波| 16|贵州省上海县平山程街s座 255034|      Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
# File generated by the upsert
>>> spark.read.parquet('/tmp/hudi_base_path/Thursday/53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     id|name|age|                        adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|  20220316111454901|20220316111454901...|           1759335|              Thursday|53188680-ecdf-4b0...|1759335|杨波| 16|贵州省上海县平山程街s座 255034|      Thursday|
|  20220316112130171|20220316112130171...|           3887721|              Thursday|53188680-ecdf-4b0...|3887721|刘倩| 21|贵州省敏县清浦深圳路A座 116469|      Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+

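Effect of manually deleting an older version of a data file

Conclusions: 1. Deleting the data file of an older version does not affect the other versions. 2. Querying the version whose data file was deleted does not raise an error, but the result is missing the deleted portion.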

gavin@GavindeMacBook-Pro .hoodie % ll
total 40
-rw-r--r--  1 gavin  wheel  3735 Mar 16 11:14 20220316111454901.commit
-rw-r--r--  1 gavin  wheel     0 Mar 16 11:14 20220316111454901.commit.requested
-rw-r--r--  1 gavin  wheel  2486 Mar 16 11:14 20220316111454901.inflight
-rw-r--r--  1 gavin  wheel  3730 Mar 16 11:21 20220316112130171.commit
-rw-r--r--  1 gavin  wheel     0 Mar 16 11:21 20220316112130171.commit.requested
-rw-r--r--  1 gavin  wheel  2478 Mar 16 11:21 20220316112130171.inflight
drwxr-xr-x  2 gavin  wheel    64 Mar 16 11:14 archived
-rw-r--r--  1 gavin  wheel   593 Mar 16 11:14 hoodie.properties
gavin@GavindeMacBook-Pro .hoodie % 
# Row count as of "Mar 16 11:21"
>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').count()
8        
# Row count as of "Mar 16 11:14"
>>> spark.read.format('hudi').option('as.of.instant','20220316111454901').load('/tmp/hudi_base_path').count()
5
>>> 
gavin@GavindeMacBook-Pro Thursday % ll
total 1712
-rw-r--r--  1 gavin  wheel  435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
-rw-r--r--  1 gavin  wheel  435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday % rm 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet 
# The data file of the older "Mar 16 11:14" version has now been deleted
gavin@GavindeMacBook-Pro Thursday % ll
total 856
-rw-r--r--  1 gavin  wheel  435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday % 
# After deleting the older version, queries against the latest version are unaffected
>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     id|  name|age|                              adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+
|  20220316112130171|20220316112130171...|           7548525|                Sunday|c297b5a9-128f-488...|7548525|  谭娜| 15|    黑龙江省广州市白云姚路w座 391301|        Sunday|
|  20220316111454901|20220316111454901...|           4345298|             Wednesday|d78b45e2-0d97-470...|4345298|陈海燕| 15|内蒙古自治区郑州市浔阳石家庄街d座...|     Wednesday|
|  20220316111454901|20220316111454901...|           2580276|             Wednesday|d78b45e2-0d97-470...|2580276|王凤兰| 22|宁夏回族自治区兴安盟县永川唐路A座...|     Wednesday|
|  20220316111454901|20220316111454901...|           3141294|              Saturday|8868d778-2ffd-461...|3141294|毛秀兰| 17|   浙江省海燕县东城石家庄街O座 45...|      Saturday|
|  20220316111454901|20220316111454901...|           4169306|              Saturday|8868d778-2ffd-461...|4169306|  邵晶| 17|  广西壮族自治区楠县花溪张路H座 9...|      Saturday|
|  20220316112130171|20220316112130171...|           5615440|               Tuesday|13fc6b03-48f1-414...|5615440|  金亮| 19|      陕西省巢湖县西峰张街N座 711897|       Tuesday|
|  20220316111454901|20220316111454901...|           1759335|              Thursday|53188680-ecdf-4b0...|1759335|  杨波| 16|      贵州省上海县平山程街s座 255034|      Thursday|
|  20220316112130171|20220316112130171...|           3887721|              Thursday|53188680-ecdf-4b0...|3887721|  刘倩| 21|      贵州省敏县清浦深圳路A座 116469|      Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+

>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').count()
8
>>> 
# But when querying the version whose data file was deleted, the deleted portion is missing
>>> spark.read.format('hudi').option('as.of.instant','20220316111454901').load('/tmp/hudi_base_path').count()
4
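
One way to read the counts above (5 and 8 before the deletion, 4 and 8 after) is the toy model below. It is a sketch under assumptions rather than a description of Hudi's reader: it supposes that a time-travel query picks, for each file group, the newest file still on disk whose commit time is not later than the requested instant. The file-group labels are hypothetical stand-ins, since the real file ids are truncated in the output above; the row counts per file follow the query results shown.

```python
T1 = "20220316111454901"  # first commit
T2 = "20220316112130171"  # upsert commit

# (file_group, commit_instant, row_count); group labels are hypothetical stand-ins
FILES = [
    ("thursday-fg", T1, 1),
    ("thursday-fg", T2, 2),
    ("wednesday-fg", T1, 2),
    ("saturday-fg", T1, 2),
    ("sunday-fg", T2, 1),
    ("tuesday-fg", T2, 1),
]


def as_of_count(files_on_disk, instant):
    """Rows visible at `instant`: newest surviving file per group with commit <= instant."""
    latest = {}
    for group, commit, rows in files_on_disk:
        # Fixed-width timestamps compare correctly as strings
        if commit <= instant and (group not in latest or commit > latest[group][0]):
            latest[group] = (commit, rows)
    return sum(rows for _, rows in latest.values())


before = (as_of_count(FILES, T1), as_of_count(FILES, T2))
# Simulate removing the Thursday file written by the first commit
remaining = [f for f in FILES if f != ("thursday-fg", T1, 1)]
after = (as_of_count(remaining, T1), as_of_count(remaining, T2))
print(before, after)
```

Under this model the older snapshot simply finds no Thursday file to read, which would explain why the query returns fewer rows instead of failing.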

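Processing performed before insert

Verifying precombine.field

Conclusion: before the data is actually written, if the incoming batch contains records with the same key, Hudi compares their "precombine.field" values and writes the record with the largest value.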
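
The rule in this conclusion amounts to "group by record key, keep the maximum of the precombine field". A minimal pure-Python sketch of that idea follows; precombine is a hypothetical helper, not Hudi's implementation, and the two rows mirror the incremental test data used in this section.

```python
def precombine(records, key_field, precombine_field):
    """Keep, for each key, the record with the largest precombine value."""
    winners = {}
    for rec in records:
        key = rec[key_field]
        current = winners.get(key)
        if current is None or rec[precombine_field] > current[precombine_field]:
            winners[key] = rec
    return list(winners.values())


# Two incoming rows with the same record key; values are strings, as read from CSV
batch = [
    {"id": "5615440", "name": "金亮", "age": "25"},
    {"id": "5615440", "name": "金亮", "age": "27"},
]
deduped = precombine(batch, "id", "age")
print(deduped)
```

One caveat worth keeping in mind: the schema recorded in the commit file lists age as a string, so the comparison is presumably lexicographic. In Python, "9" > "27" is true, and a string-typed precombine field would likely behave the same way, so a numeric or timestamp column is a safer choice.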

Test code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student_mor"
    basePath = "file:///tmp/hudi_test/student_precombine_validate"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)


Relevant parameters

  • hoodie.datasource.write.precombine.field

    Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(…)
    Default Value: ts (Optional)
    Config Param: READ_PRE_COMBINE_FIELD

Test data (existing)

id,name,age,adress,partition_path
7548525,谭娜,15,黑龙江省广州市白云姚路w座 391301,Sunday
5615440,金亮,19,陕西省巢湖县西峰张街N座 711897,Tuesday
3887721,刘倩,21,贵州省敏县清浦深圳路A座 116469,Thursday

Test data (incremental)

id,name,age,adress,partition_path
5615440,金亮,25,陕西省巢湖县西峰张街N座 711897,Tuesday
5615440,金亮,27,陕西省巢湖县西峰张街N座 711897,Tuesday

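Query results

After the incremental data was upserted, the "age" of 金亮 in the table changed from 19 to 27 rather than 25. (This also confirms that the record key is unique, since the update is applied by record key.)

======== Table contains [3] rows in total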
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     id|name|age|                          adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+
|  20220321114056501|20220321114056501...|           3887721|              Thursday|9efb9ef5-3339-4f3...|3887721|刘倩| 21|  贵州省敏县清浦深圳路A座 116469|      Thursday|
|  20220321114056501|20220321114056501...|           7548525|                Sunday|579317c9-c569-457...|7548525|谭娜| 15|黑龙江省广州市白云姚路w座 391301|        Sunday|
|  20220321114824576|20220321114824576...|           5615440|               Tuesday|131332e9-874d-435...|5615440|金亮| 27|  陕西省巢湖县西峰张街N座 711897|       Tuesday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+

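Hudi Features in Practice

Note: all of the experiments below were run on Copy-on-Write tables.

Controlling small-file count and file size during upsert

Conclusion: Hudi tries to keep parquet file sizes between "hoodie.parquet.small.file.limit" and "hoodie.parquet.max.file.size". However, it does not appear to fill files up toward the maximum size; it looks more like it gives priority to getting files past the small-file limit.

Relevant parameters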

  • hoodie.parquet.max.file.size

    Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
    Default Value: 125829120 (Optional)
    Config Param: PARQUET_MAX_FILE_SIZE

  • hoodie.parquet.small.file.limit

    During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file.
    Default Value: 104857600 (Optional)
    Config Param: PARQUET_SMALL_FILE_LIMIT
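
The interplay of the two parameters can be sketched as a simple packing routine. This is a rough model of what the parameter descriptions suggest, not Hudi's actual partitioner: plan_inserts, the fixed avg_record_size, and the strict "size < limit" test are all assumptions made for illustration.

```python
def plan_inserts(existing_sizes, num_inserts, avg_record_size,
                 small_file_limit, max_file_size):
    """Return (appended, new_files).

    appended:  {file_id: records routed into that existing small file}
    new_files: list of record counts, one entry per newly created file
    """
    appended = {}
    remaining = num_inserts
    for file_id, size in existing_sizes.items():
        if remaining <= 0:
            break
        if size < small_file_limit:  # candidate small file
            capacity = (max_file_size - size) // avg_record_size
            take = min(capacity, remaining)
            if take > 0:
                appended[file_id] = take
                remaining -= take
    # Whatever is left goes into new files sized up to max_file_size
    per_new_file = max(max_file_size // avg_record_size, 1)
    new_files = []
    while remaining > 0:
        take = min(per_new_file, remaining)
        new_files.append(take)
        remaining -= take
    return appended, new_files


appended, new_files = plan_inserts(
    existing_sizes={"a": 500_000, "b": 2_000_000},  # bytes on disk (made-up sizes)
    num_inserts=20_000,
    avg_record_size=1_000,                          # assumed bytes per record
    small_file_limit=1024 * 1024 * 1,
    max_file_size=1024 * 1024 * 13,
)
print(appended, new_files)
```

In this model only file "a" is below the 1 MB limit, so it absorbs inserts first and file "b" is left alone; the overflow opens a new file.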

Code

import os

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_30000_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    # csv_df.show()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        # 'hoodie.datasource.write.operation': 'insert',  # defaults to upsert when not set
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.parquet.max.file.size': 1024*1024*13,
        'hoodie.parquet.small.file.limit': 1024 *1024*1
    }

    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)

Test results

gavin@GavindeMacBook-Pro hudi_base_path % du -sh ./*
872K	./Friday
872K	./Monday
868K	./Saturday
872K	./Sunday
868K	./Thursday
868K	./Tuesday
876K	./Wednesday
gavin@GavindeMacBook-Pro hudi_base_path % cd Tuesday 
gavin@GavindeMacBook-Pro Tuesday % ll
total 1704
-rw-r--r--  1 gavin  wheel  872301 Mar 16 13:35 4af17600-ed3d-4765-9d7a-0fd87ef19afc-0_5-41-46_20220316133513897.parquet
gavin@GavindeMacBook-Pro Tuesday % du -sh ./*
852K	./4af17600-ed3d-4765-9d7a-0fd87ef19afc-0_5-41-46_20220316133513897.parquet
gavin@GavindeMacBook-Pro Tuesday % 
# After deleting all of the existing data and re-loading 100,000 rows, with the size range set to "1024*1024*1 ~ 1024*1024*13":
gavin@GavindeMacBook-Pro Tuesday % ll
total 2568
-rw-r--r--  1 gavin  wheel  845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel  465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
# upsert 50,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 5544
-rw-r--r--  1 gavin  wheel   465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel  1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
# upsert 30,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 8792
-rw-r--r--  1 gavin  wheel   465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel  1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel   603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r--  1 gavin  wheel  1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
gavin@GavindeMacBook-Pro Tuesday %
# upsert 300,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt    
total 14504
-rw-r--r--  1 gavin  wheel   465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel  1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel   603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r--  1 gavin  wheel  1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
-rw-r--r--  1 gavin  wheel  1055760 Mar 16 14:16 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-77_20220316141641791.parquet
-rw-r--r--  1 gavin  wheel  1864068 Mar 16 14:16 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-76_20220316141641791.parquet
gavin@GavindeMacBook-Pro Tuesday % 
# upsert 3,000,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 47680
-rw-r--r--  1 gavin  wheel   465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r--  1 gavin  wheel   465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel  1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r--  1 gavin  wheel   603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r--  1 gavin  wheel  1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
-rw-r--r--  1 gavin  wheel  1055760 Mar 16 14:16 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-77_20220316141641791.parquet
-rw-r--r--  1 gavin  wheel  1864068 Mar 16 14:16 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-76_20220316141641791.parquet
-rw-r--r--  1 gavin  wheel  1059910 Mar 16 14:35 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-89_20220316143328401.parquet
-rw-r--r--  1 gavin  wheel  1873786 Mar 16 14:35 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-88_20220316143328401.parquet
-rw-r--r--  1 gavin  wheel  3595929 Mar 16 14:36 33eb95a1-db3f-425a-ac55-8dba4555b911-0_23-41-101_20220316143328401.parquet
-rw-r--r--  1 gavin  wheel  9385079 Mar 16 14:36 8762d8e1-1868-4c60-9515-a9df1214f328-0_22-41-100_20220316143328401.parquet

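Testing the Clustering feature

Conclusion: with inline clustering enabled, once "hoodie.clustering.inline.max.commits" is set, clustering is triggered when the number of commits reaches that value.

Relevant parameters

  • hoodie.clustering.inline

    Turn on inline clustering - clustering will be run after each write operation is complete
    Default Value: false (Optional)
    Config Param: INLINE_CLUSTERING
    Since Version: 0.7.0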
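
The trigger condition stated in the conclusion can be modelled in a few lines. This is a simplified sketch, not Hudi's scheduling logic: the timeline is reduced to a list of action names, should_schedule_clustering is a hypothetical helper, and it assumes the commit counter restarts after each replacecommit.

```python
def should_schedule_clustering(timeline, max_commits):
    """True once `max_commits` commits have landed since the last replacecommit."""
    commits_since = 0
    for action in reversed(timeline):  # walk back from the newest instant
        if action == "replacecommit":
            break
        if action == "commit":
            commits_since += 1
    return commits_since >= max_commits


# With hoodie.clustering.inline.max.commits = 3, as in the test code of this section
print(should_schedule_clustering(["commit", "commit"], 3))
print(should_schedule_clustering(["commit", "commit", "commit"], 3))
print(should_schedule_clustering(["commit"] * 3 + ["replacecommit", "commit"], 3))
```

This matches the shape of the timeline listing later in this section, where three commits are followed by one replacecommit.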

Test code

import os

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_100000_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')

    hudi_options_for_clusering = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.clustering.inline': 'true',
        'hoodie.clustering.inline.max.commits': 3,
        'hoodie.clustering.plan.strategy.target.file.max.bytes':1024*1024*10, # 10M
        'hoodie.clustering.plan.strategy.small.file.limit':1024*500*1, # 500K
        'hoodie.clustering.plan.strategy.sort.columns':'id',
        'hoodie.parquet.max.file.size': 1024*450*1, # 450K
        'hoodie.parquet.small.file.limit': 0
    }
    csv_df.write.format("hudi"). \
        options(**hudi_options_for_clusering). \
        mode("append"). \
        save(basePath)

Commit files

gavin@GavindeMacBook-Pro /tmp % ll hudi_base_path/.hoodie
total 2640
-rw-r--r--  1 gavin  wheel  200993 Mar 16 17:13 20220316171316850.commit
-rw-r--r--  1 gavin  wheel       0 Mar 16 17:13 20220316171316850.commit.requested
-rw-r--r--  1 gavin  wheel    5100 Mar 16 17:13 20220316171316850.inflight
-rw-r--r--  1 gavin  wheel  308305 Mar 16 17:15 20220316171506014.commit
-rw-r--r--  1 gavin  wheel       0 Mar 16 17:15 20220316171506014.commit.requested
-rw-r--r--  1 gavin  wheel   92341 Mar 16 17:15 20220316171506014.inflight
-rw-r--r--  1 gavin  wheel  389328 Mar 16 17:17 20220316171648081.commit
-rw-r--r--  1 gavin  wheel       0 Mar 16 17:16 20220316171648081.commit.requested
-rw-r--r--  1 gavin  wheel  157952 Mar 16 17:17 20220316171648081.inflight
-rw-r--r--  1 gavin  wheel   60710 Mar 16 17:18 20220316171751882.replacecommit
-rw-r--r--  1 gavin  wheel       0 Mar 16 17:17 20220316171751882.replacecommit.inflight
-rw-r--r--  1 gavin  wheel  114687 Mar 16 17:17 20220316171751882.replacecommit.requested
drwxr-xr-x  2 gavin  wheel      64 Mar 16 17:13 archived
-rw-r--r--  1 gavin  wheel     593 Mar 16 17:13 hoodie.properties
gavin@GavindeMacBook-Pro /tmp % 
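
The file names in this listing encode the timeline: a timestamp, an action, and a state. The sketch below parses them under the naming pattern visible above, where a bare ".inflight" suffix belongs to a commit and a missing state suffix means the instant is completed. parse_instant is an illustrative helper, not a Hudi API.

```python
def parse_instant(file_name):
    """Return (timestamp, action, state), or None for non-instant entries."""
    timestamp, *rest = file_name.split(".")
    if not timestamp.isdigit() or not rest:
        return None  # e.g. "archived" or "hoodie.properties"
    if rest == ["inflight"]:
        return timestamp, "commit", "inflight"
    action = rest[0]
    state = rest[1] if len(rest) > 1 else "completed"
    return timestamp, action, state


listing = [
    "20220316171316850.commit",
    "20220316171316850.commit.requested",
    "20220316171316850.inflight",
    "20220316171751882.replacecommit",
    "20220316171751882.replacecommit.inflight",
    "20220316171751882.replacecommit.requested",
    "archived",
    "hoodie.properties",
]

instants = [p for p in map(parse_instant, listing) if p is not None]
completed = [(ts, action) for ts, action, state in instants if state == "completed"]
print(completed)
```

Filtering on the completed state gives the sequence of finished actions, which makes the replacecommit produced by clustering easy to spot.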

For each of the first 3 commits, the generated parquet files are between 400K and 500K in size ("hoodie.parquet.max.file.size" is set to 450K).

gavin@GavindeMacBook-Pro Friday % ll -rt
total 139600
# First commit
-rw-r--r--  1 gavin  wheel  449713 Mar 16 17:13 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_68-47-109_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451069 Mar 16 17:13 6ffa4af8-cf3f-41e5-9225-3f82cabd3416-0_70-47-111_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449926 Mar 16 17:13 e65fbef1-5499-41cc-b956-236f1f070e4d-0_65-47-106_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450787 Mar 16 17:13 5609e5fd-8eef-41ef-8fc4-3c6c711a6199-0_67-47-108_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449952 Mar 16 17:13 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_69-47-110_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452176 Mar 16 17:13 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_64-47-105_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450516 Mar 16 17:13 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_66-47-107_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451607 Mar 16 17:13 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_71-47-112_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451974 Mar 16 17:13 c500a137-757a-4355-90fb-0a38e17b215c-0_72-47-113_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450860 Mar 16 17:13 070bf517-195c-48a0-b0f1-423a4a482592-0_74-47-115_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450590 Mar 16 17:13 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_73-47-114_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452988 Mar 16 17:13 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_75-47-116_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450829 Mar 16 17:13 20f468b9-afe3-49d8-905b-712c9f9fd441-0_77-47-118_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451479 Mar 16 17:13 4a173caf-18dc-420c-9d61-4dc6ab366845-0_76-47-117_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449907 Mar 16 17:13 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_81-47-122_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451221 Mar 16 17:13 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_84-47-125_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450260 Mar 16 17:13 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_83-47-124_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451619 Mar 16 17:13 034163f0-823c-42f0-b109-6282d7dab628-0_79-47-120_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450229 Mar 16 17:13 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_82-47-123_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451089 Mar 16 17:13 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_80-47-121_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450510 Mar 16 17:13 3fc10e56-cf07-447f-a209-22f5e92b4351-0_78-47-119_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450287 Mar 16 17:13 2d6b1e2b-6336-4be5-be19-cc272c3fa62c-0_89-47-130_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450477 Mar 16 17:13 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_87-47-128_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451361 Mar 16 17:13 48f19ef7-7062-4368-9def-9b25d1578ac0-0_85-47-126_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449939 Mar 16 17:13 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_88-47-129_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450747 Mar 16 17:13 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_86-47-127_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452180 Mar 16 17:13 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_94-47-135_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452128 Mar 16 17:13 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_93-47-134_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450150 Mar 16 17:13 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_92-47-133_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451382 Mar 16 17:13 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_91-47-132_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  446216 Mar 16 17:13 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_95-47-136_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451804 Mar 16 17:13 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_90-47-131_20220316171316850.parquet
# Second commit
-rw-r--r--  1 gavin  wheel  450421 Mar 16 17:15 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_35-47-304_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449626 Mar 16 17:15 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_36-47-305_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451791 Mar 16 17:15 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_38-47-307_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450152 Mar 16 17:15 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_37-47-306_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449382 Mar 16 17:15 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_44-47-313_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451295 Mar 16 17:15 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_40-47-309_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451052 Mar 16 17:15 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_48-47-317_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450520 Mar 16 17:15 070bf517-195c-48a0-b0f1-423a4a482592-0_47-47-316_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449577 Mar 16 17:15 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_39-47-308_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450254 Mar 16 17:15 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_43-47-312_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450888 Mar 16 17:15 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_45-47-314_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451649 Mar 16 17:15 c500a137-757a-4355-90fb-0a38e17b215c-0_46-47-315_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449829 Mar 16 17:15 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_42-47-311_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451847 Mar 16 17:15 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_41-47-310_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450199 Mar 16 17:15 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_50-47-319_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451284 Mar 16 17:15 034163f0-823c-42f0-b109-6282d7dab628-0_49-47-318_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451149 Mar 16 17:15 4a173caf-18dc-420c-9d61-4dc6ab366845-0_55-47-324_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449894 Mar 16 17:15 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_57-47-326_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450497 Mar 16 17:15 20f468b9-afe3-49d8-905b-712c9f9fd441-0_54-47-323_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449934 Mar 16 17:15 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_52-47-321_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  452662 Mar 16 17:15 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_56-47-325_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449616 Mar 16 17:15 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_51-47-320_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451849 Mar 16 17:15 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_53-47-322_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451479 Mar 16 17:15 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_58-47-327_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451489 Mar 16 17:15 2aeac070-d67e-4ca3-a186-b5d9c383876e-0_184-53-453_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449996 Mar 16 17:15 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_186-53-455_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450542 Mar 16 17:15 121be9f5-0774-426b-b061-f93817e8568e-0_185-53-454_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450542 Mar 16 17:15 8b9f11a8-b108-462c-b45a-3cef7766d61d-0_187-53-456_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451778 Mar 16 17:15 b7f8ab04-fee2-4455-88a3-53c44a1a8299-0_188-53-457_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450995 Mar 16 17:15 1144170e-b154-4d85-8eed-866393cf2ed4-0_189-53-458_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451001 Mar 16 17:15 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_191-53-460_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451192 Mar 16 17:15 95d724df-ec57-42c8-9de3-9c0f3b0888b3-0_196-53-465_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451287 Mar 16 17:15 20dc678a-c2fe-4156-bb61-f04cb269f248-0_192-53-461_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  452000 Mar 16 17:15 6d5fbff6-ff69-4a3a-9534-407b19154730-0_194-53-463_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450895 Mar 16 17:15 2bc61fdd-e343-4f9b-babd-161478d227a8-0_193-53-462_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450804 Mar 16 17:15 d07f11e8-b78e-4643-aef9-86903d89866d-0_198-53-467_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450956 Mar 16 17:15 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_195-53-464_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451451 Mar 16 17:15 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_197-53-466_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450494 Mar 16 17:15 9e70261a-ecf2-4706-a8cd-861e3f02786c-0_190-53-459_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450380 Mar 16 17:15 99b51786-5ec4-4c96-8892-47accd2882db-0_199-53-468_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451301 Mar 16 17:15 09bd1d0d-2d12-4ce3-abd0-50d0c8687e2b-0_200-53-469_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451434 Mar 16 17:15 148bc858-2bec-44fe-891f-60f6165dc17e-0_201-53-470_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450780 Mar 16 17:15 db3e9783-4674-4d73-8fe9-abcd47f19218-0_202-53-471_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451296 Mar 16 17:15 88d248a8-8f77-4ede-8d78-ef953afb8fc2-0_211-53-480_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450226 Mar 16 17:15 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_203-53-472_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450595 Mar 16 17:15 59028aa4-a91f-4c82-9d34-d25fee9af494-0_206-53-475_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451978 Mar 16 17:15 20089b65-91f7-43d5-b7d6-d54029ed92db-0_205-53-474_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449974 Mar 16 17:15 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_212-53-481_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451370 Mar 16 17:15 b44f53a6-e46f-4ae3-9a57-91c8b9cf3692-0_209-53-478_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450341 Mar 16 17:15 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_204-53-473_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451524 Mar 16 17:15 9ecda712-650f-497a-ae46-1f81462342ee-0_208-53-477_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451285 Mar 16 17:15 261b1fda-52df-466f-8858-2d167b7d8216-0_210-53-479_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450942 Mar 16 17:15 7da056d3-9c71-4731-89f4-cf6bb37d4a5b-0_207-53-476_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451893 Mar 16 17:15 db51b6eb-6107-4121-9328-eb78d950aaf5-0_213-53-482_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451526 Mar 16 17:15 bf3784e7-2a00-4cb3-a9d1-9c49fe59b91d-0_214-53-483_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  442060 Mar 16 17:15 c1eef0f6-fb4a-4fb8-86b1-ad5836668fac-0_215-53-484_20220316171506014.parquet
# Third commit
-rw-r--r--  1 gavin  wheel  451663 Mar 16 17:17 6d5fbff6-ff69-4a3a-9534-407b19154730-0_56-47-548_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450951 Mar 16 17:17 261b1fda-52df-466f-8858-2d167b7d8216-0_57-47-549_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450197 Mar 16 17:17 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_55-47-547_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450950 Mar 16 17:17 20dc678a-c2fe-4156-bb61-f04cb269f248-0_59-47-551_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449619 Mar 16 17:17 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_58-47-550_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449429 Mar 16 17:17 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_60-47-552_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451134 Mar 16 17:17 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_61-47-553_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449658 Mar 16 17:17 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_62-47-554_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450563 Mar 16 17:17 070bf517-195c-48a0-b0f1-423a4a482592-0_63-47-555_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451027 Mar 16 17:17 48f19ef7-7062-4368-9def-9b25d1578ac0-0_64-47-556_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450568 Mar 16 17:17 2bc61fdd-e343-4f9b-babd-161478d227a8-0_65-47-557_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449592 Mar 16 17:17 e65fbef1-5499-41cc-b956-236f1f070e4d-0_68-47-560_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449664 Mar 16 17:17 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_66-47-558_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451183 Mar 16 17:17 4a173caf-18dc-420c-9d61-4dc6ab366845-0_67-47-559_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  445898 Mar 16 17:17 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_69-47-561_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449662 Mar 16 17:17 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_70-47-562_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451641 Mar 16 17:17 20089b65-91f7-43d5-b7d6-d54029ed92db-0_71-47-563_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451900 Mar 16 17:17 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_73-47-565_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450669 Mar 16 17:17 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_72-47-564_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449892 Mar 16 17:17 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_74-47-566_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450468 Mar 16 17:17 d07f11e8-b78e-4643-aef9-86903d89866d-0_76-47-568_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451699 Mar 16 17:17 c500a137-757a-4355-90fb-0a38e17b215c-0_75-47-567_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450769 Mar 16 17:17 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_78-47-570_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451093 Mar 16 17:17 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_77-47-569_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451117 Mar 16 17:17 148bc858-2bec-44fe-891f-60f6165dc17e-0_79-47-571_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449980 Mar 16 17:17 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_80-47-572_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450023 Mar 16 17:17 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_82-47-574_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450684 Mar 16 17:17 1144170e-b154-4d85-8eed-866393cf2ed4-0_81-47-573_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450622 Mar 16 17:17 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_83-47-575_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450188 Mar 16 17:17 3fc10e56-cf07-447f-a209-22f5e92b4351-0_85-47-577_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452714 Mar 16 17:17 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_84-47-576_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451258 Mar 16 17:17 60d615c4-0355-44aa-8692-93dcac902bad-0_278-53-770_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451940 Mar 16 17:17 7cb0f09b-e54d-4c78-9306-53e044676a94-0_277-53-769_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450629 Mar 16 17:17 e923dad4-a72c-49a0-8885-0982008ceccf-0_276-53-768_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451596 Mar 16 17:17 d3e720e3-88fa-446b-a522-d44cf7497674-0_281-53-773_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450251 Mar 16 17:17 6f5d416f-e8ff-456c-a0fc-75c7a3a32308-0_279-53-771_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451586 Mar 16 17:17 64d0a424-7744-4423-b86d-3fce04a5046b-0_283-53-775_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450225 Mar 16 17:17 29ae854e-45a1-4222-98c1-2d0acc1c8884-0_282-53-774_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451972 Mar 16 17:17 06f51fc0-1078-4d3a-ae1f-67684917eb1b-0_280-53-772_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451680 Mar 16 17:17 e095fe81-189b-4f5d-8395-7239667ad2d8-0_286-53-778_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451413 Mar 16 17:17 7261ddee-42c5-4abc-8c4f-8f226072b826-0_285-53-777_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451844 Mar 16 17:17 0bcbbdc8-1d2e-4526-88f0-11d17cfff835-0_284-53-776_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451977 Mar 16 17:17 7eddebc9-6828-4053-ad7d-0831daa000ae-0_289-53-781_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451308 Mar 16 17:17 65982416-73c0-41b0-972c-4c4355f3b235-0_290-53-782_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451078 Mar 16 17:17 e62f224c-8d1f-4349-ad3f-81886fba230d-0_288-53-780_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450278 Mar 16 17:17 53764539-435e-4f6f-a7e4-d48fe7966389-0_287-53-779_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451599 Mar 16 17:17 139b4d5c-ed80-42ff-8f97-671f13390edb-0_291-53-783_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452320 Mar 16 17:17 98d687a0-c93d-4326-b4a0-6f2540bd9aeb-0_292-53-784_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450597 Mar 16 17:17 285cb019-5481-4556-8a96-bc8248028778-0_295-53-787_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451878 Mar 16 17:17 1a7a27ab-8670-4b9c-bfde-3a1dba0669d8-0_293-53-785_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449657 Mar 16 17:17 fecc9a85-371d-493c-83f2-35b5849ee0cb-0_294-53-786_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450069 Mar 16 17:17 c3cc410b-13c3-4f00-ac53-fec9a8f307a1-0_297-53-789_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451344 Mar 16 17:17 ebf3f3b4-6e1a-4bb7-be85-0d2ee31126e5-0_298-53-790_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452456 Mar 16 17:17 a77ec88d-1ccb-40a5-b8a8-eaf214cfa6a9-0_296-53-788_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450826 Mar 16 17:17 574b5510-586a-49a5-a2ac-b75ebe90b87d-0_301-53-793_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450752 Mar 16 17:17 bc2d6241-343c-4c47-9125-2f63b269117e-0_300-53-792_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450909 Mar 16 17:17 4b4abae6-0dfd-4761-871f-2224ddccb1ec-0_303-53-795_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450745 Mar 16 17:17 e7068c6a-7a2d-43ae-b030-59bebd68f36b-0_304-53-796_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451571 Mar 16 17:17 5bdc45fa-88d9-47b4-9597-e717ae7dbc48-0_302-53-794_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450517 Mar 16 17:17 41f629d8-189b-4867-af6f-bb91effe9f74-0_299-53-791_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450574 Mar 16 17:17 16c7513f-7f79-48d5-84ff-e784f2d1e795-0_305-53-797_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450647 Mar 16 17:17 40b4c081-baab-430c-a9e3-e93d0527c923-0_306-53-798_20220316171648081.parquet
# The clustering run that immediately follows the third commit; it corresponds to the "replacecommit" on the timeline under .hoodie
-rw-r--r--  1 gavin  wheel  654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
gavin@GavindeMacBook-Pro Friday % 
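
Each parquet name in the listing above follows the pattern fileId_writeToken_instantTime.parquet, so the same file ID shows up once per commit that rewrote it. The sketch below groups a handful of entries copied from the listing by file ID. The helper `parse_base_file` is hypothetical and assumes the name contains exactly two underscores, as every name in the listing does.

```python
from collections import defaultdict


def parse_base_file(file_name: str):
    """Split a base file name into (file_id, write_token, instant_time)."""
    stem = file_name.rsplit(".", 1)[0]  # drop the .parquet extension
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time


# (size in bytes, file name) pairs copied from the partition listing.
listing = [
    (449713, "aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_68-47-109_20220316171316850.parquet"),
    (449382, "aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_44-47-313_20220316171506014.parquet"),
    (449429, "aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_60-47-552_20220316171648081.parquet"),
    (654082, "ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet"),
]

versions = defaultdict(list)
for size, name in listing:
    file_id, _write_token, instant_time = parse_base_file(name)
    versions[file_id].append((instant_time, size))

for file_id, file_versions in versions.items():
    print(file_id, sorted(file_versions))
```

Here the file ID starting with aeeab49f has one version per commit, while the file ID starting with ee1bf9af exists only at the replacecommit instant written by clustering.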

Cleaning data

**Conclusion:** by default, cleaning runs automatically after every upsert.

Parameters involved

hoodie.clean.automatic
When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded.
Default Value: true (Optional)
Config Param: AUTO_CLEAN

hoodie.cleaner.policy
Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. By default, the cleaner spares the file slices written by the last N commits, determined by hoodie.cleaner.commits.retained. Long-running query plans may often refer to older file slices and will break if those are cleaned before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time.
Default Value: KEEP_LATEST_COMMITS (Optional)
Config Param: CLEANER_POLICY

hoodie.cleaner.commits.retained
Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.
Default Value: 10 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
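
As a small sketch of how the three settings above could be spelled out explicitly in a PySpark options dict: the key names `hoodie.clean.automatic` and `hoodie.cleaner.policy` are my reading of the Hudi configuration reference for AUTO_CLEAN and CLEANER_POLICY, so treat them as assumptions and check them against the docs for your version. The values are simply the documented defaults, and the dict names are hypothetical.

```python
# Cleaner settings with their documented defaults (key names assumed, see above).
cleaner_defaults = {
    'hoodie.clean.automatic': 'true',               # AUTO_CLEAN
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',  # CLEANER_POLICY
    'hoodie.cleaner.commits.retained': 10,           # CLEANER_COMMITS_RETAINED
}

# Override only what the test needs; everything else keeps its default.
cleaner_options_for_test = {**cleaner_defaults, 'hoodie.cleaner.commits.retained': 1}

for key, value in cleaner_options_for_test.items():
    print(f'{key} = {value}')
```

A dict like this can be merged into the write options passed to `options(**...)`, so that the cleaning behaviour under test is visible in the code rather than left to defaults.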

Test code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_100000_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.cleaner.commits.retained': 1  # for the test, retain only one historical version
    }

    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)

Test results

gavin@GavindeMacBook-Pro Friday % ll -rt # before running the code
total 139600
-rw-r--r--  1 gavin  wheel  449713 Mar 16 17:13 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_68-47-109_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451069 Mar 16 17:13 6ffa4af8-cf3f-41e5-9225-3f82cabd3416-0_70-47-111_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449926 Mar 16 17:13 e65fbef1-5499-41cc-b956-236f1f070e4d-0_65-47-106_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450787 Mar 16 17:13 5609e5fd-8eef-41ef-8fc4-3c6c711a6199-0_67-47-108_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449952 Mar 16 17:13 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_69-47-110_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452176 Mar 16 17:13 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_64-47-105_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450516 Mar 16 17:13 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_66-47-107_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451607 Mar 16 17:13 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_71-47-112_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451974 Mar 16 17:13 c500a137-757a-4355-90fb-0a38e17b215c-0_72-47-113_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450860 Mar 16 17:13 070bf517-195c-48a0-b0f1-423a4a482592-0_74-47-115_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450590 Mar 16 17:13 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_73-47-114_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452988 Mar 16 17:13 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_75-47-116_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450829 Mar 16 17:13 20f468b9-afe3-49d8-905b-712c9f9fd441-0_77-47-118_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451479 Mar 16 17:13 4a173caf-18dc-420c-9d61-4dc6ab366845-0_76-47-117_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449907 Mar 16 17:13 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_81-47-122_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451221 Mar 16 17:13 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_84-47-125_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450260 Mar 16 17:13 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_83-47-124_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451619 Mar 16 17:13 034163f0-823c-42f0-b109-6282d7dab628-0_79-47-120_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450229 Mar 16 17:13 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_82-47-123_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451089 Mar 16 17:13 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_80-47-121_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450510 Mar 16 17:13 3fc10e56-cf07-447f-a209-22f5e92b4351-0_78-47-119_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450287 Mar 16 17:13 2d6b1e2b-6336-4be5-be19-cc272c3fa62c-0_89-47-130_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450477 Mar 16 17:13 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_87-47-128_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451361 Mar 16 17:13 48f19ef7-7062-4368-9def-9b25d1578ac0-0_85-47-126_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  449939 Mar 16 17:13 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_88-47-129_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450747 Mar 16 17:13 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_86-47-127_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452180 Mar 16 17:13 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_94-47-135_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  452128 Mar 16 17:13 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_93-47-134_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450150 Mar 16 17:13 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_92-47-133_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451382 Mar 16 17:13 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_91-47-132_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  446216 Mar 16 17:13 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_95-47-136_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  451804 Mar 16 17:13 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_90-47-131_20220316171316850.parquet
-rw-r--r--  1 gavin  wheel  450421 Mar 16 17:15 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_35-47-304_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449626 Mar 16 17:15 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_36-47-305_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451791 Mar 16 17:15 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_38-47-307_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450152 Mar 16 17:15 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_37-47-306_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449382 Mar 16 17:15 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_44-47-313_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451295 Mar 16 17:15 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_40-47-309_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451052 Mar 16 17:15 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_48-47-317_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450520 Mar 16 17:15 070bf517-195c-48a0-b0f1-423a4a482592-0_47-47-316_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449577 Mar 16 17:15 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_39-47-308_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450254 Mar 16 17:15 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_43-47-312_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450888 Mar 16 17:15 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_45-47-314_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451649 Mar 16 17:15 c500a137-757a-4355-90fb-0a38e17b215c-0_46-47-315_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449829 Mar 16 17:15 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_42-47-311_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451847 Mar 16 17:15 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_41-47-310_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450199 Mar 16 17:15 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_50-47-319_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451284 Mar 16 17:15 034163f0-823c-42f0-b109-6282d7dab628-0_49-47-318_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451149 Mar 16 17:15 4a173caf-18dc-420c-9d61-4dc6ab366845-0_55-47-324_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449894 Mar 16 17:15 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_57-47-326_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450497 Mar 16 17:15 20f468b9-afe3-49d8-905b-712c9f9fd441-0_54-47-323_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449934 Mar 16 17:15 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_52-47-321_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  452662 Mar 16 17:15 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_56-47-325_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449616 Mar 16 17:15 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_51-47-320_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451849 Mar 16 17:15 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_53-47-322_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451479 Mar 16 17:15 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_58-47-327_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451489 Mar 16 17:15 2aeac070-d67e-4ca3-a186-b5d9c383876e-0_184-53-453_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449996 Mar 16 17:15 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_186-53-455_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450542 Mar 16 17:15 121be9f5-0774-426b-b061-f93817e8568e-0_185-53-454_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450542 Mar 16 17:15 8b9f11a8-b108-462c-b45a-3cef7766d61d-0_187-53-456_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451778 Mar 16 17:15 b7f8ab04-fee2-4455-88a3-53c44a1a8299-0_188-53-457_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450995 Mar 16 17:15 1144170e-b154-4d85-8eed-866393cf2ed4-0_189-53-458_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451001 Mar 16 17:15 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_191-53-460_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451192 Mar 16 17:15 95d724df-ec57-42c8-9de3-9c0f3b0888b3-0_196-53-465_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451287 Mar 16 17:15 20dc678a-c2fe-4156-bb61-f04cb269f248-0_192-53-461_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  452000 Mar 16 17:15 6d5fbff6-ff69-4a3a-9534-407b19154730-0_194-53-463_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450895 Mar 16 17:15 2bc61fdd-e343-4f9b-babd-161478d227a8-0_193-53-462_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450804 Mar 16 17:15 d07f11e8-b78e-4643-aef9-86903d89866d-0_198-53-467_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450956 Mar 16 17:15 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_195-53-464_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451451 Mar 16 17:15 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_197-53-466_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450494 Mar 16 17:15 9e70261a-ecf2-4706-a8cd-861e3f02786c-0_190-53-459_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450380 Mar 16 17:15 99b51786-5ec4-4c96-8892-47accd2882db-0_199-53-468_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451301 Mar 16 17:15 09bd1d0d-2d12-4ce3-abd0-50d0c8687e2b-0_200-53-469_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451434 Mar 16 17:15 148bc858-2bec-44fe-891f-60f6165dc17e-0_201-53-470_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450780 Mar 16 17:15 db3e9783-4674-4d73-8fe9-abcd47f19218-0_202-53-471_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451296 Mar 16 17:15 88d248a8-8f77-4ede-8d78-ef953afb8fc2-0_211-53-480_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450226 Mar 16 17:15 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_203-53-472_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450595 Mar 16 17:15 59028aa4-a91f-4c82-9d34-d25fee9af494-0_206-53-475_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451978 Mar 16 17:15 20089b65-91f7-43d5-b7d6-d54029ed92db-0_205-53-474_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  449974 Mar 16 17:15 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_212-53-481_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451370 Mar 16 17:15 b44f53a6-e46f-4ae3-9a57-91c8b9cf3692-0_209-53-478_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450341 Mar 16 17:15 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_204-53-473_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451524 Mar 16 17:15 9ecda712-650f-497a-ae46-1f81462342ee-0_208-53-477_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451285 Mar 16 17:15 261b1fda-52df-466f-8858-2d167b7d8216-0_210-53-479_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  450942 Mar 16 17:15 7da056d3-9c71-4731-89f4-cf6bb37d4a5b-0_207-53-476_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451893 Mar 16 17:15 db51b6eb-6107-4121-9328-eb78d950aaf5-0_213-53-482_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451526 Mar 16 17:15 bf3784e7-2a00-4cb3-a9d1-9c49fe59b91d-0_214-53-483_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  442060 Mar 16 17:15 c1eef0f6-fb4a-4fb8-86b1-ad5836668fac-0_215-53-484_20220316171506014.parquet
-rw-r--r--  1 gavin  wheel  451663 Mar 16 17:17 6d5fbff6-ff69-4a3a-9534-407b19154730-0_56-47-548_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450951 Mar 16 17:17 261b1fda-52df-466f-8858-2d167b7d8216-0_57-47-549_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450197 Mar 16 17:17 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_55-47-547_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450950 Mar 16 17:17 20dc678a-c2fe-4156-bb61-f04cb269f248-0_59-47-551_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449619 Mar 16 17:17 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_58-47-550_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449429 Mar 16 17:17 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_60-47-552_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451134 Mar 16 17:17 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_61-47-553_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449658 Mar 16 17:17 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_62-47-554_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450563 Mar 16 17:17 070bf517-195c-48a0-b0f1-423a4a482592-0_63-47-555_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451027 Mar 16 17:17 48f19ef7-7062-4368-9def-9b25d1578ac0-0_64-47-556_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450568 Mar 16 17:17 2bc61fdd-e343-4f9b-babd-161478d227a8-0_65-47-557_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449592 Mar 16 17:17 e65fbef1-5499-41cc-b956-236f1f070e4d-0_68-47-560_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449664 Mar 16 17:17 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_66-47-558_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451183 Mar 16 17:17 4a173caf-18dc-420c-9d61-4dc6ab366845-0_67-47-559_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  445898 Mar 16 17:17 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_69-47-561_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449662 Mar 16 17:17 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_70-47-562_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451641 Mar 16 17:17 20089b65-91f7-43d5-b7d6-d54029ed92db-0_71-47-563_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451900 Mar 16 17:17 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_73-47-565_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450669 Mar 16 17:17 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_72-47-564_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449892 Mar 16 17:17 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_74-47-566_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450468 Mar 16 17:17 d07f11e8-b78e-4643-aef9-86903d89866d-0_76-47-568_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451699 Mar 16 17:17 c500a137-757a-4355-90fb-0a38e17b215c-0_75-47-567_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450769 Mar 16 17:17 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_78-47-570_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451093 Mar 16 17:17 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_77-47-569_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451117 Mar 16 17:17 148bc858-2bec-44fe-891f-60f6165dc17e-0_79-47-571_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449980 Mar 16 17:17 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_80-47-572_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450023 Mar 16 17:17 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_82-47-574_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450684 Mar 16 17:17 1144170e-b154-4d85-8eed-866393cf2ed4-0_81-47-573_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450622 Mar 16 17:17 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_83-47-575_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450188 Mar 16 17:17 3fc10e56-cf07-447f-a209-22f5e92b4351-0_85-47-577_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452714 Mar 16 17:17 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_84-47-576_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451258 Mar 16 17:17 60d615c4-0355-44aa-8692-93dcac902bad-0_278-53-770_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451940 Mar 16 17:17 7cb0f09b-e54d-4c78-9306-53e044676a94-0_277-53-769_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450629 Mar 16 17:17 e923dad4-a72c-49a0-8885-0982008ceccf-0_276-53-768_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451596 Mar 16 17:17 d3e720e3-88fa-446b-a522-d44cf7497674-0_281-53-773_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450251 Mar 16 17:17 6f5d416f-e8ff-456c-a0fc-75c7a3a32308-0_279-53-771_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451586 Mar 16 17:17 64d0a424-7744-4423-b86d-3fce04a5046b-0_283-53-775_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450225 Mar 16 17:17 29ae854e-45a1-4222-98c1-2d0acc1c8884-0_282-53-774_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451972 Mar 16 17:17 06f51fc0-1078-4d3a-ae1f-67684917eb1b-0_280-53-772_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451680 Mar 16 17:17 e095fe81-189b-4f5d-8395-7239667ad2d8-0_286-53-778_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451413 Mar 16 17:17 7261ddee-42c5-4abc-8c4f-8f226072b826-0_285-53-777_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451844 Mar 16 17:17 0bcbbdc8-1d2e-4526-88f0-11d17cfff835-0_284-53-776_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451977 Mar 16 17:17 7eddebc9-6828-4053-ad7d-0831daa000ae-0_289-53-781_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451308 Mar 16 17:17 65982416-73c0-41b0-972c-4c4355f3b235-0_290-53-782_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451078 Mar 16 17:17 e62f224c-8d1f-4349-ad3f-81886fba230d-0_288-53-780_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450278 Mar 16 17:17 53764539-435e-4f6f-a7e4-d48fe7966389-0_287-53-779_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451599 Mar 16 17:17 139b4d5c-ed80-42ff-8f97-671f13390edb-0_291-53-783_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452320 Mar 16 17:17 98d687a0-c93d-4326-b4a0-6f2540bd9aeb-0_292-53-784_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450597 Mar 16 17:17 285cb019-5481-4556-8a96-bc8248028778-0_295-53-787_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451878 Mar 16 17:17 1a7a27ab-8670-4b9c-bfde-3a1dba0669d8-0_293-53-785_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  449657 Mar 16 17:17 fecc9a85-371d-493c-83f2-35b5849ee0cb-0_294-53-786_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450069 Mar 16 17:17 c3cc410b-13c3-4f00-ac53-fec9a8f307a1-0_297-53-789_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451344 Mar 16 17:17 ebf3f3b4-6e1a-4bb7-be85-0d2ee31126e5-0_298-53-790_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  452456 Mar 16 17:17 a77ec88d-1ccb-40a5-b8a8-eaf214cfa6a9-0_296-53-788_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450826 Mar 16 17:17 574b5510-586a-49a5-a2ac-b75ebe90b87d-0_301-53-793_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450752 Mar 16 17:17 bc2d6241-343c-4c47-9125-2f63b269117e-0_300-53-792_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450909 Mar 16 17:17 4b4abae6-0dfd-4761-871f-2224ddccb1ec-0_303-53-795_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450745 Mar 16 17:17 e7068c6a-7a2d-43ae-b030-59bebd68f36b-0_304-53-796_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  451571 Mar 16 17:17 5bdc45fa-88d9-47b4-9597-e717ae7dbc48-0_302-53-794_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450517 Mar 16 17:17 41f629d8-189b-4867-af6f-bb91effe9f74-0_299-53-791_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450574 Mar 16 17:17 16c7513f-7f79-48d5-84ff-e784f2d1e795-0_305-53-797_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  450647 Mar 16 17:17 40b4c081-baab-430c-a9e3-e93d0527c923-0_306-53-798_20220316171648081.parquet
-rw-r--r--  1 gavin  wheel  654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel  737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
gavin@GavindeMacBook-Pro Friday % 
gavin@GavindeMacBook-Pro Friday % 
gavin@GavindeMacBook-Pro Friday % ll -rt # after running the code, only one historical version of the data is kept
total 14624
-rw-r--r--  1 gavin  wheel   654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel   667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel   699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel   742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel   737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
-rw-r--r--  1 gavin  wheel   699397 Mar 17 10:37 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_13-41-92_20220317103652699.parquet
-rw-r--r--  1 gavin  wheel   667672 Mar 17 10:37 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_10-41-89_20220317103652699.parquet
-rw-r--r--  1 gavin  wheel   742453 Mar 17 10:37 3e74d509-e720-416b-a1c7-9380e5e4a830-0_14-41-93_20220317103652699.parquet
-rw-r--r--  1 gavin  wheel   737512 Mar 17 10:37 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_11-41-90_20220317103652699.parquet
-rw-r--r--  1 gavin  wheel  1080742 Mar 17 10:37 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_12-41-91_20220317103652699.parquet
gavin@GavindeMacBook-Pro Friday % 
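
The listings above are easier to compare once the file names are split into their parts. Hudi base files are named `<fileId>_<writeToken>_<instantTime>.parquet`, so grouping by file ID shows how many versions of each file group remain on disk. The following is a minimal sketch; the helper names `parse_base_file_name` and `versions_per_file_group` are hypothetical, and the sample names are copied from the listing above.

```python
from collections import defaultdict


def parse_base_file_name(file_name):
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into its three parts."""
    stem = file_name.rsplit(".", 1)[0]
    # The file ID contains hyphens but no underscores, so '_' is a safe separator here.
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time


def versions_per_file_group(file_names):
    """Map each file ID to the sorted list of instant times that still exist on disk."""
    groups = defaultdict(list)
    for name in file_names:
        file_id, _, instant_time = parse_base_file_name(name)
        groups[file_id].append(instant_time)
    return {file_id: sorted(times) for file_id, times in groups.items()}


# Four of the file names left in the Friday partition after cleaning
names = [
    "a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet",
    "a4a3dfe1-d87d-473c-93b2-713b79aef185-0_13-41-92_20220317103652699.parquet",
    "ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet",
    "ee1bf9af-1636-4863-8b3a-7a7a15861573-0_12-41-91_20220317103652699.parquet",
]

versions = versions_per_file_group(names)
for file_id, instants in versions.items():
    print(file_id, instants)
```

Each file group ends up with two instants: the latest write plus one retained historical version, which matches the comment on the second `ll -rt` command.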

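Data Quality

Conclusion: in overwrite mode, if the data being written does not meet the expectation, the write fails with "At least one pre-commit validation failed", so the data can be validated before it is committed. (When I ran the same code in append mode, it failed immediately with "java.util.ConcurrentModificationException"; I have not yet worked out why append mode raises this error.)

Configuration involved

Test code

The existing data contains no record with age 17, while the new data contains one such record. The validation query "select count(*) from {tableName} where age=17" is run against the incoming data with an expected result of "0", so the write is expected to be rejected.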

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student_for_pre_validate"
    basePath = "file:///tmp/hudi_tables/student_for_pre_validate"
    csv_path = '/Users/gavin/Desktop/tmp/student_2_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    csv_df.show()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
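        # Note: in Hudi's validator config, the single.value queries below are read by
        # SqlQuerySingleResultPreCommitValidator; the equality validator used here reads
        # hoodie.precommit.validators.equality.sql.queries instead.
        'hoodie.precommit.validators': 'org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator',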
        'hoodie.precommit.validators.single.value.sql.queries': f'select count(*) from {tableName} where age=17#0'
    }

    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)

Test data: existing data

id,name,age,adress,partition_path
6070762,林婷,16,江苏省凯县魏都刘街G座 217662,Saturday
4566846,汤斌,15,上海市志强市清城辽阳路k座 407334,Tuesday
1120433,刘宁,22,黑龙江省马鞍山县龙潭傅路F座 707735,Wednesday
305942,李凯,19,重庆市欣市合川姚路K座 936317,Monday
1604502,冉秀芳,25,江苏省阜新市沈北新陆街c座 997546,Wednesday

Test data: incremental data

id,name,age,adress,partition_path
6031576,艾璐,19,北京市静市西夏韩路M座 566903,Wednesday
3565711,刘霞,17,四川省石家庄市滨城杨路w座 549721,Friday
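
To make the expected outcome concrete, here is a small pure-Python sketch of the "query#expected" convention used in the validator setting. It does not call Hudi or Spark: the helper `parse_single_value_query` is hypothetical, and the count over `incoming_rows` is only a stand-in for what the SQL query would return on the two incremental rows above.

```python
def parse_single_value_query(spec):
    """Split a 'query#expected' string into (query, expected)."""
    query, expected = spec.rsplit("#", 1)
    return query.strip(), expected.strip()


# The two incremental rows from the test data (only the fields used here)
incoming_rows = [
    {"id": "6031576", "age": 19, "partition_path": "Wednesday"},
    {"id": "3565711", "age": 17, "partition_path": "Friday"},
]

spec = "select count(*) from student_for_pre_validate where age=17#0"
query, expected = parse_single_value_query(spec)

# Stand-in for executing the query: count the incoming rows with age 17
actual = str(sum(1 for row in incoming_rows if row["age"] == 17))

validation_passed = actual == expected
print(f"query={query!r} expected={expected} actual={actual} passed={validation_passed}")
```

Because one incoming row has age 17, the actual count differs from the expected "0", which is the condition under which the write is supposed to be rejected.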

Result of running the code

Error: At least one pre-commit validation failed

py4j.protocol.Py4JJavaError: An error occurred while calling o52.save.
: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220317111613163
	at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:63)
	at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
	at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:160)
	at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:217)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:277)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieValidationException: At least one pre-commit validation failed
	at org.apache.hudi.client.utils.SparkValidatorUtils.runValidators(SparkValidatorUtils.java:94)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.runPrecommitValidators(BaseSparkCommitActionExecutor.java:399)
	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.commitOnAutoCommit(BaseCommitActionExecutor.java:140)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.updateIndexAndCommitIfNeeded(BaseSparkCommitActionExecutor.java:265)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:180)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:82)
	at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:56)
	... 39 more

SQL CREATE TABLE statement for Hudi

Simply set "STORED AS INPUTFORMAT" in the SQL to "org.apache.hudi.hadoop.HoodieParquetInputFormat"; everything else stays the same as for an ordinary table.
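
As an illustration of that rule, the sketch below builds such a Hive DDL string in Python. Everything except the input format class is an assumption for the sake of the example: the helper `hudi_hive_ddl` is hypothetical, the column types are guessed as `string` (the CSV was read without schema inference), the SerDe and output format are the usual Parquet ones, and the location is the base path used in the test code with the `file://` scheme dropped.

```python
def hudi_hive_ddl(table, columns, partition_column, location):
    """Build a Hive DDL string for a Hudi copy-on-write table (illustrative only)."""
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        ")\n"
        f"PARTITIONED BY (`{partition_column}` string)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        # The only Hudi-specific line: the input format named in the text above
        "STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'\n"
        f"LOCATION '{location}'"
    )


ddl = hudi_hive_ddl(
    table="student_for_pre_validate",
    columns=[("id", "string"), ("name", "string"), ("age", "string"), ("adress", "string")],
    partition_column="partition_path",
    location="/tmp/hudi_tables/student_for_pre_validate",
)
print(ddl)
```

The resulting string could then be run in Hive or another engine that accepts Hive DDL; partitions would still need to be registered separately.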


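Other parameters for reference

  • "hoodie.clean.automatic"

    Default true: whether automatic data cleaning is enabled. If it is disabled, upsert does not run the cleaning task.

  • "hoodie.clean.async"

    Default false: whether files are cleaned asynchronously. Asynchronous cleaning works by starting a background thread, which is invoked while the client executes an upsert.

  • "hoodie.cleaner.policy"

    Default HoodieCleaningPolicy.KEEP_LATEST_COMMITS: the data cleaning policy. Two policies are available: KEEP_LATEST_FILE_VERSIONS and KEEP_LATEST_COMMITS.

  • "hoodie.cleaner.commits.retained"

    Default 10: takes effect under the KEEP_LATEST_COMMITS policy. The number of fileId version files to retain is derived from the number of commits. Because it is based on the commit count, this value must not be greater than hoodie.keep.min.commits (the minimum number of commit metadata entries to keep).

  • "hoodie.cleaner.fileversions.retained"

    Default 3: takes effect under the KEEP_LATEST_FILE_VERSIONS policy. The number of fileId version files to retain is derived from the number of file versions.

  • "hoodie.parquet.small.file.limit":

    Default 104857600 (100 MB): files smaller than 100 MB are treated as small files, and newly inserted data is assigned to them.

  • "hoodie.copyonwrite.record.size.estimate":

    Default 1024 (1 KB): the estimated size of one record, used to calculate how many records a bucket can hold.

  • "hoodie.record.size.estimation.threshold":

    Default 1: at the very beginning, when the parquet files hold no data, the default 1 KB is used as the estimated record size. Once a fileId file is larger than (hoodie.record.size.estimation.threshold * hoodie.parquet.small.file.limit), the record size is instead calculated as (fileId file size / total number of records in the file), so this value acts as a weight.

  • "hoodie.parquet.max.file.size":

    Default 120 * 1024 * 1024 (120 MB): the maximum file size. During bucketing, the number of records that can still be inserted into a file is calculated as this size minus the current fileId file size, divided by the estimated record size. Because the record size is an estimated average, the actual maximum file size only approximates the configured value.

  • "hoodie.copyonwrite.insert.split.size":

    Default 500000: controls exactly how many records one fileId file holds; this requires turning off automatic bucketing via hoodie.copyonwrite.insert.auto.split.

  • "hoodie.copyonwrite.insert.auto.split":

    Default true: whether automatic bucketing is enabled.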
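
The sizing parameters above interact through some simple arithmetic. The sketch below follows the descriptions in this list, not Hudi's source code: the function names are hypothetical, and the constants are the default values quoted above.

```python
import math

SMALL_FILE_LIMIT = 104857600        # hoodie.parquet.small.file.limit (bytes)
MAX_FILE_SIZE = 120 * 1024 * 1024   # hoodie.parquet.max.file.size (bytes)
RECORD_SIZE_ESTIMATE = 1024         # hoodie.copyonwrite.record.size.estimate (bytes)
ESTIMATION_THRESHOLD = 1.0          # hoodie.record.size.estimation.threshold


def estimate_record_size(file_size_bytes, record_count):
    """Use the real average once a file exceeds threshold * small-file limit, else the default."""
    if record_count > 0 and file_size_bytes > ESTIMATION_THRESHOLD * SMALL_FILE_LIMIT:
        return math.ceil(file_size_bytes / record_count)
    return RECORD_SIZE_ESTIMATE


def records_that_fit(current_file_size, record_size):
    """How many more records a small file can take before reaching the maximum file size."""
    if current_file_size >= SMALL_FILE_LIMIT:
        return 0  # not a small file, so no new inserts are routed to it
    return (MAX_FILE_SIZE - current_file_size) // record_size


# A file of about 450 KB, similar in size to the parquet files in the listings of this article
small_file_size = 450000
record_size = estimate_record_size(small_file_size, record_count=1400)
print(record_size, records_that_fit(small_file_size, record_size))
```

With the defaults, a 450 KB file is far below the threshold, so the 1 KB default estimate applies and the file is still considered to have room for a large number of additional records. This is why small initial writes tend to be merged into existing files on later inserts.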
