AWS Glue ETL job does not read data from a manually created external table
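I created an external table that points to an S3 bucket containing gzip-compressed CSV files. I can read the data through Redshift Spectrum as expected. I plan to transform these files with an AWS Glue ETL job. When I try to read from this external table in the Glue script, no data is read, and printing the DataFrame with show(5) displays null instead. Here is the script.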
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup: resolve arguments and create the contexts.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the manually created external table from the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="test-db", table_name="table_rs_external")

# Convert to a Spark DataFrame and print the first rows; this is where null shows up.
df_manual = datasource0.toDF()
df_manual.show(5)

# Map the columns, resolve ambiguous types, drop null fields, and write Parquet to S3.
applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=[("column1", "string", "column1", "string"), ....], transformation_ctx="applymapping1")
resolvechoice2 = ResolveChoice.apply(frame=applymapping1, choice="make_struct", transformation_ctx="resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame=resolvechoice2, transformation_ctx="dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame=dropnullfields3, connection_type="s3", connection_options={"path": "s3:XXXXX"}, format="parquet", transformation_ctx="datasink4")
job.commit()
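One thing that may be worth checking is the table's entry in the Glue Data Catalog. Tables created by hand (for example with DDL from Redshift Spectrum or Athena) sometimes lack metadata that a crawler would normally fill in, such as the classification table property, and a missing classification is a commonly reported reason for a Glue ETL job returning no data. The sketch below is only a diagnostic aid under that assumption, not a confirmed fix for this case. The helper name find_catalog_gaps and the sample table are hypothetical; the dictionary layout follows the Table element returned by boto3's Glue get_table call.

```python
# Hypothetical helper: lists catalog fields that are empty or missing.
# It only inspects a dict, so it runs without AWS access.
def find_catalog_gaps(table):
    params = table.get("Parameters") or {}
    storage = table.get("StorageDescriptor") or {}
    serde = storage.get("SerdeInfo") or {}

    gaps = []
    # Assumption: Glue ETL uses this property to pick the file format.
    if not params.get("classification"):
        gaps.append("Parameters.classification")
    if not storage.get("Location"):
        gaps.append("StorageDescriptor.Location")
    if not storage.get("Columns"):
        gaps.append("StorageDescriptor.Columns")
    if not serde.get("SerializationLibrary"):
        gaps.append("StorageDescriptor.SerdeInfo.SerializationLibrary")
    return gaps


# In a real check the dict would come from the catalog, for example:
#   import boto3
#   table = boto3.client("glue").get_table(
#       DatabaseName="test-db", Name="table_rs_external")["Table"]
# The sample below is made up and mimics a hand-written table definition.
example_table = {
    "Name": "table_rs_external",
    "Parameters": {"EXTERNAL": "TRUE"},
    "StorageDescriptor": {
        "Location": "s3://example-bucket/example-prefix/",
        "Columns": [{"Name": "column1", "Type": "string"}],
        "SerdeInfo": {
            "SerializationLibrary":
                "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        },
    },
}

gaps = find_catalog_gaps(example_table)
print(gaps)
```

If the real table reports a missing classification, setting that table property to csv in the catalog and rerunning the job would be a reasonable next experiment.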