How can I change the maximum concurrent runs of Glue jobs run in parallel from an AWS Step Functions Map state?
I have a Step Function with a Map state that runs 5 parallel Glue jobs with custom parameters, like this:
"Run glue Jobs": {
  "Type": "Map",
  "MaxConcurrency": 5,
  "ItemsPath": "$.payload",
  "Iterator": {
    "StartAt": "Run Generic glue Job",
    "States": {
      "Run Generic glue Job": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {
          "JobName": "glueJobName",
          "Arguments": {
            "--target_bucket": "target-bucket",
            "--target_path": "dir1/dir2/",
            "--job-language": "python",
            "--job-bookmark-option": "job-bookmark-disable",
            "--TempDir": "s3://temp-bucket/tmp-glue",
            "--continuous-log-logGroup": "gluecloudwatch",
            "--enable-continuous-cloudwatch-log": "true",
            "--enable-continuous-log-filter": "true",
            "--enable-metrics": "",
            "--kmskeyid": "arn:aws:kms:region:12345678901:alias/glue-kms-key"
          }
        },
        "End": true
      }
    }
  },
  "Next": "Finish"
}
When I run this stage of the step function, it throws an error: glue.ConcurrentRunsExceededException. Is there somewhere I can pass a parameter such as MaxConcurrentRuns: 5 (ExecutionProperty) so that I can run up to 5 jobs concurrently? I can't find any way to do this, only this unrelated resource: https://docs.aws.amazon.com/cli/latest/reference/glue/create-job.html
I could edit the job manually in the GUI, but I need to create everything from scratch in Terraform, so I also need to enable MaxConcurrentRuns = 5 from code. Any tips? Thanks.
Solution
Glue's maximum concurrent runs cannot be set from Step Functions. If the Step Functions Map state runs with MaxConcurrency 5, the Glue job itself must be created or updated with a maximum concurrent runs of at least 5.
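For an existing job, the Glue UpdateJob API can raise the limit without recreating the job. A minimal sketch, assuming a job named glueJobName; the helper below is hypothetical, and it exists because UpdateJob overwrites the whole job definition, so the safest pattern is to start from GetJob's output and only change ExecutionProperty:

```python
def build_job_update(job: dict, map_max_concurrency: int) -> dict:
    """Build a JobUpdate payload from a GetJob result, raising MaxConcurrentRuns
    to at least the Map state's MaxConcurrency. Hypothetical helper."""
    # UpdateJob replaces the whole definition, so start from the current one and
    # strip fields returned by GetJob that JobUpdate does not accept (e.g. Name,
    # timestamps). The exact set of read-only fields is an assumption here.
    read_only = {"Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"}
    update = {k: v for k, v in job.items() if k not in read_only}

    current = update.get("ExecutionProperty", {}).get("MaxConcurrentRuns", 1)
    # Never lower an already-higher limit.
    update["ExecutionProperty"] = {
        "MaxConcurrentRuns": max(current, map_max_concurrency)
    }
    return update


# Usage with boto3 (not executed here; requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# job = glue.get_job(JobName="glueJobName")["Job"]
# glue.update_job(JobName="glueJobName", JobUpdate=build_job_update(job, 5))
```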
When creating a Glue job from the AWS CLI, MaxConcurrentRuns can be passed as ExecutionProperty.MaxConcurrentRuns. Here is a sample JSON:
{
  "Name": "my-glue-job",
  "Role": "arn:aws:iam::111122223333:role/glue_etl_service_role",
  "ExecutionProperty": {
    "MaxConcurrentRuns": 5
  },
  "Command": {
    "Name": "glueetl",
    "ScriptLocation": "s3://temp-sandbox/code/scripts/MyGlueScript.scala",
    "PythonVersion": "3"
  },
  "DefaultArguments": {
    "--TempDir": "s3://aws-glue-temporary-111122223333-us-east-1/admin",
    "--class": "com.mycompany.corp.MainClass",
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-metrics": "",
    "--enable-spark-ui": "true",
    "--extra-jars": "s3://temp-sandbox/code/jars/MyExtraJar.jar",
    "--job-bookmark-option": "job-bookmark-disable",
    "--job-language": "scala",
    "--spark-event-logs-path": "s3://aws-glue-assets-111122223333-us-east-1/sparkHistoryLogs/"
  },
  "MaxRetries": 0,
  "Timeout": 2880,
  "WorkerType": "G.1X",
  "NumberOfWorkers": 10,
  "GlueVersion": "2.0"
}
Then create the job with the CLI:
aws glue create-job --cli-input-json file://myFolder/glue_job_props.json
Apparently there is also a way to specify the maximum concurrent runs directly in the Terraform resource "aws_glue_job", like this:
execution_property {
max_concurrent_runs = 5
}
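In context, the execution_property block sits inside a full aws_glue_job resource. A sketch under stated assumptions: the job name, role ARN, and script location are placeholders borrowed from the examples above, and max_concurrent_runs is set to match the Map state's MaxConcurrency of 5.

```hcl
# Hypothetical resource; names, ARN, and S3 paths are placeholders.
resource "aws_glue_job" "generic" {
  name              = "glueJobName"
  role_arn          = "arn:aws:iam::111122223333:role/glue_etl_service_role"
  glue_version      = "2.0"
  worker_type       = "G.1X"
  number_of_workers = 10

  command {
    name            = "glueetl"
    script_location = "s3://temp-sandbox/code/scripts/my_glue_script.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language"        = "python"
    "--job-bookmark-option" = "job-bookmark-disable"
  }

  # Must be >= the Step Functions Map state's MaxConcurrency.
  execution_property {
    max_concurrent_runs = 5
  }
}
```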
That should solve it; I'll give it a try.