How to speed up a Datastore export to a GCS bucket (DATASTORE_BACKUP), possibly with Apache Spark
I want to export my Datastore to a GCS bucket daily in DATASTORE_BACKUP format. Currently I trigger the export via the GCP Datastore Export service with a curl command, as follows:
```shell
curl \
  -X POST \
  -H "Authorization: Bearer $access_token" \
  -H "Content-Type: application/json" \
  https://datastore.googleapis.com/v1/projects/viu-data-warehouse-prod:export \
  -d '{
    "labels": {
      "exportVersion": "'"$BUILD_ID"'"
    },
    "outputUrlPrefix": "'"$output_url"'",
    "entityFilter": {
      "namespaceIds": ["customer_one_view"],
      "kinds": ["user_view"]
    }
  }'
```
I would like to do this with Apache Spark to make it faster. My problem is that the export takes 5 to 6 hours to finish, and as the data grows it keeps getting slower. I need suggestions for optimizing this process through parallel processing; Apache Spark seems like a good fit because it is very fast. Please suggest how I can do this.
Solution
If you are not tied to Spark or to a specific export format, you can start from the "Datastore to Cloud Storage Text" Apache Beam (Dataflow) batch template at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#datastore-to-cloud-storage-text and fork it to fit your needs. Unlike the managed export service, a Dataflow job reads entities in parallel across autoscaled workers, which directly addresses the parallelism you are looking for.
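As a starting point before forking the template, you can launch the stock version with `gcloud`. This is a minimal sketch: the project ID, bucket path, region, and job name below are placeholder assumptions to replace with your own values, and the namespace and kind are taken from the entity filter in your curl command.

```shell
#!/bin/sh
# Hypothetical values -- substitute your own project, bucket, and region.
PROJECT_ID="viu-data-warehouse-prod"
OUTPUT_PREFIX="gs://my-export-bucket/datastore-export/$(date +%Y%m%d)"

# Run the provided "Datastore to GCS Text" Dataflow batch template.
# It reads entities matching the GQL query in parallel across workers
# and writes them to GCS as JSON text files under textWritePrefix.
gcloud dataflow jobs run "datastore-export-$(date +%Y%m%d)" \
  --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
  --region us-central1 \
  --parameters \
datastoreReadGqlQuery="SELECT * FROM user_view",\
datastoreReadProjectId="$PROJECT_ID",\
datastoreReadNamespace="customer_one_view",\
textWritePrefix="$OUTPUT_PREFIX/"
```

Note the output is newline-delimited JSON rather than DATASTORE_BACKUP format; if you need the backup format specifically, the managed export service remains the only supported producer, and the template fork would be for building your own parallel pipeline on the entity data instead.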