How to resolve "An HTTP header that's mandatory for this request is not specified" when writing a Parquet file
I have two ADLS Gen2 storage accounts, both with hierarchical namespace enabled. In my Python notebook, I read a CSV file from one storage account and, after some enrichment, write it out as a Parquet file to the other storage account.
StatusCode=400, An HTTP header that's mandatory for this request is not specified
Any help is greatly appreciated.
Below is my notebook snippet...
# Databricks notebook source
# MAGIC %python
# MAGIC
# MAGIC STAGING_MOUNTPOINT = "/mnt/inputfiles"
# MAGIC if STAGING_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(STAGING_MOUNTPOINT)
# MAGIC
# MAGIC PERM_MOUNTPOINT = "/mnt/outputfiles"
# MAGIC if PERM_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(PERM_MOUNTPOINT)

STAGING_STORAGE_ACCOUNT = "--------"
STAGING_CONTAINER = "--------"
STAGING_FOLDER = "--------"
PERM_STORAGE_ACCOUNT = "--------"
PERM_CONTAINER = "--------"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "#####################",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="DemoScope", key="DemoSecret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/**********************/oauth2/token"}

STAGING_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=STAGING_CONTAINER, storage_acct=STAGING_STORAGE_ACCOUNT)

try:
    dbutils.fs.mount(
        source=STAGING_SOURCE, mount_point=STAGING_MOUNTPOINT, extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e

print("Staging Storage mount Success.")

inputDemoFile = "{}/{}/demo.csv".format(STAGING_MOUNTPOINT, STAGING_FOLDER)

readDF = (spark
          .read.option("header", True)
          .schema(inputSchema)  # inputSchema is defined earlier in the notebook
          .option("inferSchema", True)
          .csv(inputDemoFile))

PERM_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=PERM_CONTAINER, storage_acct=PERM_STORAGE_ACCOUNT)

try:
    dbutils.fs.mount(
        source=PERM_SOURCE, mount_point=PERM_MOUNTPOINT, extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e

print("Landing Storage mount Success.")

outPatientsFile = "{}/patients.parquet".format(PERM_MOUNTPOINT)
print("Writing to parquet file: " + outPatientsFile)
***The call below is failing with:
StatusCode=400
StatusDescription=An HTTP header that's mandatory for this request is not specified.
ErrorCode=
ErrorMessage=***
(readDF
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .option("compression", "snappy")
 .parquet(outPatientsFile))
Solution
I have summarized the solution as follows.
If you want to mount Azure Data Lake Storage Gen2 as an Azure Databricks file system, the URL should look like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/. For more details, please refer to here.
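To make the difference concrete, here is a minimal, hypothetical helper (the function name and the sample container/account values are my own, not from the original post) that builds a mount source with the correct dfs endpoint. The question's code combined the abfss scheme with blob.core.windows.net, which is exactly the mismatch behind the 400 error:

```python
def adls_gen2_source(container: str, storage_account: str, directory: str = "") -> str:
    """Build an abfss:// URL suitable as the `source` of dbutils.fs.mount.

    For ADLS Gen2 the abfss scheme must target the DFS endpoint
    (dfs.core.windows.net), not the Blob endpoint (blob.core.windows.net).
    """
    return "abfss://{c}@{a}.dfs.core.windows.net/{d}".format(
        c=container, a=storage_account, d=directory.lstrip("/"))

# Correct form for the staging mount (placeholder names):
print(adls_gen2_source("mycontainer", "myaccount"))
# abfss://mycontainer@myaccount.dfs.core.windows.net/

# The question's code instead produced the blob endpoint:
# abfss://mycontainer@myaccount.blob.core.windows.net/  -> StatusCode=400
```

Swapping the URL built this way into the `STAGING_SOURCE`/`PERM_SOURCE` assignments is the whole fix; the OAuth configs and mount calls can stay as they are.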
For example:
- Create an Azure Data Lake Storage Gen2 account.
az login
az storage account create \
--name <account-name> \
--resource-group <group name> \
--location westus \
--sku Standard_RAGRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
- Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
- Mount the Azure Data Lake Gen2 file system in Azure Databricks (Python)
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point="/mnt/flightdata",
    extra_configs=configs)
Key points to note when mounting a storage account in Azure Databricks:
For Azure Blob Storage:
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>"
For Azure Data Lake Storage Gen2:
source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/"
To mount an Azure Data Lake Storage Gen2 file system, or a folder inside it, as an Azure Databricks file system, the URL should look like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/
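The scheme/endpoint pairing above can be captured in a small validation function (a sketch of my own; the function is not part of dbutils or the original answer) that flags exactly the mismatch behind the question's 400 error:

```python
def check_mount_source(source: str) -> bool:
    """Return True if the mount source pairs its scheme with the right endpoint.

    wasbs:// (Azure Blob Storage)        -> blob.core.windows.net
    abfss:// (Azure Data Lake Storage Gen2) -> dfs.core.windows.net
    """
    if source.startswith("wasbs://"):
        return ".blob.core.windows.net" in source
    if source.startswith("abfss://"):
        return ".dfs.core.windows.net" in source
    return False  # Unknown scheme for a Databricks mount.

print(check_mount_source("abfss://fs@acct.dfs.core.windows.net/"))   # True
print(check_mount_source("abfss://fs@acct.blob.core.windows.net/"))  # False: the question's bug
```

Running such a check before calling dbutils.fs.mount turns the opaque "HTTP header that's mandatory" 400 into an immediate, readable failure.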