How do I extract an Excel spreadsheet with multiple tabs?

I want to ingest Excel files from a remote folder or an SFTP server. This works for CSV files, but not for XLS or XLSX files.

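Solution

The code below provides functions that convert xls / xlsx files into Spark dataframes.

To use these functions, you need to:

1. Copy and paste the functions below into your repository (for example, into a utils.py file)
2. Create a new transform script
3. In the transform script, copy and paste the example transform and modify the parameters.

Example transform using the functions: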

from transforms.api import transform, Input, Output

# Adjust this import to wherever you saved the helper functions.
# "myproject.utils" is a placeholder module path, not a real package name.
from myproject.utils import write_datasets_from_excel_sheets

# Parameters for ingesting an Excel file with multiple tabs
SHEETS_PARAMETERS = {
    # Each block takes one tab of the Excel file ("Artists" here), uses row "header"
    # as the column names, and writes a dataset to "output_dataset_path"
    "Artists": {
        "output_dataset_path": "/Studio/studio_datasource/artists",
        "header": 7,
    },
    "Records": {
        "output_dataset_path": "/Studio/studio_datasource/records",
        "header": 0,
    },
    "Albums": {
        "output_dataset_path": "/Studio/studio_datasource/albums",
        "header": 1,
    },
}

# Define the dictionary of outputs needed by the transform's decorator
outputs = {
    sheet_parameter["output_dataset_path"]: Output(sheet_parameter["output_dataset_path"])
    for sheet_parameter in SHEETS_PARAMETERS.values()
}


@transform(
    my_input=Input("/Studio/studio_datasource/excel_file"),
    **outputs
)
def my_compute_function(my_input, ctx, **outputs):
    # Attach each Output object to the parameters of its sheet
    for parameters in SHEETS_PARAMETERS.values():
        parameters["output_dataset"] = outputs[parameters["output_dataset_path"]]

    # Transform the sheets to datasets
    write_datasets_from_excel_sheets(my_input, SHEETS_PARAMETERS, ctx)
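
One thing to watch in the example above: the outputs dictionary is keyed by dataset path, so two sheets that share an "output_dataset_path" would silently collapse into a single Output. The sketch below is not part of the original answer. It is a small check you could run on the configuration before building the outputs, and the helper name build_output_paths is made up for illustration. It uses only the standard library and makes no Foundry calls.

```python
def build_output_paths(sheets_parameters):
    """Hypothetical helper: validate the config and return {sheet name: output path}."""
    required_keys = {"output_dataset_path", "header"}
    paths = {}
    for sheet_name, parameters in sheets_parameters.items():
        # Every sheet entry needs both keys used by the example transform
        missing = required_keys - parameters.keys()
        if missing:
            raise ValueError(f"Sheet {sheet_name!r} is missing keys: {sorted(missing)}")
        path = parameters["output_dataset_path"]
        # A repeated path would overwrite an earlier entry in the outputs dict
        if path in paths.values():
            raise ValueError(f"Output path {path!r} is used by more than one sheet")
        paths[sheet_name] = path
    return paths


SHEETS_PARAMETERS = {
    "Artists": {"output_dataset_path": "/Studio/studio_datasource/artists", "header": 7},
    "Records": {"output_dataset_path": "/Studio/studio_datasource/records", "header": 0},
}

output_paths = build_output_paths(SHEETS_PARAMETERS)
print(output_paths)
```

Calling this once at module level, before the outputs dictionary is built, would surface a misconfigured sheet when the code is checked rather than in the middle of a build.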

Functions:

import pandas as pd
import tempfile
import shutil

def normalize_column_name(cn):
    """
    Replace characters that are forbidden in column names with underscores
    """
    invalid_chars = ",;{}()\n\t="
    for c in invalid_chars:
        cn = cn.replace(c, "_")
    return cn

def get_dataframe_from_excel_sheet(fp, ctx, sheet_name, header):
    """
    Generate a Spark dataframe from a sheet in an Excel file available in Foundry
    Arguments:
        fp:
            TemporaryFile object from which the Excel file can be read
        ctx:
            Context object available in a transform
        sheet_name:
            Name of the sheet
        header:
            Row (0-indexed) to use for the column labels of the parsed DataFrame.
            If a list of integers is passed, those row positions will be combined into a MultiIndex.
            Use None if there is no header.
    """
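    # Read only the requested sheet. No encoding is passed: Excel workbooks are
    # binary files, and recent pandas versions reject an `encoding` argument here.
    dataframe = pd.read_excel(
        fp,
        sheet_name=sheet_name,
        header=header,
    )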

    # Cast all the dataframes as string