How do I pass a CustomDataAsset to a DataContext to run custom expectations on a batch?

I have a CustomPandasDataset with a custom expectation:

from datetime import date, datetime, timedelta

from great_expectations.data_asset import DataAsset
from great_expectations.dataset import PandasDataset


class CustomPandasDataset(PandasDataset):

    _data_asset_type = "CustomPandasDataset"

    @DataAsset.expectation(["column", "datetime_match", "datetime_diff"])
    def expect_column_max_value_to_match_datetime(
        self, column: str, datetime_match: datetime = None, datetime_diff: tuple = None
    ) -> dict:
        """
        Check if data is constantly updated by matching the max datetime column to a
        datetime value or to a datetime difference.
        """
        max_datetime = self[column].max()

        # Default to today's date when no reference value is given.
        if datetime_match is None:
            datetime_match = date.today()

        if datetime_diff:
            # datetime_diff is unpacked into timedelta(), so (4,) means 4 days.
            success = (datetime_match - timedelta(*datetime_diff)) <= max_datetime <= datetime_match
        else:
            success = max_datetime == datetime_match

        result = {
            "data_max_value": max_datetime,
            "expected_max_value": str(datetime_match),
            "expected_datetime_diff": datetime_diff,
        }

        return {"success": success, "result": result}
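The success check inside the expectation can be tried out without Great Expectations at all. The sketch below is only an illustration: `max_within_window` is a hypothetical helper name, and it uses plain `datetime` values instead of a pandas column. It shows how `timedelta(*datetime_diff)` unpacks the tuple, so `(4,)` is read as four days.

```python
from datetime import datetime, timedelta


def max_within_window(max_value: datetime, reference: datetime, datetime_diff: tuple = None) -> bool:
    """Hypothetical helper that mirrors the success check of the custom expectation."""
    if datetime_diff:
        # timedelta's first positional argument is days, so (4,) becomes 4 days.
        return (reference - timedelta(*datetime_diff)) <= max_value <= reference
    # Without a window, the maximum must equal the reference exactly.
    return max_value == reference


reference = datetime(2021, 6, 10)
print(max_within_window(datetime(2021, 6, 8), reference, (4,)))   # True: 2 days old
print(max_within_window(datetime(2021, 6, 1), reference, (4,)))   # False: 9 days old
print(max_within_window(datetime(2021, 6, 10), reference))        # True: exact match
```

Passing the window as a tuple keeps the call flexible: `(0, 3600)` would unpack to zero days and 3600 seconds.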

I want to run the expectation expect_column_max_value_to_match_datetime on a given Pandas DataFrame:

import great_expectations as ge

expectation_suite_name = "df-raw-expectations"

suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

df_ge = ge.from_pandas(df, dataset_class=CustomPandasDataset)

batch_kwargs = {'dataset': df_ge, 'datasource': 'df_raw_datasource'}

# Get batch of data
batch = context.get_batch(batch_kwargs, suite)

This is the batch I get from the DataContext. Now, when I run the expectation on this batch:

datetime_diff = (4,)
batch.expect_column_max_value_to_match_datetime(column='DATE', datetime_diff=datetime_diff)

I get the following error:

AttributeError: 'PandasDataset' object has no attribute 'expect_column_max_value_to_match_datetime'

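Following the docs, I specified dataset_class=CustomPandasDataset when constructing the Great Expectations dataset. Running the expectation directly on df_ge does work, but it does not work on the batch data.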

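Solution

According to the docs:

To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource configuration or in the batch_kwargs for a specific batch.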
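The error message reports a plain 'PandasDataset' object, which suggests the batch is not the same object as df_ge but a fresh wrapper around the data. The toy model below is a sketch of that idea only: `BaseDataset`, `CustomDataset`, and this `get_batch` are hypothetical stand-ins and do not reproduce Great Expectations internals. It shows why a loader that re-wraps data in a configured class loses subclass methods unless it is told which class to use.

```python
class BaseDataset:
    """Toy stand-in for PandasDataset (not the real library class)."""

    def __init__(self, rows):
        self.rows = list(rows)


class CustomDataset(BaseDataset):
    """Toy stand-in for CustomPandasDataset with one extra method."""

    def expect_rows_not_empty(self):
        return {"success": len(self.rows) > 0}


def get_batch(data, data_asset_type=BaseDataset):
    # Hypothetical loader: it keeps only the raw rows and re-wraps them in the
    # configured class, so the class of `data` itself is discarded.
    rows = data.rows if isinstance(data, BaseDataset) else data
    return data_asset_type(rows)


wrapped = CustomDataset([1, 2, 3])
default_batch = get_batch(wrapped)
custom_batch = get_batch(wrapped, data_asset_type=CustomDataset)

print(type(default_batch).__name__)                      # BaseDataset
print(hasattr(default_batch, "expect_rows_not_empty"))   # False
print(custom_batch.expect_rows_not_empty())              # {'success': True}
```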

So pass CustomPandasDataset through the data_asset_type parameter of the get_batch() function:

# Get batch of data
batch = context.get_batch(batch_kwargs, suite, data_asset_type=CustomPandasDataset)

Or define it in the context configuration:

from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

data_context_config = DataContextConfig(
    ...
    datasources={
        "sales_raw_datasource": {
            "data_asset_type": {
                "class_name": "CustomPandasDataset",
                "module_name": "custom_dataset",
            },
            "class_name": "PandasDatasource",
            "module_name": "great_expectations.datasource",
        }
    },
    ...
)
context = BaseDataContext(project_config=data_context_config)

Here CustomPandasDataset must be importable from the module/script custom_dataset.py.
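A "class_name" / "module_name" pair like the one in the configuration is typically turned into a class by importing the module and looking the class up by name. The sketch below shows that general Python mechanism with `importlib`; `load_class` is a hypothetical helper, not the loader Great Expectations actually uses, and a standard-library class stands in for CustomPandasDataset so the example runs anywhere.

```python
import importlib


def load_class(class_name: str, module_name: str):
    """Hypothetical resolver: import the module, then fetch the class by name."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in config; with the real project it would name custom_dataset and
# CustomPandasDataset, and custom_dataset.py would need to be on sys.path.
config = {"class_name": "OrderedDict", "module_name": "collections"}
cls = load_class(**config)
print(cls.__name__)  # OrderedDict
```

With this kind of lookup, a ModuleNotFoundError for the configured module usually means the script's directory is not on the import path.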
