How to store the records that fail Amazon Deequ checks in a separate Spark DataFrame

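I need to run data quality tests, so I am using Amazon Deequ. With the code below I can get the success/failure status of each data quality check, but I also want to retrieve every row that failed a check and store those rows in another DataFrame or a Hive table. How can I get them? Also, can Amazon Deequ be run on several datasets at the same time? Below is the code I am running; I need help with the part that stores the failed records.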

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus
import org.apache.spark.sql.SparkSession

object Test extends App {

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("amazon-deequ-test")
    .getOrCreate()

  // Sample rows. Fields marked "assumed" were not legible in the source listing
  // and are placeholders so that every tuple has five fields.
  val rows: Seq[(Int, String, String, String, Int)] = Seq(
    (1, "Thingy A", "awesome thing.", "high", 0),
    (2, "Thingy B", "available at http://thingb.com", null, 0), // numViews assumed
    (3, null, null, "low", 5),                                   // productName, description assumed
    (4, "Thingy D", "checkout https://thingd.ca", "low", -10),   // priority assumed
    (5, "Thingy E", null, "high", 12))                           // description, priority assumed

  val cols = Seq("id", "productName", "description", "priority", "numViews")
  val data = spark.createDataFrame(rows).toDF(cols: _*)
  data.show(false)

  val verificationResult: VerificationResult = VerificationSuite()
    .onData(data)
    .addCheck(
      Check(CheckLevel.Error, "integrity checks")
        // we expect 5 records
        .hasSize(_ == 5)
        // 'id' should never be NULL
        .isComplete("id")
        // 'id' should not contain duplicates
        .isUnique("id")
        // 'productName' should never be NULL
        .isComplete("productName")
        // 'priority' should only contain the values "high" and "low"
        .isContainedIn("priority", Array("high", "low"))
        // 'numViews' should not contain negative values
        .isNonNegative("numViews"))
    .addCheck(
      Check(CheckLevel.Warning, "distribution checks")
        // at least half of the 'description's should contain a url
        .containsURL("description", _ >= 0.5)
        // half of the items should have less than 10 'numViews'
        .hasApproxQuantile("numViews", 0.5, _ <= 10))
    .run()

  // One line per constraint, with its status and message
  val resultDataFrame = VerificationResult.checkResultsAsDataFrame(spark, verificationResult)
  resultDataFrame.show(false)

}
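
The check-results DataFrame printed above has one line per constraint with its status and message, so it does not contain the offending rows themselves. One possible workaround, sketched here with plain Spark only, is to restate each row-level check as a predicate that is true for a bad row and then filter on those predicates. The object name `FailedRowsSketch`, the rule table `violations`, the `failed_checks` column, and the sample rows are all hypothetical. The sketch also assumes that NULL values pass the value checks, so only the completeness rules flag them. Aggregate checks such as `hasSize` or `hasApproxQuantile` have no per-row meaning and are left out.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, count, lit, when}

object FailedRowsSketch {

  // Hypothetical rule table: check name -> predicate that is true for a BAD row.
  // Assumption: NULL passes the value checks; only the completeness rules flag it.
  val violations: Seq[(String, Column)] = Seq(
    "id is complete" -> col("id").isNull,
    "id is unique" -> (count(lit(1)).over(Window.partitionBy(col("id"))) > 1),
    "productName is complete" -> col("productName").isNull,
    "priority in (high, low)" ->
      (col("priority").isNotNull && !(col("priority").isin("high", "low"))),
    "numViews is non-negative" ->
      (col("numViews").isNotNull && col("numViews") < 0)
  )

  // Adds a "failed_checks" column and keeps only rows that violate some rule.
  def failedRows(df: DataFrame, rules: Seq[(String, Column)] = violations): DataFrame = {
    // when(...) without otherwise(...) yields NULL for rules the row passes,
    // and concat_ws skips NULLs, so a clean row ends up with an empty string.
    val labels = rules.map { case (name, isBad) => when(isBad, lit(name)) }
    df.withColumn("failed_checks", concat_ws("; ", labels: _*))
      .filter(col("failed_checks") =!= "")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("failed-rows-sketch")
      .getOrCreate()

    // Hypothetical sample, not the data from the question.
    val sample: Seq[(Int, String, String, Int)] = Seq(
      (1, "Thingy A", "high", 0),    // clean row
      (2, null, "low", 5),           // productName missing
      (3, "Thingy C", "urgent", 3),  // duplicate id, priority outside the allowed set
      (3, "Thingy D", "low", -10))   // duplicate id, negative numViews

    val data = spark.createDataFrame(sample)
      .toDF("id", "productName", "priority", "numViews")

    failedRows(data).show(false)
    spark.stop()
  }
}
```

The filtered DataFrame can then be persisted, for example with `failedRows(data).write.mode("append").saveAsTable("dq_failed_records")`, where the table name is made up and a Hive-enabled session is assumed. Since `onData` takes a single DataFrame, handling several datasets would mean one suite run per dataset; `failedRows` accepts any DataFrame and an optional rule table, so it can be applied to each dataset in turn.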
