Spark: how to convert a JSON string into a struct column without a schema

Spark: 3.0.0, Scala: 2.12.8

My DataFrame has a column that holds a JSON string, and I want to create a new column of StructType from it.


|temp_json_string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+
|{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}|
+

root
 |-- temp_json_string: string (nullable = true)

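The JSON string looks like this:

{
  "name": "test",
  "id": "12",
  "category": [
    {
      "products": ["A", "B"],
      "displayName": "test_1",
      "displayLabel": "test1"
    },
    {
      "products": ["C"],
      "displayName": "test_2",
      "displayLabel": "test2"
    }
  ],
  "createdAt": "",
  "createdBy": ""
}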

I want to create a new column of struct type, so I tried:

 dataFrame
      .withColumn("temp_json_struct",struct(col("temp_json_string")))
      .select("temp_json_struct")

The schema I get now is:

root
 |-- temp_json_struct: struct (nullable = false)
 |    |-- temp_json_string: string (nullable = true)

What I am looking for is:

root
 |-- temp_json_struct: struct (nullable = false)
 |    |-- name: string (nullable = true)
 |    |-- category: array (nullable = true)
 |    |    |-- products: array (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- displayLabel: string (nullable = true)
 |    |-- createdAt: timestamp (nullable = true)
 |    |-- updatedAt: timestamp (nullable = true)

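Also, I do not know the schema of the JSON string in advance.

I have been looking for other options but could not find a solution.

Solution

I had the same problem with data from Mongo. _doc is the column that holds the JSON string. My data contains multiple documents, which is why the first line iterates over every row to extract the schema. If you already know your schema, simply use it in place of json_schema.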

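from pyspark.sql.functions import col, from_json

# infer one schema from the JSON strings found in every row of _doc
json_schema = spark.read.json(df.rdd.map(lambda row: row._doc)).schema
df = df.withColumn('new_json_column', from_json(col('_doc'), json_schema))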

There are at least two different ways to retrieve/discover the schema of a given JSON.

For illustration, let's first create some data:

import org.apache.spark.sql.types.StructType
import spark.implicits._ // needed for toDF, toDS and the $"..." syntax used below

val jsData = Seq(
  ("""{
    "name":"test","id":"12","category":[
    {
      "products":[
        "A","B"
      ],"displayName":"test_1","displayLabel":"test1"
    },{
      "products":[
        "C"
      ],"displayName":"test_2","displayLabel":"test2"
    }
  ],"createdAt":"","createdBy":""}""")
)

Option 1: schema_of_json

The first option is to use the built-in function schema_of_json. It returns the schema of the given JSON in DDL format:

// take the JSON string of the first row
val json = jsData.toDF("js").collect()(0).getString(0)

// before the replace below, schema_of_json returns:
// struct<category:array<struct<displayLabel:string,displayName:string,products:array<string>>>,createdAt:null,createdBy:null,id:string,name:string>
val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')")
                            .collect()(0) // get the 1st row
                            .getString(0) // get the 1st column of the row as a string
                            .replace("null","string") // empty values such as "createdAt":"" are inferred as type null, so replace null with string

// the result is a StructType with a single field, js_schema, that holds the discovered struct
val schema: StructType = StructType.fromDDL(s"js_schema $ddlSchema")

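Note that you might expect schema_of_json to work at the column level as well, i.e. schema_of_json(js_col). Unfortunately this does not work as expected, because the function needs a literal JSON string rather than a column reference, so we have to pass the string itself.

Option 2: use the Spark JSON reader (recommended)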
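The difference between passing a literal and passing a column can be checked with a small experiment. The following is a minimal sketch, assuming a local SparkSession; the object name SchemaOfJsonLiteral and the helpers ddlOfLiteral and worksOnColumn are hypothetical names introduced only for this illustration. The exact text of the returned DDL string differs between Spark versions, so it is printed rather than asserted.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, schema_of_json}

object SchemaOfJsonLiteral {
  // Hypothetical helper: ask schema_of_json for the DDL of a literal JSON string.
  def ddlOfLiteral(spark: SparkSession, json: String): String =
    spark.range(1).select(schema_of_json(lit(json))).head().getString(0)

  // Hypothetical helper: try schema_of_json on a column reference and report
  // whether it succeeded. On Spark 3.0 this is expected to fail during analysis.
  def worksOnColumn(df: DataFrame, jsonCol: String): Boolean =
    Try(df.select(schema_of_json(col(jsonCol))).head()).isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("schema-of-json-literal")
      .getOrCreate()
    import spark.implicits._

    val json = """{"name":"test","id":"12"}"""
    val df = Seq(json).toDF("js")

    println(ddlOfLiteral(spark, json)) // DDL string describing the two fields
    println(worksOnColumn(df, "js"))   // expected to print false

    spark.stop()
  }
}
```

In practice this means one representative JSON document has to be collected to the driver first, which is what the snippet in Option 1 does.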

import org.apache.spark.sql.functions.from_json

val schema: StructType = spark.read.json(jsData.toDS).schema

// schema.printTreeString

// root
//  |-- category: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- displayLabel: string (nullable = true)
//  |    |    |-- displayName: string (nullable = true)
//  |    |    |-- products: array (nullable = true)
//  |    |    |    |-- element: string (containsNull = true)
//  |-- createdAt: string (nullable = true)
//  |-- createdBy: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)

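As you can see, here the schema is produced directly as a StructType rather than as a DDL string as in the previous case.

Having discovered the schema, we can move on to the next step: converting the JSON data into a struct. For that we use the built-in function from_json: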

jsData.toDF("js")
      .withColumn("temp_json_struct",from_json($"js",schema))
      .printSchema()

// root
//  |-- js: string (nullable = true)
//  |-- temp_json_struct: struct (nullable = true)
//  |    |-- category: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- displayLabel: string (nullable = true)
//  |    |    |    |-- displayName: string (nullable = true)
//  |    |    |    |-- products: array (nullable = true)
//  |    |    |    |    |-- element: string (containsNull = true)
//  |    |-- createdAt: string (nullable = true)
//  |    |-- createdBy: string (nullable = true)
//  |    |-- id: string (nullable = true)
//  |    |-- name: string (nullable = true)
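Once the column is a real struct, its fields can be addressed with dot notation and its arrays can be exploded. The following is a minimal sketch of that follow-up step, assuming a local SparkSession; the object name ReadParsedStruct and the helpers parse and flattenCategories are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}

object ReadParsedStruct {
  // Sample document from the question, written on one line for brevity.
  val sample: String =
    """{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}"""

  // Infer the schema with the JSON reader, then parse the string column with from_json.
  def parse(spark: SparkSession, docs: Seq[String]): DataFrame = {
    import spark.implicits._
    val ds = docs.toDS()
    val schema = spark.read.json(ds).schema
    ds.toDF("js").withColumn("temp_json_struct", from_json(col("js"), schema))
  }

  // Produce one output row per element of the category array.
  def flattenCategories(parsed: DataFrame): DataFrame =
    parsed
      .select(
        col("temp_json_struct.name").as("name"),
        explode(col("temp_json_struct.category")).as("cat"))
      .select(
        col("name"),
        col("cat.displayName").as("displayName"),
        col("cat.products").as("products"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("read-parsed-struct")
      .getOrCreate()

    flattenCategories(parse(spark, Seq(sample))).show(false)

    spark.stop()
  }
}
```

For the sample document this yields one row per category entry, each carrying the top-level name next to the entry's displayName and products.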

val df = ??? //create your dataframe having the 'temp_json_string' column

//convert dataframe to dataset
val ds = df.select("temp_json_string").as[String]

//read as json
spark.read.json(ds)
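Putting the pieces together for the DataFrame from the question, the schema inferred by the JSON reader can be passed straight to from_json so that the original rows are kept and only a struct column is added. The following is a minimal sketch, assuming a local SparkSession; the object name JsonStringToStruct and the helper withParsedJson are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

object JsonStringToStruct {
  // Hypothetical helper: infer the schema of the JSON strings in jsonCol and
  // add a struct column named structCol that holds the parsed documents.
  def withParsedJson(df: DataFrame, jsonCol: String, structCol: String): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._

    // Only non-null strings take part in schema inference.
    val jsonStrings = df.select(col(jsonCol)).where(col(jsonCol).isNotNull).as[String]
    val inferred = spark.read.json(jsonStrings).schema

    df.withColumn(structCol, from_json(col(jsonCol), inferred))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("json-string-to-struct")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      """{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"}],"createdAt":"","createdBy":""}"""
    ).toDF("temp_json_string")

    withParsedJson(df, "temp_json_string", "temp_json_struct").printSchema()

    spark.stop()
  }
}
```

Schema inference reads the JSON strings once before the actual parsing, so on large data it adds an extra pass; the JSON reader's samplingRatio option can limit how much of the input is inspected.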
