
How to convert a semi-structured JSON string column into a DataFrame in PySpark?


I am trying to convert the following semi-structured JSON string, stored in a column, into a structured DataFrame:

2020-09-24T08:03:01.633Z 10.1.20.1 {"timstamp":"2020-09-24 13:33:01","sourcename":"local","Keys":-9serverkey,"Type":"status","key1":2,"key2":"INFO","key3":5145,"key4":"valuekey4","key5":"{valuekey5}","key6":0,"key7":12,"key8":0,"key9":76,"key10":5,"other_key1":5,"other_key2":"value2","other_key3":"other value 3\r\n\t\r\nSubject:\r\n\tsecurity other_key4:\t\totherKey4\r\n\taccount otherkey5:\t\tothervalue5$\r\n\taccount}

I first created a schema to load the data above into a DataFrame:


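from pyspark.sql.types import DateType, StringType, StructField, StructType

# Three space-separated fields: date, source IP, and the raw event text.
schema = StructType([
    StructField("Date", DateType()),
    StructField("Source IP", StringType()),
    StructField("Event Type", StringType()),
])

# `session` is an existing SparkSession.
df = session.read.option("header", "true").option("delimiter", " ").csv(
    "mypath\\logs.txt", schema=schema)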

The output has the following structure:

+----------+-------------+--------------------+
|      Date|    Source IP|          Event Type|
+----------+-------------+--------------------+
|2020-09-2 |10.1.20.1    |{"timstamp":"202...|

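Now I only need to extract the JSON fields from "timstamp" through "key10" from the log data above; the rest of the JSON string can be discarded. How can I convert the "Event Type" column, which contains the JSON string, into structured columns in this case?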

Any help would be appreciated.

Thanks
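One possible approach, offered as a sketch and not as a definitive answer: the payload in the sample line is not valid JSON (the value of "Keys" is unquoted and the last string is never closed), so a strict JSON parser such as `from_json` would probably reject the whole record. The sketch below therefore pulls each wanted key out with a regular expression. It also reads the file as plain text, on the assumption that a space delimiter would split the payload at spaces inside values. The names `WANTED_KEYS`, `key_pattern`, `parse_line`, and `logs_to_frame` are hypothetical helpers introduced here for illustration.

```python
import re

# Keys to keep, "timstamp" through "key10" (spelled as in the log).
WANTED_KEYS = ["timstamp", "sourcename", "Keys", "Type"] + [
    f"key{i}" for i in range(1, 11)
]


def key_pattern(key):
    """Regex capturing the value of `key`: a quoted string, or a bare token
    running up to the next comma or closing brace. The leading quote in the
    pattern keeps "key1" from matching inside "other_key1" or "key10"."""
    return f'"{key}":("[^"]*"|[^,}}]*)'


def parse_line(line):
    """Plain-Python version, handy for checking the patterns on one log line.
    Assumes the line has the form '<date> <ip> <payload>'."""
    date, source_ip, event = line.split(" ", 2)
    row = {"Date": date, "Source IP": source_ip}
    for key in WANTED_KEYS:
        match = re.search(key_pattern(key), event)
        # Every value is kept as a string; missing keys become None.
        row[key] = match.group(1).strip('"') if match else None
    return row


def logs_to_frame(spark, path):
    """Spark version: read raw lines, split off the prefix, extract each key.
    `spark` is an existing SparkSession; `path` points at the log file."""
    # Imported here so parse_line() can be used without Spark installed.
    from pyspark.sql import functions as F

    raw = spark.read.text(path)  # a single string column named "value"
    prefix = r"^(\S+) (\S+) (.*)$"
    df = raw.select(
        F.regexp_extract("value", prefix, 1).alias("Date"),
        F.regexp_extract("value", prefix, 2).alias("Source IP"),
        F.regexp_extract("value", prefix, 3).alias("Event Type"),
    )
    for key in WANTED_KEYS:
        value = F.regexp_extract(df["Event Type"], key_pattern(key), 1)
        # Strip the surrounding quotes from string values.
        df = df.withColumn(key, F.regexp_replace(value, '^"|"$', ""))
    return df
```

With an existing session, `logs_to_frame(session, "mypath\\logs.txt").drop("Event Type")` would give one string column per key. Note that `regexp_extract` returns an empty string, not null, when a key is absent, and that numeric columns such as `key1` would still need a `cast("int")` afterwards.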
