Extracting elements from a non-uniform Pandas Series of dict-like values

Given the following sample data (10 records):

test_df = pd.DataFrame({"PN_id": ["745d626b","745d626b","fce503fb","df3d727e","56c00531","72ebb2b3","5d1bc5d3","5c32fc8a","5c32fc8a"],"PN_raw": ['{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},{"tag":"BR","group":"ua_locale_country"},{"tag":"90_P******_BR","group":"******_CRM"}]}}','{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}',{"tag":"All_S*****_BR",{"tag":"P_90_or_S_90_BR",{"tag":"P_90_or_S_90_ESLA",{"and":[{"tag":"P_90_or_S_90_ES","group":"******_CRM"}]}]}}']})

How can I get the desired output below? (Either in the same DataFrame or in a separate one; I think either would work.)

test_df_desired = pd.DataFrame({"PN_id":["745d626b","segment":["67537044-27db-4a0b-b5b7-362c9c5b2ba7","67537044-27db-4a0b-b5b7-362c9c5b2ba7","850c8d94-1236-45a1-93fc-08b0337b4059","850c8d94-1236-45a1-93fc-08b0337b4059"],"tag_1":["BR","BR","All_S****_ES","P_90_or_S_90_ESLA","P_90_or_S_90_ES","P_90_or_S_90_ES"],"group_1":["ua_locale_country","ua_locale_country","******_CRM","******_CRM"],"tag_2":["90_P******_BR","90_P******_BR",np.nan,"All_S*****_BR","P_90_or_S_90_BR",np.nan],"group_2":["******_CRM",np.nan]})

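So far, using pd.json_normalize(test_df["PN_raw"].apply(ast.literal_eval), record_path=["audience", "and"]), I have managed to un-nest the records whose dict path is audience -> and. This does not work for records whose path is audience -> and -> and, and I can't get around it by adding record_path=["audience", "and", "and"], which I thought would work. I suspect the solution is to loop over the Series and apply a different function depending on whether a record contains one "and" or two or more.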

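Besides the failure described above, the current output has a second problem: the data still needs to be "transposed" into the correct rows (run the line above and you'll see what I mean).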
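Both problems can be reproduced on a small scale. The sketch below takes one record of each shape from the sample data and parses it with json.loads (an assumption of this sketch; the question uses ast.literal_eval). The variable names flat, nested, flat_df and nested_df are illustrative only.

```python
import json

import pandas as pd

# One record per nesting shape, copied from the sample data.
flat = json.loads(
    '{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'
    '{"tag":"BR","group":"ua_locale_country"},'
    '{"tag":"90_P******_BR","group":"******_CRM"}]}}'
)
nested = json.loads(
    '{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},'
    '{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}'
)

# audience -> and: every list item becomes its own row, so a single record
# is spread over three rows instead of one wide row.
flat_df = pd.json_normalize(flat, record_path=["audience", "and"])
print(flat_df)

# audience -> and -> and: the same record_path stops one level too early,
# leaving a single "and" column whose cells are still lists.
nested_df = pd.json_normalize(nested, record_path=["audience", "and"])
print(nested_df)
```

The first frame shows the "transposition" issue (three rows for one PN_raw value), and the second shows why a single record_path cannot serve both shapes.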

Solution

import json
import pandas as pd


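def promote(d):
    """Yield the leaf dicts of d, descending through any dict whose only key is 'and'."""
    if list(d.keys()) == ['and']:
        # Pure "and" wrapper: recurse into each of its items.
        for i in d['and']:
            yield from promote(i)
    else:
        # Leaf such as {"segment": ...} or {"tag": ..., "group": ...}.
        yield d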

parsed = []
data = {"PN_id": ["745d626b","745d626b","fce503fb","df3d727e","56c00531","72ebb2b3","5d1bc5d3","5c32fc8a","5c32fc8a"],"PN_raw": ['{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},{"tag":"BR","group":"ua_locale_country"},{"tag":"90_P******_BR","group":"******_CRM"}]}}','{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}',{"tag":"All_S*****_BR",{"tag":"P_90_or_S_90_BR",{"tag":"P_90_or_S_90_ESLA",{"and":[{"tag":"P_90_or_S_90_ES","group":"******_CRM"}]}]}}']}

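# Parse every raw JSON string once.
data['PN'] = list(map(json.loads, data['PN_raw']))
for ind, pn_id in enumerate(data['PN_id']):
    parsed_sub = {'PN_id': pn_id}
    count = 1  # suffix for the next tag/group pair
    for chunk in promote(data['PN'][ind]['audience']):
        if 'segment' in chunk:
            # The segment keeps its plain column name.
            parsed_sub.update(chunk)
        else:
            # Number each tag/group pair: tag_1, group_1, tag_2, ...
            # (underscore added to match the desired column names).
            parsed_sub.update({f'{k}_{count}': v for k, v in chunk.items()})
            count += 1
    parsed.append(parsed_sub)

# One row per PN_id; keys missing from a record become NaN.
df = pd.DataFrame(parsed)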
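As a compact check of this approach, here is a minimal sketch that runs the same logic on two records, one per nesting shape, taken from the sample data. The id "example-nested" is a placeholder rather than a value from the data, and the columns are suffixed with an underscore (tag_1, group_1) to match the desired output.

```python
import json

import pandas as pd


def promote(d):
    """Yield leaf dicts, descending through dicts whose only key is 'and'."""
    if list(d.keys()) == ["and"]:
        for item in d["and"]:
            yield from promote(item)
    else:
        yield d


# (PN_id, PN_raw) pairs; "example-nested" is a hypothetical id.
records = [
    ("745d626b",
     '{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'
     '{"tag":"BR","group":"ua_locale_country"},'
     '{"tag":"90_P******_BR","group":"******_CRM"}]}}'),
    ("example-nested",
     '{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},'
     '{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}'),
]

rows = []
for pn_id, raw in records:
    row = {"PN_id": pn_id}
    count = 1  # suffix for the next tag/group pair
    for chunk in promote(json.loads(raw)["audience"]):
        if "segment" in chunk:
            row.update(chunk)
        else:
            row.update({f"{k}_{count}": v for k, v in chunk.items()})
            count += 1
    rows.append(row)

df = pd.DataFrame(rows)
print(df.columns.tolist())
# ['PN_id', 'segment', 'tag_1', 'group_1', 'tag_2', 'group_2']
```

The nested record has only one tag/group pair, so its tag_2 and group_2 cells come out as NaN, which is the shape the question asks for.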

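I find that Pandas tends to confuse or mangle JSON, so I prefer to handle it with plain Python. JSON can take such varied shapes that I doubt there is a good way to write a generic "make_the_json_flat()" function, but if such a thing exists, I'd love to learn about it.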
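To illustrate why a generic flattener is of limited help here, consider a minimal sketch of one. The function flatten and its dotted-path key scheme are assumptions made for this example, not an existing library API.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths.

    Hypothetical helper for illustration; empty dicts and lists are dropped.
    """
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{index}."))
    else:
        # Scalar leaf: strip the trailing dot from the accumulated path.
        out[prefix[:-1]] = obj
    return out


nested = json.loads(
    '{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},'
    '{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}'
)
flat_record = flatten(nested)
for path, value in flat_record.items():
    print(path, "=", value)
```

The keys encode the full path (for example audience.and.1.and.0.tag), so records with different nesting depths end up in different columns. A shape-aware helper such as promote() is still needed to line them up.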
