Extracting elements from an uneven Pandas dict-like Series

Given the following sample data (10 records):

test_df = pd.DataFrame({"PN_id": ["745d626b","745d626b","fce503fb","df3d727e","56c00531","72ebb2b3","5d1bc5d3","5c32fc8a","5c32fc8a"],"PN_raw": ['{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},{"tag":"BR","group":"ua_locale_country"},{"tag":"90_P******_BR","group":"******_CRM"}]}}','{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}',{"tag":"All_S*****_BR",{"tag":"P_90_or_S_90_BR",{"tag":"P_90_or_S_90_ESLA",{"and":[{"tag":"P_90_or_S_90_ES","group":"******_CRM"}]}]}}']})

How can I produce the desired output below? (Either in the same DataFrame or in a separate one, which I think is also possible.)

test_df_desired = pd.DataFrame({"PN_id":["745d626b","segment":["67537044-27db-4a0b-b5b7-362c9c5b2ba7","67537044-27db-4a0b-b5b7-362c9c5b2ba7","850c8d94-1236-45a1-93fc-08b0337b4059","850c8d94-1236-45a1-93fc-08b0337b4059"],"tag_1":["BR","BR","All_S****_ES","P_90_or_S_90_ESLA","P_90_or_S_90_ES","P_90_or_S_90_ES"],"group_1":["ua_locale_country","ua_locale_country","******_CRM","******_CRM"],"tag_2":["90_P******_BR","90_P******_BR",np.nan,"All_S*****_BR","P_90_or_S_90_BR",np.nan],"group_2":["******_CRM",np.nan]})

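So far, using pd.json_normalize(test_df["PN_raw"].apply(ast.literal_eval), record_path=["audience", "and"]), I have managed to un-nest the records whose dict path is audience -> and. This does not work for records whose path is audience -> and -> and, and I cannot get around it by passing record_path=["audience", "and", "and"], which I thought would work. I think this needs to be solved by looping over the Series and applying a different function depending on whether a record contains one or two (or more) levels of "and".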

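The current output not only fails in the case described above, it also has the problem of getting the data "transposed" into the correct rows (if you run the line above, you will see what I mean).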

Solution

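import json

import pandas as pd


def promote(d):
    """Yield the leaf dicts of d, flattening nested single-key 'and' groups."""
    if list(d.keys()) == ['and']:
        # A pure "and" wrapper: recurse into each of its children.
        for i in d['and']:
            yield from promote(i)
    else:
        # A leaf such as {"segment": ...} or {"tag": ..., "group": ...}.
        yield d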
parsed = []
data = {"PN_id": ["745d626b","745d626b","fce503fb","df3d727e","56c00531","72ebb2b3","5d1bc5d3","5c32fc8a","5c32fc8a"],"PN_raw": ['{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},{"tag":"BR","group":"ua_locale_country"},{"tag":"90_P******_BR","group":"******_CRM"}]}}','{"audience":{"and":[{"segment":"67537044-27db-4a0b-b5b7-362c9c5b2ba7"},'{"audience":{"and":[{"and":[{"segment":"850c8d94-1236-45a1-93fc-08b0337b4059"}]},{"and":[{"tag":"All_S****_ES","group":"******_CRM"}]}]}}',{"tag":"All_S*****_BR",{"tag":"P_90_or_S_90_BR",{"tag":"P_90_or_S_90_ESLA",{"and":[{"tag":"P_90_or_S_90_ES","group":"******_CRM"}]}]}}']}

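# Parse every raw JSON string into a Python dict.
data['PN'] = list(map(json.loads, data['PN_raw']))
for ind, pn_id in enumerate(data['PN_id']):
    parsed_sub = {'PN_id': pn_id}
    count = 1  # suffix for the next tag/group pair
    for chunk in promote(data['PN'][ind]['audience']):
        if 'segment' in chunk:
            # The segment keeps its plain column name.
            parsed_sub.update(chunk)
        else:
            # Number each tag/group pair: tag_1, group_1, tag_2, group_2, ...
            parsed_sub.update({f'{k}_{count}': v for k, v in chunk.items()})
            count += 1
    parsed.append(parsed_sub)

# Keys missing from a record become NaN in the resulting DataFrame.
df = pd.DataFrame(parsed)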

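I find that Pandas obscures or mangles JSON, so I prefer to handle it with plain Python. I would say that JSON can take so many different shapes that there is no good way to write a generic "make_the_json_flat()" function, but if such a thing exists, I would love to learn about it.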
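
As a quick check of the approach, here is a minimal sketch on two made-up records, one with the audience -> and shape and one with the audience -> and -> and shape. The sample strings and the flatten_record helper are my own illustration, not the data from the question, and the tag_N / group_N naming is assumed from the desired output. It uses only the standard library.

```python
import json


def promote(node):
    """Yield leaf dicts, descending through dicts whose only key is 'and'."""
    if list(node.keys()) == ["and"]:
        for child in node["and"]:
            yield from promote(child)
    else:
        yield node


def flatten_record(pn_id, raw):
    """Build one flat row from a raw JSON string (hypothetical helper)."""
    row = {"PN_id": pn_id}
    count = 1  # suffix for the next tag/group pair
    for chunk in promote(json.loads(raw)["audience"]):
        if "segment" in chunk:
            row.update(chunk)
        else:
            row.update({f"{key}_{count}": value for key, value in chunk.items()})
            count += 1
    return row


# Hypothetical sample records, not the data from the question.
samples = [
    (
        "id_flat",
        '{"audience":{"and":[{"segment":"seg-1"},'
        '{"tag":"A","group":"g1"},{"tag":"B","group":"g2"}]}}',
    ),
    (
        "id_nested",
        '{"audience":{"and":[{"and":[{"segment":"seg-2"}]},'
        '{"and":[{"tag":"C","group":"g3"}]}]}}',
    ),
]

rows = [flatten_record(pn_id, raw) for pn_id, raw in samples]
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame should give one row per record, with NaN wherever a record has no second tag/group pair.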