How to flatten a nested dictionary into a Pandas DataFrame and join the rows
Given the following dictionary:
j = {
"source": "https://example.com","timestamp": "2021-04-12T19:34:24Z","durationInTicks": 1082400000,"duration": "PT1M48.24S","combinedRecognizedPhrases": [
{
"channel": 0,"lexical": "aaa","itn": "aaa","maskedITN": "aaa","display": "aaa"
}
],"recognizedPhrases": [
{
"recognitionStatus": "Success","channel": 0,"speaker": 1,"offset": "PT2.18S","duration": "PT3.88S","offsetInTicks": 21800000,"durationInTicks": 38800000,"nBest": [
{
"confidence": 0.9306252,"lexical": "gracias por llamar","itn": "gracias por llamar","maskedITN": "gracias por llamar","display": "¿Gracias por llamar","words": [
{
"word": "gracias","duration": "PT0.37S","durationInTicks": 3700000,"confidence": 0.930625
},{
"word": "por","offset": "PT2.55S","duration": "PT0.18S","offsetInTicks": 25500000,"durationInTicks": 1800000
},{
"word": "llamar","offset": "PT2.73S","duration": "PT0.22S","offsetInTicks": 27300000,"durationInTicks": 2200000,"confidence": 0.930625
}
]
}
]
},{
"recognitionStatus": "Success","speaker": 2,"offset": "PT6.85S","duration": "PT5.63S","offsetInTicks": 68500000,"durationInTicks": 56300000,"nBest": [
{
"confidence": 0.9306253,"lexical": "quiero hacer un pago","itn": "quiero hacer un pago","maskedITN": "quiero hacer un pago","display": "quiero hacer un pago"
}
]
},{
"speaker": 2,"offset": "PT13.29S","duration": "PT3.81S","offsetInTicks": 132900000,"durationInTicks": 38100000,"nBest": [
{
"confidence": 0.93062526,"lexical": "no sé bien la cantidad","itn": "no sé bien la cantidad","maskedITN": "no sé bien la cantidad","display": "no sé bien la cantidad"
}
]
}
]
}
Goal: get the information of interest into a single row of a DataFrame.
What have I done so far?
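import pandas as pd
# One row per nBest entry; note that the keyword argument is lowercase `meta`.
df = pd.json_normalize(j,record_path=['recognizedPhrases','nBest'],meta=['source','durationInTicks','duration',['recognizedPhrases','speaker']])
# Join every phrase spoken by the same speaker, then keep one row per speaker.
df['speech'] = df.groupby(['source','recognizedPhrases.speaker'])['display'].transform(lambda x : ' '.join(x))
df = df.drop_duplicates(subset=['recognizedPhrases.speaker'])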
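Why am I not satisfied with the output? It is a DataFrame with two rows (one per recognizedPhrases.speaker). I need all the information in a single row, with one column holding what speaker 1 said and another column holding what speaker 2 said.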
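To make the two-row result concrete, here is a minimal sketch that applies the same transform and drop_duplicates steps to a small hand-built frame. The frame is only a stand-in for the json_normalize output (an assumption made to keep the example short); its values are copied from the dictionary above.

```python
import pandas as pd

# Hand-built stand-in for the normalized frame (values copied from the question).
df = pd.DataFrame({
    "source": ["https://example.com"] * 3,
    "recognizedPhrases.speaker": [1, 2, 2],
    "display": ["¿Gracias por llamar", "quiero hacer un pago", "no sé bien la cantidad"],
})

# transform returns one value per input row, so the joined text is
# repeated on every row that belongs to the same speaker.
df["speech"] = (
    df.groupby(["source", "recognizedPhrases.speaker"])["display"]
      .transform(lambda x: " ".join(x))
)

# Dropping duplicate speakers leaves one row per speaker:
# the data is still in long format, not in a single wide row.
df = df.drop_duplicates(subset=["recognizedPhrases.speaker"])
print(df[["recognizedPhrases.speaker", "speech"]])
```

The frame still has one row per speaker, so a reshaping step is needed to move the speakers into columns.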
Edit 1: The result I expect looks like this:
expected_dict = {'source': {0: 'https://example.com'},'durationInTicks': {0: 1082400000},'duration': {0: 'PT1M48.24S'},'recognizedPhrases.speaker1': {0: '¿Gracias por llamar'},'recognizedPhrases.speaker2': {0: 'quiero hacer un pago no sé bien la cantidad'}}
expected_df = pd.DataFrame(expected_dict)
Solution
You can use pivot() to reshape df into the expected output:
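# Columns that identify the single output row (a list for index needs pandas 1.1 or later).
index = ['source','durationInTicks','duration']
columns = ['recognizedPhrases.speaker']
values = ['speech']
# Spread the speakers across columns; the index columns become the row index.
df = df[index+columns+values].pivot(index=index,columns=columns,values=values[0])
# Rename the columns 1 and 2 to recognizedPhrases.speaker1 and recognizedPhrases.speaker2.
df.columns = [f'{df.columns.name}{column}' for column in df.columns]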
source | durationInTicks | duration | recognizedPhrases.speaker1 | recognizedPhrases.speaker2 |
---|---|---|---|---|
https://example.com | 1082400000 | PT1M48.24S | ¿Gracias por llamar | quiero hacer un pago no sé bien la cantidad |
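For reference, here is a self-contained sketch of the whole pipeline. It is a variation rather than the answer's exact code: it uses groupby().agg() followed by unstack() in place of the transform, drop_duplicates and pivot steps, and it calls reset_index() so that source, durationInTicks and duration end up as ordinary columns, as in expected_df. The dictionary is a trimmed version of the one in the question, keeping only the keys the pipeline reads, and the names flat, speech, wide and speaker are illustrative.

```python
import pandas as pd

# Trimmed stand-in for the dictionary in the question: only the keys the
# pipeline reads are kept (an assumption made for brevity).
j = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

index = ["source", "durationInTicks", "duration"]
speaker = "recognizedPhrases.speaker"

# One row per nBest entry, carrying the top-level fields and the speaker as metadata.
flat = pd.json_normalize(
    j,
    record_path=["recognizedPhrases", "nBest"],
    meta=index + [["recognizedPhrases", "speaker"]],
)

# Join each speaker's phrases in their original order.
speech = flat.groupby(index + [speaker])["display"].agg(" ".join)

# Move the speaker level into the columns, rename them, and turn the
# remaining index levels back into ordinary columns.
wide = speech.unstack(speaker)
wide.columns = [f"{speaker}{s}" for s in wide.columns]
wide = wide.reset_index()

print(wide.to_dict())
```

The meta columns produced by json_normalize have object dtype, so a strict comparison with expected_df may report a dtype difference for durationInTicks even when the values match.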