Nested dictionary to a pandas df, concatenating rows

Given the following dictionary:

j = {
  "source": "https://example.com","timestamp": "2021-04-12T19:34:24Z","durationInTicks": 1082400000,"duration": "PT1M48.24S","combinedRecognizedPhrases": [
    {
      "channel": 0,"lexical": "aaa","itn": "aaa","maskedITN": "aaa","display": "aaa"
    }
  ],"recognizedPhrases": [
    {
      "recognitionStatus": "Success","channel": 0,"speaker": 1,"offset": "PT2.18S","duration": "PT3.88S","offsetInTicks": 21800000,"durationInTicks": 38800000,"nBest": [
        {
          "confidence": 0.9306252,"lexical": "gracias por llamar","itn": "gracias por llamar","maskedITN": "gracias por llamar","display": "¿Gracias por llamar","words": [
            {
              "word": "gracias","duration": "PT0.37S","durationInTicks": 3700000,"confidence": 0.930625
            },{
              "word": "por","offset": "PT2.55S","duration": "PT0.18S","offsetInTicks": 25500000,"durationInTicks": 1800000
            },{
              "word": "llamar","offset": "PT2.73S","duration": "PT0.22S","offsetInTicks": 27300000,"durationInTicks": 2200000,"confidence": 0.930625
            }
          ]
        }
      ]
    },{
      "recognitionStatus": "Success","speaker": 2,"offset": "PT6.85S","duration": "PT5.63S","offsetInTicks": 68500000,"durationInTicks": 56300000,"nBest": [
        {
          "confidence": 0.9306253,"lexical": "quiero hacer un pago","itn": "quiero hacer un pago","maskedITN": "quiero hacer un pago","display": "quiero hacer un pago"
        }
      ]
    },{
      "speaker": 2,"offset": "PT13.29S","duration": "PT3.81S","offsetInTicks": 132900000,"durationInTicks": 38100000,"nBest": [
        {
          "confidence": 0.93062526,"lexical": "no sé bien la cantidad","itn": "no sé bien la cantidad","maskedITN": "no sé bien la cantidad","display": "no sé bien la cantidad"
        }
      ]
    }
  ]
}

Goal: get the information of interest into a single row of a df.

What have I done so far?

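import pandas as pd
# One row per nBest entry, carrying file-level and speaker metadata along
df = pd.json_normalize(j,record_path=['recognizedPhrases','nBest'],meta=['source','durationInTicks','duration',['recognizedPhrases','speaker']])
# Join everything each speaker said, then keep one row per speaker
df['speech'] = df.groupby(['source','recognizedPhrases.speaker'])['display'].transform(lambda x : ' '.join(x))
df = df.drop_duplicates(subset=['recognizedPhrases.speaker'])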
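
For reference, here is a minimal sketch of what the `record_path`/`meta` call yields. The `sample` dict is a trimmed, hypothetical stand-in for the dictionary above, keeping only the keys needed here. The result has one row per `nBest` entry, which is why further reshaping is needed to reach a single row.

```python
import pandas as pd

# Trimmed, hypothetical stand-in for the transcript above.
sample = {
    "source": "https://example.com",
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

# record_path walks recognizedPhrases -> nBest; meta copies the
# top-level and phrase-level fields onto every resulting row.
long_df = pd.json_normalize(
    sample,
    record_path=["recognizedPhrases", "nBest"],
    meta=["source", "duration", ["recognizedPhrases", "speaker"]],
)
print(long_df)
```

The nested meta path `['recognizedPhrases','speaker']` becomes the column `recognizedPhrases.speaker`.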

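Why am I not satisfied with the output? My output is a df with two rows (one per recognizedPhrases.speaker). I need all the information merged into a single row, with one column holding what speaker 1 said (from the speaker column) and another holding what speaker 2 said.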

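Additional information: performance is an important factor, since I will be processing thousands of files.

Edit 1: my expected result looks like this: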

expected_dict = {'source': {0: 'https://example.com'},'durationInTicks': {0: 1082400000},'duration': {0: 'PT1M48.24S'},'recognizedPhrases.speaker1': {0: '¿Gracias por llamar'},'recognizedPhrases.speaker2': {0: 'quiero hacer un pago no sé bien la cantidad'}}
expected_df = pd.DataFrame(expected_dict)

Solution

You can pivot() into the expected output:

index = ['source','durationInTicks','duration']
columns = ['recognizedPhrases.speaker']
values= ['speech']

df = df[index+columns+values].pivot(index=index,columns=columns,values=values[0])
df.columns = [f'{df.columns.name}{column}' for column in df.columns]

| source | durationInTicks | duration | recognizedPhrases.speaker1 | recognizedPhrases.speaker2 |
|---|---|---|---|---|
| https://example.com | 1082400000 | PT1M48.24S | ¿Gracias por llamar | quiero hacer un pago no sé bien la cantidad |
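
Since the question mentions thousands of files, here is a hedged sketch of wrapping the same idea in a function that returns one row per transcript and then concatenating the rows. The names `transcript_to_row`, `sample` and `transcripts` are hypothetical, and `sample` is a trimmed stand-in for the dictionary in the question. It uses `groupby(...).agg(' '.join)` followed by `unstack` in place of `transform` + `drop_duplicates` + `pivot`. Nothing here has been benchmarked, so treat it as a starting point and not as a tuned implementation.

```python
import pandas as pd

def transcript_to_row(j):
    """Collapse one transcript dict into a single-row DataFrame (sketch)."""
    index = ["source", "durationInTicks", "duration"]
    speaker = "recognizedPhrases.speaker"
    long_df = pd.json_normalize(
        j,
        record_path=["recognizedPhrases", "nBest"],
        meta=index + [["recognizedPhrases", "speaker"]],
    )
    # Join all phrases of each speaker, then move speakers into columns.
    speech = (
        long_df.groupby(index + [speaker])["display"]
        .agg(" ".join)
        .unstack(speaker)
    )
    speech.columns = [f"{speaker}{s}" for s in speech.columns]
    # reset_index turns source/durationInTicks/duration back into columns.
    return speech.reset_index()

# Trimmed, hypothetical stand-in for the transcript in the question.
sample = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

# A second, made-up transcript to show how the rows stack.
transcripts = [sample, {**sample, "source": "https://example.com/other"}]
rows = pd.concat([transcript_to_row(t) for t in transcripts], ignore_index=True)
print(rows)
```

If a transcript lacks one of the speakers, `pd.concat` fills that speaker's column with NaN for that row.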
