
Removing duplicate data in pandas

How to remove duplicate data in pandas

I have 4 columns of data: A, B, C, D. Some of the data is repeated with the values of A and C swapped. For example, row 1 (P2 XX P6 XX) is repeated in row 5 as P6 XX P2 XX. Can anyone help me remove these duplicate rows from a pandas DataFrame?

Input data

A   B   C   D
P2  XX  P6  XX
P3  XX  P5  XX
P5  XX  P8  XX
P5  XX  P3  XX
P6  XX  P2  XX
P8  XX  P5  XX
P1  LU  P2  LU
P2  LU  P1  LU
P3  LU  P9  LU
P3  LU  P6  LU
P6  LU  P3  LU
P9  LU  P3  LU

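Solution

Assuming the values in columns A and C can be swapped, you can use np.minimum and np.maximum to put each pair into a consistent order, then drop the duplicates: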

import numpy as np

# Put the smaller label of each (A, C) pair in A and the larger one in C
df.A, df.C = np.minimum(df.A, df.C), np.maximum(df.A, df.C)

df.drop_duplicates()
    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
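
For readers who want to reproduce the result above, here is a minimal self-contained sketch. It rebuilds the sample data from the question and applies the same np.minimum / np.maximum idea. Two details are assumptions of this sketch rather than part of the answer: the columns are converted to object arrays with to_numpy(dtype=object) so the comparison does not depend on the pandas string dtype in use, and the result is written to a new frame with assign instead of overwriting df.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    {
        "A": ["P2", "P3", "P5", "P5", "P6", "P8", "P1", "P2", "P3", "P3", "P6", "P9"],
        "B": ["XX"] * 6 + ["LU"] * 6,
        "C": ["P6", "P5", "P8", "P3", "P2", "P5", "P2", "P1", "P9", "P6", "P3", "P3"],
        "D": ["XX"] * 6 + ["LU"] * 6,
    }
)

# Compare as plain Python strings (assumption: object arrays for portability)
a = df["A"].to_numpy(dtype=object)
c = df["C"].to_numpy(dtype=object)

# Smaller label goes to A, larger to C, so (P6, P2) and (P2, P6) become identical
normalized = df.assign(A=np.minimum(a, c), C=np.maximum(a, c))

result = normalized.drop_duplicates()
print(result)
```

The comparison is lexicographic, so "P10" would sort before "P2". That does not affect correctness here, because the only requirement is that both orientations of a pair map to the same ordering.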

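Alternatively, we can use np.sort along axis=1 to sort the values within each row, then call drop_duplicates on the sorted frame. Finally, use the resulting index to filter df: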

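import numpy as np
import pandas as pd

# Sort the values within each row, then keep the index labels of the first occurrences
idx = (
    pd.DataFrame(
        np.sort(df.values, axis=1), index=df.index, columns=df.columns
    ).drop_duplicates().index
)

df = df.loc[idx]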

Or without the intermediate variable:

df = df.loc[
    pd.DataFrame(
        np.sort(df.values, axis=1), index=df.index, columns=df.columns
    ).drop_duplicates().index
]

df

    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
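
Note that np.sort(df.values, axis=1) sorts all four columns of each row together, so the values of B and D take part in the ordering as well. If only A and C should be treated as an unordered pair, one possible variation is sketched below. The helper name drop_unordered_duplicates and its pair_cols parameter are hypothetical, introduced only for illustration, and the letter index in the demo is an assumption used to show that the filter does not rely on a default integer index.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(df, pair_cols):
    """Drop rows that are duplicates once pair_cols are treated as unordered.

    Hypothetical helper: the original rows are returned unchanged, only
    the later occurrence of each duplicate is removed.
    """
    # Sort only the chosen columns within each row
    sorted_pair = np.sort(df[pair_cols].to_numpy(dtype=object), axis=1)

    # Build a comparison key: sorted pair columns plus the untouched columns
    key = df.assign(**{col: sorted_pair[:, i] for i, col in enumerate(pair_cols)})

    # Boolean mask of rows whose key was already seen earlier
    is_dup = key.duplicated().to_numpy()
    return df[~is_dup]


# First six rows of the question's data, with a non-default index (assumption)
df = pd.DataFrame(
    {
        "A": ["P2", "P3", "P5", "P5", "P6", "P8"],
        "B": ["XX"] * 6,
        "C": ["P6", "P5", "P8", "P3", "P2", "P5"],
        "D": ["XX"] * 6,
    },
    index=list("abcdef"),
)

result = drop_unordered_duplicates(df, ["A", "C"])
print(result)
```

Filtering with a boolean mask rather than with index labels also keeps the kept rows in their original orientation, which may or may not be what you want.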
