How to remove duplicate data in pandas
I have 4 columns of data: A, B, C, and D. Some rows are duplicated with their values swapped; for example, row 1 (P2 XX P6 XX) is repeated in row 5 as P6 XX P2 XX. Can anyone help me remove these duplicate rows from a Pandas DataFrame?

The DataFrame:
A B C D
P2 XX P6 XX
P3 XX P5 XX
P5 XX P8 XX
P5 XX P3 XX
P6 XX P2 XX
P8 XX P5 XX
P1 LU P2 LU
P2 LU P1 LU
P3 LU P9 LU
P3 LU P6 LU
P6 LU P3 LU
P9 LU P3 LU
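For reference, the sample frame can be rebuilt like this (a sketch; the column names and values are copied from the table above, and the default 0-based index is what the answer's output refers to):

```python
import pandas as pd

# Reconstruct the sample data from the question.
df = pd.DataFrame(
    {
        "A": ["P2", "P3", "P5", "P5", "P6", "P8", "P1", "P2", "P3", "P3", "P6", "P9"],
        "B": ["XX"] * 6 + ["LU"] * 6,
        "C": ["P6", "P5", "P8", "P3", "P2", "P5", "P2", "P1", "P9", "P6", "P3", "P3"],
        "D": ["XX"] * 6 + ["LU"] * 6,
    }
)
print(df.shape)  # (12, 4)
```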
Solution

Assuming the values in columns A and C are interchangeable, you can use np.minimum and np.maximum to put each pair into a canonical order, then drop the duplicates:

import numpy as np

df['A'], df['C'] = np.minimum(df.A, df.C), np.maximum(df.A, df.C)
df.drop_duplicates()
A B C D
0 P2 XX P6 XX
1 P3 XX P5 XX
2 P5 XX P8 XX
6 P1 LU P2 LU
8 P3 LU P9 LU
9 P3 LU P6 LU
Alternatively, we can use np.sort with axis=1 to sort the values within each row, call drop_duplicates on the sorted frame, and then filter df with the surviving index:
import numpy as np
import pandas as pd

idx = (
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
)
df = df.loc[idx]
Or without the intermediate variable:
df = df.loc[
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
]
df now contains:
A B C D
0 P2 XX P6 XX
1 P3 XX P5 XX
2 P5 XX P8 XX
6 P1 LU P2 LU
8 P3 LU P9 LU
9 P3 LU P6 LU
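A caveat worth noting (my addition, not part of the original answer): np.sort(df.values, axis=1) sorts all four values in each row, so values from B and D get mixed with A and C. That is harmless here because B and D always match each other within a row, but it can create false duplicates on other data. Sorting only the swappable pair avoids that:

```python
import numpy as np
import pandas as pd

# Made-up rows that are NOT A/C swaps of each other, yet hold the
# same multiset of values, so a full-row sort would merge them.
df = pd.DataFrame(
    {"A": ["P1", "P2"], "B": ["P2", "P1"], "C": ["P3", "P4"], "D": ["P4", "P3"]}
)

# Full-row sort: both rows collapse to [P1, P2, P3, P4] -> one false duplicate.
full = pd.DataFrame(np.sort(df.values, axis=1)).duplicated().sum()

# Sort only columns A and C, leaving B and D in place.
key = df.copy()
key[["A", "C"]] = np.sort(df[["A", "C"]].values, axis=1)
safe = key.duplicated().sum()

print(full, safe)  # 1 0
```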