Delete rows from one dataframe based on conditions from another dataframe in Pandas (Python)

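I have two Pandas dataframes in Python, each with millions of rows. I want to remove rows from the first dataframe that contain words from the second dataframe, based on three conditions:

  1. If the word appears at the beginning of the sentence in a row
  2. If the word appears at the end of the sentence in a row
  3. If the word appears in the middle of the sentence in a row (the exact word, not a substring)

Example:

First dataframe: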

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence 

Second dataframe:

Second
forth
fifth

Expected output:

This is the first sentence
This is fifth_sentence 

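Note that both dataframes hold millions of records. How can I process them and export the result as efficiently as possible?

I tried the following, but it takes a very long time: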

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt",sep = ",",header = None)
sentences_file_data = pd.read_csv("setences.txt",sep = ".",header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ",i,"\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None,index = False)

Thanks

Solution

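You can use the numpy.where function to create a column named "remove" that is set to 1 whenever any of the conditions you listed is met. First, create a list of the values in df2.

Condition 1: checks whether the cell value starts with any value in the list.

Condition 2: same as above, but checks whether the cell value ends with any value in the list.

Condition 3: splits each cell on spaces and checks whether any of the resulting tokens is in your list.

After that, you can create the new dataframe by filtering out the rows marked 1: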

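# Imports
import pandas as pd
import numpy as np

# Collect the unique values from df2 in a list
l = list(set(df2['col']))

# Column of sentences to test
c = df['col']

# True where a sentence starts with, ends with, or contains (as a whole
# space-separated token) any value from the list
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | pd.DataFrame(c.str.split(' ').tolist(), index=c.index).isin(l).any(axis=1))

# Assign 1 or 0
df['remove'] = np.where(cond, 1, 0)

# Create the filtered dataframe without the helper column
out = (df[df['remove'] != 1]).drop(['remove'], axis=1)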

Printing out gives:

                          col
0  This is the first sentence
4      This is fifth_sentence
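
Since the question asks about millions of rows, here is an alternative sketch that is not part of the original answer. It assumes that a "word" is a whitespace-separated token, so a word at the start or end of a sentence is already caught by the whole-word test and no separate prefix or suffix check is needed. That also avoids a side effect of startswith and endswith, which match prefixes and suffixes of longer words (for example, "Second" matches a sentence beginning with "Secondary"). The helper name drop_rows_with_words is hypothetical.

```python
import pandas as pd


def drop_rows_with_words(df, words, col="col"):
    """Return the rows of df whose text in `col` shares no token with `words`."""
    bad = set(words)  # set membership is O(1) per token on average
    # str.split() with no argument splits on any run of whitespace,
    # so leading or trailing spaces do not produce empty tokens.
    keep = df[col].map(lambda s: bad.isdisjoint(s.split())).astype(bool)
    return df[keep]


# Sample data from the question
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

out = drop_rows_with_words(df, df2["col"])
print(out)
```

Each sentence is tokenized once and each token is looked up in the set, so the work grows with the total number of tokens rather than with the number of sentences times the number of words. Punctuation attached to a word (such as "forth.") would not match under this assumption and would need extra cleaning.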

References:

Pandas Row Select Where String Starts With Any Item In List

check if a columns contains any str from list

Dataframes used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',1: 'Second this is another sentence',2: 'This is the third sentence forth',3: 'This is fifth sentence',4: 'This is fifth_sentence'}}

>>> df2.to_dict()

{'col': {0: 'Second',1: 'forth',2: 'fifth'}}
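
To try the answer end to end, the two dictionaries can be passed back to pd.DataFrame. The following is a minimal sketch that rebuilds both frames and applies the answer's three conditions directly as a boolean mask. It assumes a pandas version in which Series.str.startswith accepts a tuple of prefixes; the variable names words, tokens and cond are chosen here for illustration.

```python
import pandas as pd

# Rebuild the two dataframes from their to_dict() output
df = pd.DataFrame({'col': {0: 'This is the first sentence',
                           1: 'Second this is another sentence',
                           2: 'This is the third sentence forth',
                           3: 'This is fifth sentence',
                           4: 'This is fifth_sentence'}})
df2 = pd.DataFrame({'col': {0: 'Second', 1: 'forth', 2: 'fifth'}})

words = list(set(df2['col']))
c = df['col']

# One column per space-separated token; shorter sentences are padded with None
tokens = pd.DataFrame(c.str.split(' ').tolist(), index=c.index)

# Conditions 1, 2 and 3 from the answer, combined with OR
cond = (c.str.startswith(tuple(words))
        | c.str.endswith(tuple(words))
        | tokens.isin(words).any(axis=1))

# Keep only the rows where no condition fired
out = df[~cond]
print(out)
```

Filtering with ~cond skips the temporary "remove" column; the rows kept are the same as in the answer.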
