DataFrame: stop searching once a match is found and export the data

How to stop searching a DataFrame once a match is found and export the data

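I have a small program that searches many large files (more than 500,000 lines each) and exports the results to a CSV file. I would like to know whether the search can stop once a specific date has been found in the files. For example, after finding the ini_date value (column 2), e.g. 02/12/2020, the program should stop searching and export the results, which must include the rows that contain "02/12/2020" in column 2 and match the other search criteria.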

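I currently have 73 datalog.log files in the folder, and the number keeps growing. datalog0.log is the oldest file and datalog72.log is the newest; at some point the newest will be datalog73.log (I want the search to start in the newest file). Can this be done with Python alone? If not, I will have to use SQL for it.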

Here is my code:

import pandas as pd
from glob import glob

files = glob('C:/ProgramA/datalog*.log')
df = pd.concat([pd.read_csv(f, low_memory=False,
                  sep=',', names=["0","1","2","3","4","5","6","7"]) for f in files])


#Column 0: IP
#Column 1: User
#Column 2: Date
#Column 3: Hour

ip = input('Optional - Set IP: ')  #column 0
user = input('Optional - Set User: ')     #column 1
ini_date = input('Mandatory - From Day (Format MM/DD/YYYY): ')
fin_date = input('Mandatory - To Day (Format MM/DD/YYYY): ')
ini_hour = input('Mandatory - From Hour (Format 00:00:00): ')
fin_hour = input('Mandatory - To Hour (Format 00:00:00): ')

if ip == '' and user == '':
    df1 = df[(df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif ip == '':
    df1 = df[(df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif user == '':
    df1 = df[(df["0"] == ip) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
else:
    df1 = df[(df["0"] == ip) & (df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]

df1.to_csv ('C:/ProgramA/result.csv',index = False) 

Thanks.


The logs look like the following example.

Yes, the logs are sequential and look like this:

File0:
        1.1.1.1      user       09/24/2020       09:18:00    Other data...................
        1.1.1.1      user       09/24/2020       10:00:00    Other data...................
        1.1.1.1      user       09/25/2020       07:30:00    Other data...................
        1.1.1.1      user       09/25/2020       09:30:00    Other data...................

File1:
        1.1.1.1      user       09/26/2020       04:18:00    Other data...................
        1.1.1.1      user       09/26/2020       10:00:00    Other data...................
        1.1.1.1      user       09/26/2020       11:18:00    Other data...................
        1.1.1.1      user       09/26/2020       12:00:00    Other data...................

File2:
        1.1.1.1      user       09/26/2020       14:18:00    Other data...................
        1.1.1.1      user       09/27/2020       16:00:00    Other data...................
        1.1.1.1      user       09/28/2020       10:18:00    Other data...................
        1.1.1.1      user       09/29/2020       12:00:00    Other data...................

So if I search with ini_date >= "09/27/2020" and fin_date <= "09/28/2020", the result should be:

        1.1.1.1      user       09/27/2020       16:00:00    Other data...................
        1.1.1.1      user       09/28/2020       10:18:00    Other data...................

Solution

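import glob
import os
import pandas as pd

list_of_files = glob.glob('/path/to/folder/*')

# Sort files by creation time, newest first
sorted_file_names = sorted(list_of_files, key=os.path.getctime, reverse=True)

# ini_date and fin_date are the search bounds read with input() in the question
columns = ["0", "1", "2", "3", "4", "5", "6", "7"]

rows_found = False
matches = []
for file in sorted_file_names:
    # Use the same column names as the question so that df['2'] is the date column
    df = pd.read_csv(file, sep=',', names=columns, low_memory=False)

    # {Perform required operations}

    # Fetch the rows whose date falls inside the requested range
    df1 = df.loc[(df['2'] <= fin_date) & (df['2'] >= ini_date)]

    if not df1.empty:
        rows_found = True
        matches.append(df1)
    elif rows_found:
        # No rows in this file, but newer files had some: the range is exhausted
        break

# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
final_df = pd.concat(matches) if matches else pd.DataFrame(columns=columns)
final_df.to_csv("Name.csv")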
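The loop above orders files by creation time. Since the question states that the number in the file name reflects age (datalog0.log oldest, datalog72.log newest), ordering by that number may be more robust, for example if files are copied and their creation times change. A minimal sketch, assuming every file follows the `datalog<N>.log` pattern; `log_number` and `newest_first` are hypothetical helper names, not part of the original answer:

```python
import re


def log_number(path):
    """Return the integer N from a name like 'datalog72.log' (assumed naming scheme)."""
    match = re.search(r'datalog(\d+)\.log$', path)
    if match is None:
        raise ValueError(f"unexpected log file name: {path}")
    return int(match.group(1))


def newest_first(paths):
    """Sort paths so that the highest-numbered (newest) log comes first."""
    return sorted(paths, key=log_number, reverse=True)


# Illustrative paths only; in practice these would come from glob
files = [
    'C:/ProgramA/datalog9.log',
    'C:/ProgramA/datalog72.log',
    'C:/ProgramA/datalog10.log',
]
ordered = newest_first(files)
print(ordered)
```

A plain string sort would place datalog9.log after datalog72.log, which is why the number is extracted and compared as an integer.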

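The answer given by @Shradha should find and fetch all entries for the dates you are searching for. Once you have all of those entries, you can apply the other filters to that subset separately, which saves computation and time.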
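Applying the remaining filters to the date subset could look roughly like the sketch below. It builds one boolean mask and adds the IP and user conditions only when they were supplied, instead of the four-way if/elif in the question. The column names "0" to "3" follow the question; the function name `apply_optional_filters` and the sample rows are made up for illustration.

```python
import pandas as pd


def apply_optional_filters(df, ip="", user="", ini_hour="00:00:00", fin_hour="23:59:59"):
    """Filter by hour range, and by IP and user only when they are non-empty."""
    # Zero-padded HH:MM:SS strings sort the same way as the times they represent
    mask = (df["3"] >= ini_hour) & (df["3"] <= fin_hour)
    if ip:
        mask &= df["0"] == ip
    if user:
        mask &= df["1"] == user
    return df[mask]


# Hypothetical sample standing in for the rows already selected by date
sample = pd.DataFrame({
    "0": ["1.1.1.1", "2.2.2.2", "1.1.1.1"],
    "1": ["user", "user", "admin"],
    "2": ["09/27/2020", "09/27/2020", "09/28/2020"],
    "3": ["16:00:00", "08:00:00", "10:18:00"],
})

filtered = apply_optional_filters(sample, ip="1.1.1.1", ini_hour="09:00:00", fin_hour="18:00:00")
print(filtered)
```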

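Initially I thought that setting the date as the index of the DataFrame would reduce the time needed to look up log entries, but I was wrong. A boolean mask works faster than index look-up.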

import time

import numpy as np
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv('~/Documents/tmp.csv', names=["version", "user", "date", "time", "data1", "data2"])
    df.set_index('date', inplace=True)
    # The log dates are MM/DD/YYYY, which pandas parses month-first by default
    df.index = pd.to_datetime(df.index)
    print(df.loc[pd.Timestamp('2020-09-27')])
    print('############################')

    date_index = pd.date_range(start='1/1/1850', periods=100000)  # 100000 entries
    some_data = pd.Series(np.random.randint(1, 100, size=date_index.shape))
    df = pd.DataFrame(data={'some_data': some_data})
    df.index = date_index
    # Five copies, matching the (500000, 1) shape reported in the output
    df = pd.concat([df] * 5)
    print('shape of df is: ', df.shape)

    # Look-up through the DatetimeIndex
    print('Through direct indexing')
    start = time.time()
    print(df.loc[pd.Timestamp('2020-03-14')])
    end = time.time()
    print("time taken is: ", end - start)
    print('############################')

    # Look-up through a boolean mask on an ordinary column
    df.reset_index(inplace=True)
    df.columns = ['my_index', 'some_data']
    print('Using boolean mask')
    start = time.time()
    print(df.loc[df['my_index'] == pd.Timestamp('2020-03-14')])
    end = time.time()
    print("time taken is: ", end - start)
    print('############################')

Output

version     1.1.1.1
user           user
time       16:00:00
data1         Other
data2          data
Name: 2020-09-27 00:00:00, dtype: object
############################
shape of df is:  (500000, 1)
Through direct indexing
            some_data
2020-03-14         93
2020-03-14         93
2020-03-14         93
2020-03-14         93
2020-03-14         93
time taken is:  0.0407321453094
############################
Using boolean mask
         my_index  some_data
62164  2020-03-14         93
162164 2020-03-14         93
262164 2020-03-14         93
362164 2020-03-14         93
462164 2020-03-14         93
time taken is:  0.00653505325317
############################
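One caveat about the date masks: the question compares MM/DD/YYYY values as strings, which only orders correctly within a single year, because the month comes first in the string. The sketch below shows the difference on a range that crosses a year boundary and parses the column with `pd.to_datetime` before building the mask. The sample dates are made up for illustration, and it assumes the column holds clean MM/DD/YYYY strings.

```python
import pandas as pd

# Hypothetical date column ("2" as in the question) and a range crossing New Year
df = pd.DataFrame({"2": ["12/30/2020", "01/02/2021", "06/15/2020"]})
ini_date, fin_date = "12/01/2020", "01/31/2021"

# String comparison: no value is both >= "12/..." and <= "01/...", so nothing matches
string_mask = (df["2"] >= ini_date) & (df["2"] <= fin_date)

# Datetime comparison: parse once, then compare real dates
dates = pd.to_datetime(df["2"], format="%m/%d/%Y")
start = pd.to_datetime(ini_date, format="%m/%d/%Y")
end = pd.to_datetime(fin_date, format="%m/%d/%Y")
date_mask = (dates >= start) & (dates <= end)

print(df[string_mask])  # empty
print(df[date_mask])    # the 12/30/2020 and 01/02/2021 rows
```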
