How to stop searching a DataFrame once a match is found and export the data
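I have a small program that searches many large files (each with more than 500,000 lines) and exports the results to a CSV file. I would like to know whether the search can stop once a specific date has been found in the files. For example, once the ini_date value (column 2), such as 02/12/2020, has been found, the program should stop searching and export the results, which include the rows that contain "02/12/2020" in column 2 and match the other search criteria.
There are currently 73 datalog.log files in my folder, and the number keeps growing. datalog0.log is the oldest file and datalog72.log is the newest; at some point the newest will be datalog73.log (I want to start the search in the newest file). Is this possible with Python alone? If not, I will have to use SQL for this.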
import pandas as pd
from glob import glob
files = glob('C:/ProgramA/datalog*.log')
df = pd.concat([pd.read_csv(f, low_memory=False,
                            sep=',', names=["0", "1", "2", "3", "4", "5", "6", "7"]) for f in files])
#Column 0: IP
#Column 1: User
#Column 2: Date
#Column 3: Hour
ip = input('Optional - Set IP: ') #column 0
user = input('Optional - Set User: ') #column 1
ini_date = input('Mandatory - From Day (Format MM/DD/YYYY): ')
fin_date = input('Mandatory - To Day (Format MM/DD/YYYY): ')
ini_hour = input('Mandatory - From Hour (Format 00:00:00): ')
fin_hour = input('Mandatory - To Hour (Format 00:00:00): ')
if ip == '' and user == '':
    df1 = df[(df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif ip == '':
    df1 = df[(df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
elif user == '':
    df1 = df[(df["0"] == ip) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
else:
    df1 = df[(df["0"] == ip) & (df["1"] == user) & (df["2"] >= ini_date) & (df["2"] <= fin_date) & (df["3"] >= ini_hour) & (df["3"] <= fin_hour)]
df1.to_csv('C:/ProgramA/result.csv', index=False)
Thanks.
The logs look like the example below.
Yes, the logs are sequential and look like this:
File0:
1.1.1.1 user 09/24/2020 09:18:00 Other data...................
1.1.1.1 user 09/24/2020 10:00:00 Other data...................
1.1.1.1 user 09/25/2020 07:30:00 Other data...................
1.1.1.1 user 09/25/2020 09:30:00 Other data...................
File1:
1.1.1.1 user 09/26/2020 04:18:00 Other data...................
1.1.1.1 user 09/26/2020 10:00:00 Other data...................
1.1.1.1 user 09/26/2020 11:18:00 Other data...................
1.1.1.1 user 09/26/2020 12:00:00 Other data...................
File2:
1.1.1.1 user 09/26/2020 14:18:00 Other data...................
1.1.1.1 user 09/27/2020 16:00:00 Other data...................
1.1.1.1 user 09/28/2020 10:18:00 Other data...................
1.1.1.1 user 09/29/2020 12:00:00 Other data...................
So, if I filter with ini_date >= "09/27/2020" and fin_date <= …, the result should be:
1.1.1.1 user 09/27/2020 16:00:00 Other data...................
1.1.1.1 user 09/28/2020 10:18:00 Other data...................
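One caveat about the range filter above: the question's code compares the dates as MM/DD/YYYY strings, and such strings sort by month before year, so a range that crosses a year boundary can return the wrong rows. A minimal sketch, assuming the column always uses the MM/DD/YYYY format (the sample values are made up for illustration), parses the dates before comparing them:

```python
import pandas as pd

# Made-up sample values in the log's MM/DD/YYYY format.
dates = pd.Series(["12/30/2020", "01/02/2021", "09/27/2020"])

# String comparison orders by month first, so January 2021
# looks "earlier" than December 2020.
as_text = dates >= "12/01/2020"

# Parsing with an explicit format gives a chronological comparison.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
as_dates = parsed >= pd.Timestamp("2020-12-01")

print(as_text.tolist())
print(as_dates.tolist())
```

Within a single calendar year the string comparison happens to work, which is probably why the original filter looks correct on the sample logs.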
Solution
import glob
import os
import pandas as pd
list_of_files = glob.glob('/path/to/folder/*')
# Sort files by creation time, newest first
sorted_file_names = sorted(list_of_files, key=os.path.getctime, reverse=True)
rows_found = False
matches = []
for file in sorted_file_names:
    df = pd.read_csv(file)
    # {Perform required operations}
    # Fetch the required rows (ini_date and fin_date come from the question's inputs)
    df1 = df.loc[(df['2'] <= fin_date) & (df['2'] >= ini_date)]
    # Stop once the current file has no matching rows but a previous file did
    if not df1.empty:
        rows_found = True
        matches.append(df1)
    elif rows_found:
        break
# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
final_df = pd.concat(matches) if matches else pd.DataFrame()
final_df.to_csv("Name.csv")
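A variation on the loop above, offered as a sketch rather than a definitive implementation. It assumes the file number in the name (datalog0.log, datalog72.log) reflects age, that the files are comma-separated with the eight columns named in the question, and that dates are MM/DD/YYYY. The helper names `file_number` and `search_newest_first` are hypothetical, and the demo files are made up to mirror the sample logs.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["0", "1", "2", "3", "4", "5", "6", "7"]  # same names as in the question


def file_number(path):
    """Return the trailing number of a name such as datalog72.log (hypothetical helper)."""
    match = re.search(r"(\d+)$", Path(path).stem)
    return int(match.group(1)) if match else -1


def search_newest_first(paths, ini_date, fin_date):
    """Collect rows dated within [ini_date, fin_date], reading the newest files first."""
    start = pd.to_datetime(ini_date, format="%m/%d/%Y")
    end = pd.to_datetime(fin_date, format="%m/%d/%Y")
    parts = []
    for path in sorted(paths, key=file_number, reverse=True):
        df = pd.read_csv(path, sep=",", names=COLUMNS, dtype=str)
        dates = pd.to_datetime(df["2"], format="%m/%d/%Y")
        parts.append(df[(dates >= start) & (dates <= end)])
        # The logs are chronological, so once a file begins before ini_date
        # every older file lies entirely outside the range.
        if not dates.empty and dates.min() < start:
            break
    if not parts:
        return pd.DataFrame(columns=COLUMNS)
    # Reverse so the result runs oldest to newest, like the logs.
    return pd.concat(parts[::-1], ignore_index=True)


# Demo with made-up files that mirror the sample logs in the question.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "datalog0.log": ["09/24/2020", "09/25/2020"],
        "datalog1.log": ["09/26/2020", "09/26/2020"],
        "datalog2.log": ["09/26/2020", "09/27/2020", "09/28/2020", "09/29/2020"],
    }
    for name, days in samples.items():
        lines = [f"1.1.1.1,user,{day},10:00:00,a,b,c,d" for day in days]
        Path(folder, name).write_text("\n".join(lines) + "\n")
    result = search_newest_first(Path(folder).glob("datalog*.log"), "09/27/2020", "09/28/2020")

print(result["2"].tolist())
```

Sorting by the number in the file name avoids relying on creation time, which can change when files are copied. In the demo only datalog2.log is read, because its earliest date already precedes ini_date.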
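The answer given by @Shradha should find all entries for the dates you are searching for. Once you have those entries, you can apply the other filters to just that subset, which saves computation and time.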
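Applying the optional filters to the date subset could look like the sketch below. It assumes the question's layout (IP in column "0", user in column "1") and treats an empty string as "no filter", like the question's prompts do. The function name `apply_optional_filters` and the sample rows are made up for illustration.

```python
import pandas as pd


def apply_optional_filters(df, ip="", user=""):
    """Keep rows matching ip (column "0") and user (column "1"); empty strings mean no filter."""
    mask = pd.Series(True, index=df.index, dtype=bool)
    if ip:
        mask &= df["0"] == ip
    if user:
        mask &= df["1"] == user
    return df[mask]


# Made-up rows standing in for the date-filtered subset.
subset = pd.DataFrame({
    "0": ["1.1.1.1", "2.2.2.2", "1.1.1.1"],
    "1": ["alice", "alice", "bob"],
    "2": ["09/27/2020", "09/27/2020", "09/28/2020"],
})
print(apply_optional_filters(subset, user="alice"))
```

Building the mask step by step replaces the four-branch if/elif chain from the question, and adding another optional criterion only takes one more `if`.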
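Initially I thought that setting the date as the DataFrame's index would reduce the time needed to look up log entries, but I was wrong. In my test, a boolean mask was faster than the index lookup.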
import pandas as pd
import numpy as np
import time
if __name__ == '__main__':
    df = pd.read_csv('~/Documents/tmp.csv', names=["version", "user", "date", "time", "data1", "data2"])
    df.set_index('date', inplace=True)
    # The log dates are MM/DD/YYYY, so an explicit format is used instead of dayfirst=True
    df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
    # pd.Timestamp is used for lookups; recent pandas no longer matches datetime.date against datetime64 values
    print(df.loc[pd.Timestamp(2020, 9, 27)])
    print('############################')
    date_index = pd.date_range(start='1/1/1850', periods=100000)  # 100000 entries
    some_data = np.random.randint(1, 100, size=date_index.shape)
    df = pd.DataFrame(data={'some_data': some_data}, index=date_index)
    # Five copies in total, matching the (500000, 1) shape reported in the output
    df = pd.concat([df] * 5)
    print('shape of df is: ', df.shape)
    print('############################')
    print('Through direct indexing')
    start = time.time()
    print(df.loc[pd.Timestamp(2020, 3, 14)])
    end = time.time()
    print("time taken is: ", end - start)
    print('############################')
    print('Using boolean mask')
    df.reset_index(inplace=True)
    df.columns = ['my_index', 'some_data']
    start = time.time()
    print(df.loc[df['my_index'] == pd.Timestamp(2020, 3, 14)])
    end = time.time()
    print("time taken is: ", end - start)
    print('############################')
Output
version 1.1.1.1
user user
time 16:00:00
data1 Other
data2 data
Name: 2020-09-27 00:00:00, dtype: object
############################
shape of df is: (500000,1)
############################
Through direct indexing
some_data
2020-03-14 93
2020-03-14 93
2020-03-14 93
2020-03-14 93
2020-03-14 93
time taken is: 0.0407321453094
############################
Using boolean mask
my_index some_data
62164 2020-03-14 93
162164 2020-03-14 93
262164 2020-03-14 93
362164 2020-03-14 93
462164 2020-03-14 93
time taken is: 0.00653505325317
############################
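One possible reason the index lookup was slower in this run: the benchmark's index holds every date five times and is not sorted, so pandas cannot locate the label with a binary search. The sketch below is not the answerer's code. It rebuilds the same kind of frame, sorts the index, and selects one day with a label slice; no timings are claimed, so measure on your own data.

```python
import numpy as np
import pandas as pd

date_index = pd.date_range(start="1/1/1850", periods=100000)
base = pd.DataFrame(
    {"some_data": np.random.randint(1, 100, size=len(date_index))},
    index=date_index,
)
df = pd.concat([base] * 5)  # five copies, so every date appears five times

# After sorting, the index is monotonic and label slices can be
# resolved by searching the sorted index instead of scanning it.
sorted_df = df.sort_index()
target = pd.Timestamp("2020-03-14")
day_rows = sorted_df.loc[target:target]  # label slices include both ends

print(sorted_df.index.is_monotonic_increasing)
print(day_rows.shape)
```

For the log files in the question the rows are already chronological, so a date index built from them may be monotonic without an explicit sort.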