How to solve: Python - ignoring strings that cannot be parsed
I'm using pandas to parse a text file that contains strings like the following:
May 6, 2021 12:40:05 AM CEST INFO [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
May 6, 2021 9:12:17 AM CEST FINE [com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
at com.noelios.restlet.ext.net.HttpUrlConnectionCall.getStatusCode(HttpUrlConnectionCall.java:299)
at com.noelios.restlet.http.HttpClientCall.sendRequest(HttpClientCall.java:173)
at com.noelios.restlet.ext.net.HttpUrlConnectionCall.sendRequest(HttpUrlConnectionCall.java:183)
at com.noelios.restlet.http.HttpClientConverter.commit(HttpClientConverter.java:109)
at com.noelios.restlet.http.HttpClientHelper.handle(HttpClientHelper.java:88)
at org.restlet.Client.handle(Client.java:120)
at org.restlet.Uniform.handle(Uniform.java:106)
at com.boomi.container.core.MessagePollerThread.run(MessagePollerThread.java:273)
at java.lang.Thread.run(Thread.java:748)
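Because the file has no column headers or delimiters and its values have variable widths, I read it line by line with str.strip() and then create a new file with column headers and comma separation. Before writing the output file, I also use dateutil.parser.parse to convert the date strings to date objects: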
import dateutil.parser
import pandas as pd

data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        # The first six tokens form the timestamp; the rest is level + message.
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})
df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
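However, for lines that start with some other string instead of a date (e.g. java.net.Socket...), I get an error when parsing, because the text cannot be parsed as a date, which is expected. How can I get past this? If a string can be parsed, I want to parse it; otherwise I want to ignore it and do nothing. I tried the following, but as soon as it reaches the except block, the whole output file is affected.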
try:
    df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
except Exception as e:
    pass
LogDate,LogStatus,LogInfo
"May 6, 2021 12:40:05 AM CEST",INFO,[com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
"May 6, 2021 9:12:17 AM CEST",FINE,[com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.SocketTimeoutException: Read timed out,out,out
at java.net.SocketInputStream.socketRead0(Native Method),Method),Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116),java.
What am I missing here?
Solution
You can try this:
import dateutil.parser
import pandas as pd

months = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)
data = []
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        # Add a new condition: skip lines that do not start with a month name
        # (str.startswith accepts a tuple of prefixes).
        if not line or not line.startswith(months):
            continue
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})
df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
print(df)
# Outputs
#               LogDate LogStatus                                            LogInfo
# 0 2021-05-06 00:40:05      INFO  [com.purge.PurgeManager run] PURGE: Purge all ...
# 1 2021-05-06 09:12:17      FINE  [com.noelios.restlet.http.HttpClientCall sendR...
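As for why the try/except in the question did not help: it wraps the whole apply call, so the first unparseable value raises, the assignment to df["LogDate"] never happens, and the column keeps its original strings for every row. An alternative to the month-name check is to attempt the parse per line and skip any line that fails. The sketch below is only an illustration under assumptions: it uses the standard library's datetime.strptime instead of dateutil, it assumes every timestamp has exactly the shape shown in the sample ("May 6, 2021 12:40:05 AM CEST"), and parse_log_line and TIMESTAMP_FORMAT are hypothetical names that do not come from the original post.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the sample lines: "May 6, 2021 12:40:05 AM"
TIMESTAMP_FORMAT = "%B %d, %Y %I:%M:%S %p"


def parse_log_line(line):
    """Return a record dict, or None if the line has no leading timestamp."""
    parts = line.strip().split(maxsplit=7)
    if len(parts) < 7:
        return None  # too short to hold timestamp, time zone and level
    try:
        # parts[5] is the time zone abbreviation (e.g. CEST); it is dropped,
        # similar in effect to ignoretz=True in the original code.
        logdate = datetime.strptime(" ".join(parts[:5]), TIMESTAMP_FORMAT)
    except ValueError:
        return None  # not a timestamp, e.g. a stack-trace line
    loginfo = parts[7] if len(parts) > 7 else ""
    return {"LogDate": logdate, "LogStatus": parts[6], "LogInfo": loginfo}


sample = [
    "May 6, 2021 12:40:05 AM CEST INFO [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)",
    "java.net.SocketTimeoutException: Read timed out",
    "at java.net.SocketInputStream.socketRead0(Native Method)",
]
# Keep only the lines that parsed; the stack-trace lines are ignored.
records = [rec for rec in map(parse_log_line, sample) if rec is not None]
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame just like data in the code above. The same per-line try/except pattern should also work with dateutil.parser.parse if the timestamp format is less regular.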