Python - Ignoring unparseable strings

How to ignore unparseable strings in Python

I have a text file that I parse with pandas. It contains lines like the following:

May 6, 2021 12:40:05 AM CEST INFO    [com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
May 6, 2021 9:12:17 AM CEST FINE    [com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.socketTimeoutException: Read timed out
    at java.net.socketInputStream.socketRead0(Native Method)
    at java.net.socketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.socketInputStream.read(SocketInputStream.java:171)
    at java.net.socketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
    at com.noelios.restlet.ext.net.HttpUrlConnectionCall.getStatusCode(HttpUrlConnectionCall.java:299)
    at com.noelios.restlet.http.HttpClientCall.sendRequest(HttpClientCall.java:173)
    at com.noelios.restlet.ext.net.HttpUrlConnectionCall.sendRequest(HttpUrlConnectionCall.java:183)
    at com.noelios.restlet.http.HttpClientConverter.commit(HttpClientConverter.java:109)
    at com.noelios.restlet.http.HttpClientHelper.handle(HttpClientHelper.java:88)
    at org.restlet.Client.handle(Client.java:120)
    at org.restlet.Uniform.handle(Uniform.java:106)
    at com.boomi.container.core.MessagePollerThread.run(MessagePollerThread.java:273)
    at java.lang.Thread.run(Thread.java:748)

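Because the file has no column headers, no delimiters, and values of varying width, I read it line by line with str.strip() and then write a new comma-separated file with column headers. Before writing the output file, I also use dateutil.parser.parse to convert the date strings to datetime objects: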

import dateutil.parser
import pandas as pd

data = []
# inputFile is the path to the log file, defined elsewhere
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        # The first six tokens are taken as the date; the rest is status plus message
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})

df = pd.DataFrame(data)

df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)

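However, lines that start with something other than a date (for example java.net.socket...) raise an error when I try to parse them, which is expected, since they cannot be parsed. How can I get past this? If a string can be parsed, I want it parsed; otherwise it should be ignored and left alone. I tried the following, but once execution reaches the except block, the whole output file is affected: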

try:
    df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)
except Exception as e:
    pass

Output file

LogDate,LogStatus,LogInfo
"May 6, 2021 12:40:05 AM CEST",INFO,[com.purge.PurgeManager run] PURGE: Purge all data beginning (1 threads)
"May 6, 2021 9:12:17 AM CEST",FINE,[com.noelios.restlet.http.HttpClientCall sendRequest] An error occured during the communication with the remote HTTP server.
java.net.socketTimeoutException: Read timed out,out,out
at java.net.socketInputStream.socketRead0(Native Method),Method),Method)
at java.net.socketInputStream.socketRead(SocketInputStream.java:116),java.

What am I missing here?

Solution
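
On why the try/except attempt changed nothing: Series.apply raises as soon as dateutil.parser.parse meets the first unparseable value, so the assignment to df["LogDate"] never runs and every row keeps its original string. Catching the error per value avoids that. The sketch below is only an illustration of that idea. The names safe_parse and parse_log_date are hypothetical, and parse_log_date is a standard-library stand-in for dateutil.parser.parse(..., ignoretz=True) so that the example runs without third-party packages. It assumes dates written as "May 6, 2021 12:40:05 AM CEST".

```python
from datetime import datetime


def parse_log_date(text):
    """Stand-in parser (hypothetical) for 'May 6, 2021 12:40:05 AM CEST'."""
    # Drop the trailing zone abbreviation, similar in spirit to ignoretz=True.
    stamp, _zone = text.rsplit(maxsplit=1)
    return datetime.strptime(stamp, "%B %d, %Y %I:%M:%S %p")


def safe_parse(text, parser):
    """Return parser(text), or None when the text cannot be parsed as a date."""
    try:
        return parser(text)
    except (ValueError, OverflowError):
        # dateutil reports failures with ValueError (or a subclass) and
        # OverflowError, so the same handler should fit the real parser too.
        return None


values = [
    "May 6, 2021 12:40:05 AM CEST",
    "java.net.socketTimeoutException: Read timed out",
]
parsed = [safe_parse(value, parse_log_date) for value in values]
print(parsed)
```

With pandas, the same helper could probably be used as df["LogDate"].apply(safe_parse, parser=...), after which rows whose date is missing can be dropped. This keeps one bad row from cancelling the conversion of all the others.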

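You can skip every line that does not start with a month name, so that only real log records reach the parser. Note that str.startswith accepts a tuple of prefixes: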

import dateutil.parser
import pandas as pd

months = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
data = []
# inputFile is the path to the log file, defined elsewhere
with open(inputFile, "r") as f_in:
    for line in map(str.strip, f_in):
        # New condition: skip blank lines and lines that do not start with a month name
        if not line or not line.startswith(months):
            continue
        line = line.split(maxsplit=6)
        logdate = " ".join(line[:6])
        logstatus = line[-1].split(maxsplit=1)[0]
        loginfo = line[-1].split(maxsplit=1)[-1]
        data.append({"LogDate": logdate, "LogStatus": logstatus, "LogInfo": loginfo})

df = pd.DataFrame(data)
df["LogDate"] = df["LogDate"].apply(dateutil.parser.parse, ignoretz=True)

print(df)
# Outputs
              LogDate LogStatus                                            LogInfo
0 2021-05-06 00:40:05      INFO  [com.purge.PurgeManager run] PURGE: Purge all ...
1 2021-05-06 09:12:17      FINE  [com.noelios.restlet.http.HttpClientCall sendR...
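
The month tuple above only matches full English month names. If the log abbreviates months (an assumption, since the sample only shows "May", which reads the same either way), a check on the shape of the timestamp may be more robust. The following is a minimal sketch of that variation, not part of the original answer. TIMESTAMP_PREFIX and is_log_record are hypothetical names, and the pattern is inferred from the two sample lines only.

```python
import re

# Month word, day, year, 12-hour time and AM/PM marker at the start of a line.
# The optional space after the comma accepts both "6,2021" and "6, 2021".
TIMESTAMP_PREFIX = re.compile(
    r"^[A-Z][a-z]{2,8} \d{1,2}, ?\d{4} \d{1,2}:\d{2}:\d{2} [AP]M\b"
)


def is_log_record(line):
    """Return True when the line starts with a timestamp like the samples."""
    return bool(TIMESTAMP_PREFIX.match(line))


sample = [
    "May 6, 2021 12:40:05 AM CEST INFO    [com.purge.PurgeManager run] PURGE",
    "Sep 12,2021 9:12:17 PM CEST FINE    [some.Class method] message",
    "java.net.socketTimeoutException: Read timed out",
    "at java.lang.Thread.run(Thread.java:748)",
]
kept = [line for line in sample if is_log_record(line)]
print(len(kept))
```

In the reading loop, the condition would then become: if not line or not is_log_record(line): continue. Stack trace lines are still dropped, exactly as with the month tuple.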
