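How to scrape a PDF with nested information in R
I am trying to scrape a rather difficult PDF in R using pdftools::pdf_text and tabulizer::extract_tables. However, given the nature of the PDF, neither seems to help much in my case. The PDF contains "nested" information, as shown in the figure.
What is the best way to approach this? Using stringr::str_split_fixed with n = 3 to split on whitespace gives me a matrix, but writing regular expressions to pick out the information I want from each column (only what follows the description and the incident date/time) seems too hard.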
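For illustration, here is a rough reconstruction of that attempt. The input line is hypothetical (it is not copied from the PDF), and the exact split pattern used in the question is an assumption. It shows why a fixed three-way split is awkward: multi-word fields such as dates get cut apart, and everything after the second split lands in the last column.

```r
library(stringr)

# Hypothetical line; the real PDF text may be laid out differently.
line <- "20210101-001   January 01,2021 1/1/2021 10:28:00AM Inactive"

# Assumed form of the attempt: split on runs of whitespace into 3 pieces.
# The third piece receives whatever is left after the second split.
parts <- str_split_fixed(line, "\\s+", n = 3)
print(parts)
```

The result is a 1 x 3 character matrix in which "January" sits alone in the second column and the rest of the line is lumped together in the third.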
Solution
I don't think the regex approach is all that complicated:
library(pdftools)
library(tidyverse)
library(magrittr)  # provides add()

mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
pdf.text <- pdf_text(mylog)  # one character string per page

map_dfr(pdf.text, ~ {
  # Split the page into individual lines
  vectors <- str_split(.x, "\\n") %>% unlist()
  # Each header line ("Case...", "Desc...", "Addr...") is followed by the line holding its values
  case.idx <- vectors %>% str_detect("^Case") %>% which() %>% add(1)
  desc.idx <- vectors %>% str_detect("^Desc") %>% which() %>% add(1)
  addr.idx <- vectors %>% str_detect("^Addr") %>% which() %>% add(1)
  # Split case lines on 2+ spaces, on a space before a date (d/ or dd/), or on whitespace after AM/PM
  cases <- vectors[case.idx] %>%
    str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
    map_dfr(~ setNames(., c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")[seq_along(.)]))
  # Description lines only need splitting on runs of 2+ spaces
  descriptions <- vectors[desc.idx] %>%
    str_split("\\s{2,}") %>%
    map_dfr(~ setNames(., c("Description", "Date.Incident.End")[seq_along(.)]))
  # Combine the three pieces into one row per case
  bind_cols(cases, descriptions, data.frame(Address = vectors[addr.idx]))
})
# A tibble: 155 x 7
   Case.Number  Date.Report    Date.Incident     Case.Status Description                   Date.Incident.End   Address
   <chr>        <chr>          <chr>             <chr>       <chr>                         <chr>               <chr>
 1 20210101-001 January 01,20… 1/1/2021 10:28:0… Inactive    COMPLAINT ANIMAL              1/1/2021 10:28:00AM UREC FIELDS
 2 20210101-002 January 01,20… 1/1/2021 2:48:00… Inactive    911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM  PMAC
 3 20210101-003 January 01,20… 1/1/2021 3:27:00… Pending     UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM  COMPANION ANIMAL AL…
 4 20210102-001 January 02,20… 1/2/2021 5:12:00… Inactive    SUSPICIOUS INCIDENT           1/2/2021 5:12:00PM  TIGER STADIUM
 5 20210103-001 January 03,20… 12/23/2020 12:00… Pending     HIT AND RUN                   1/3/2021 9:15:00AM  BROUSSARD HALL TRAF…
 6 20210103-002 January 03,20… 1/3/2021 9:28:46… Inactive    DISTURBANCE                   1/3/2021 9:28:00PM  VET SCHOOL
 7 20210104-001 January 04,20… 11/23/2018 11:00… Inactive    NONCRIMINAL INFORMATION ONLY  11/23/2018 11:00:0… Oaks Lot
 8 20210104-002 January 04,20… 1/4/2021 7:26:00… Inactive    SUSPICIOUS INCIDENT           1/4/2021 7:26:00AM  ECE
 9 20210104-003 January 04,20… 8/1/2017 12:00:0… Pending     INVESTIGATN - INVESTIGATION   1/2/2021 3:00:00PM  EAST CAMPUS APARTME…
10 20210104-004 January 04,20… 1/4/2021 12:30:0… Pending     HIT AND RUN                   1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
# … with 145 more rows
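To see how the two ideas in the answer work (reading the line after each header, and splitting with lookarounds), here is a minimal offline sketch. The page text below is hypothetical: the header wording and spacing are assumptions modeled on the printed output, not copied from the PDF, and the variable names are illustrative.

```r
library(stringr)

# Hypothetical page text; only the "Case"/"Desc"/"Addr" line prefixes
# are taken from the answer, the rest of the layout is assumed.
page <- paste(
  "Case Number    Date Reported    Date/Time Occurred    Status",
  "20210101-001   January 01,2021 1/1/2021 10:28:00AM Inactive",
  "Description    End Date/Time",
  "COMPLAINT ANIMAL      1/1/2021 10:28:00AM",
  "Address",
  "UREC FIELDS",
  sep = "\n"
)

lines <- unlist(str_split(page, "\\n"))

# A header line marks the position; the values sit on the following line.
case_idx <- which(str_detect(lines, "^Case")) + 1
desc_idx <- which(str_detect(lines, "^Desc")) + 1
addr_idx <- which(str_detect(lines, "^Addr")) + 1

# Split on: 2+ spaces, a single space followed by "d/" or "dd/" (start of
# a numeric date), or whitespace that directly follows AM/PM.
case_pattern <- "(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)"
case_fields <- str_split(lines[case_idx], case_pattern)[[1]]
desc_fields <- str_split(lines[desc_idx], "\\s{2,}")[[1]]

print(case_fields)
print(desc_fields)
print(lines[addr_idx])
```

The lookahead keeps "January 01,2021" in one piece because "01" is followed by a comma rather than a slash, while the lookbehind separates the time from the status even when only one space lies between them. If the real PDF uses different spacing, the pattern would need adjusting.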