Extracting larger chunks of character data with stringr?

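I'm working on scraping text data from about 1,000 PDF files. I managed to import them all into RStudio and used str_subset and str_extract_all to get the smaller attributes I need. The main goal of the project is to scrape the case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized across all the individual documents. See the reproducible example below.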

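Is there a way to use these two unique markers ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I want to extract? If not, what kind of approach could I take to extract the narrative data I need from each report?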

text_data <- list("ES                     SPRINGFEILD POLICE DE     FARRELL #789\n NOTIFIED                  DATE           TIME               OFFICER\nMARITAL STATUS:       UNKNOWN\nIDENTIFIED BY:    H. POIROT                     AT:   SCENE              DATE:    01/02/1895\nFINGERPRINTS TAKEN BY                         DATE\n YES                      NO                  OBIWAN KENOBI                            01/02/1895\n
              SPRINGFEILD\n CASE#:       012-345-678\n ABC NOTIFIED:                                    ABC DATE:\n ABC OFFICER:                                           NATURE:\nCASE HISTORY\n    This is a string. There are many strings like it,but this one is mine. To be more specific,this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1,2 even\n     the next capitalized word,investigator with a colon,is a unique word where the string stops.\nINVESTIGATOR:       HERCULE POIROT             \n")

Here is the expected output.

output <- list("This is a string. There are many strings like it,is a unique word where the string stops.")

Thank you very much for your help!

Solution

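A quick approach is to use gsub with regular expressions to remove everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything from INVESTIGATOR: onward ('INVESTIGATOR:.*'). What remains is the text between those two matches.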

gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n    This is a string. There are many strings like it,but this one is mine. To be more specific,this is string 456 out of 5000 strings. It’s a case narrative string and\n                                            Case#:           012-345-678\n                          EXAMINER / INVESTIGATOR'S REPORT\n                                 CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1,2 even\n     the next capitalized word,investigator with a colon,is a unique word where the string stops.\n"
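
Since the question asks about stringr specifically, the same idea can be sketched with str_match and a capture group. This is only a minimal sketch, not the answerer's code: the shortened `report` string and the helper name `extract_narrative` are made up for illustration. It passes `dotall = TRUE` because in stringr's regex engine `.` does not match newlines by default.

```r
library(stringr)

# Hypothetical, shortened stand-in for one scraped report
report <- "CASE#: 012-345-678\nCASE HISTORY\n    First line of the narrative.\nSecond line of the narrative.\nINVESTIGATOR:       HERCULE POIROT\n"

# Hypothetical helper: return the text between the two markers,
# with runs of whitespace (including newlines) squished to single spaces
extract_narrative <- function(x) {
  pattern <- regex("CASE HISTORY(.*?)INVESTIGATOR:", dotall = TRUE)
  # Column 1 is the full match, column 2 is the first capture group
  str_squish(str_match(x, pattern)[, 2])
}

narrative <- extract_narrative(report)
```

Because str_match is vectorised, the helper should also work on a character vector holding all reports, for example `extract_narrative(unlist(text_data))`. Reports where either marker is missing come back as NA, which makes them easy to spot.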

After a lot of thought, I arrived at a solution that I think is worth sharing, so here we go:

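library(stringr)  # str_squish()
library(purrr)    # map2(), map_chr(); also provides %>%

# Flatten text_data into a single string
file_contents_unlist <- paste(unlist(text_data), collapse = " ")

# Split into lines and squish whitespace for good measure
file_contents_lines <- file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()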

# Create indices into the lines of our text data with grepl().
# The start and end vectors must line up when scraping multiple chunks of data.
# A case number sits on a single line, so start and end are the same here.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))

# The function basically says, "give me back whatever is in those indices".
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}

# map2() to iterate over the paired indices.
case_nums <- map2(index_case_num_1, index_case_num_2, pull_case_num)

# Transform to a data frame, one row per case
case_nums_df <- data.frame(case_num = map_chr(case_nums, paste, collapse = " "))

# Repeat the pattern for other fields as needed.
# Here a chunk runs from the "CASE HISTORY" line to the next "Case#: " line.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))

pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}

case_hist <- map2(index_case_hist_1, index_case_hist_2, pull_case_hist)
# Collapse each chunk's lines into one string so there is one row per case
case_hist_df <- data.frame(case_hist = map_chr(case_hist, paste, collapse = " "))

# cbind() the data frames; this is also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
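
In the sample report the narrative continues past a page break, so one possible variation on the index approach is to end at the INVESTIGATOR: line and drop the repeated page-header lines in between. The following is a minimal sketch under assumptions: the shortened `report_lines` vector, the `header_patterns` vector, and the other variable names are invented for illustration, and real reports may need different header patterns.

```r
library(stringr)

# Hypothetical, already-squished lines of one report
report_lines <- c(
  "CASE HISTORY",
  "This narrative starts on one page and",
  "Case#: 012-345-678",
  "EXAMINER / INVESTIGATOR'S REPORT",
  "it continues on another page.",
  "INVESTIGATOR: HERCULE POIROT"
)

# Assumed patterns for page-header lines that interrupt the narrative
header_patterns <- c("^Case#:", "INVESTIGATOR'S REPORT$")

# Positions of the two marker lines (assumes exactly one of each)
start <- which(str_detect(report_lines, "^CASE HISTORY$"))
end   <- which(str_detect(report_lines, "^INVESTIGATOR:"))

# Lines strictly between the two markers
between <- report_lines[(start + 1):(end - 1)]

# Drop header lines, then collapse the rest into one string
is_header <- str_detect(between, str_c(header_patterns, collapse = "|"))
narrative_lines <- str_c(between[!is_header], collapse = " ")
```

The sketch assumes each report contains exactly one CASE HISTORY line and one INVESTIGATOR: line. With several reports in one vector of lines, the start and end indices would need to be paired, as in the map2() code above.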

Thanks, everyone, for the replies. I hope this solution helps people in the future. :)