How can I extract larger chunks of character data with stringr?
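I am working on scraping text data from roughly 1000 PDF files. I managed to import them all into RStudio and used str_subset and str_extract_all to pull out the smaller attributes I need. The main goal of the project is to scrape the case-history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized across all of the individual documents. See the reproducible example below.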
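Is there a way to use these two unique markers ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I want to extract? If not, what approach could I take to extract the narrative data I need from each report?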
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it,but this one is mine. To be more specific,this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1,2 even\n the next capitalized word,investigator with a colon,is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is the expected output.
output <- list("This is a string. There are many strings like it,is a unique word where the string stops.")
Thank you very much for your help!
Solution
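A quick approach is to use gsub with regular expressions to delete everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything from INVESTIGATOR: onward ('INVESTIGATOR:.*'). What remains is the text between the two matches.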
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it,but this one is mine. To be more specific,this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1,2 even\n the next capitalized word,investigator with a colon,is a unique word where the string stops.\n"
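Since the question asks about stringr specifically, the same idea can be sketched with str_match() and a capture group. This is only an illustration under stated assumptions: the `report` string is a hypothetical, shortened stand-in for one document, and the markers are assumed to appear exactly as "CASE HISTORY" and "INVESTIGATOR:".

```r
library(stringr)

# Hypothetical, shortened stand-in for the text of one report.
report <- paste0(
  "NATURE:\nCASE HISTORY\n First part of the narrative\n",
  "continues on the next line.\nINVESTIGATOR: HERCULE POIROT \n"
)

# dotall = TRUE lets "." match newlines; the lazy ".*?" stops at the
# first "INVESTIGATOR:" that follows "CASE HISTORY".
pattern <- regex("CASE HISTORY(.*?)INVESTIGATOR:", dotall = TRUE)

# Column 1 of str_match() is the full match, column 2 is the capture group.
# str_squish() trims the ends and collapses newlines and repeated spaces.
narrative <- str_squish(str_match(report, pattern)[, 2])
narrative
```

As with the gsub version, any page headers that fall between the two markers stay in the result and would need a separate cleanup step.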
After a lot of thought, I came up with a solution that I think is worth sharing, so here goes:
library(stringr)  # str_squish(), also re-exports %>%
library(purrr)    # map2()

# Unlist text_data and collapse it into a single string.
file_contents_unlist <- paste(unlist(text_data), collapse = " ")

# Split into lines and squish whitespace for good measure.
file_contents_lines <- file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()

# Create indices into the lines of the text data with grepl(). When scraping
# multiple chunks, the start and end vectors must have the same length.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))

# Return whatever lines sit between a start index and an end index.
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}

# map2() iterates over the paired start/end indices.
case_nums <- map2(index_case_num_1, index_case_num_2, pull_case_num)

# Transform to a data frame, one row per chunk. Each chunk is collapsed to a
# single string here; recent R versions warn when as.data.frame.character()
# is called directly.
case_nums_df <- data.frame(
  case_num = vapply(case_nums, paste, character(1), collapse = " ")
)

# Repeat the pattern for other fields as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))

pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}

case_hist <- map2(index_case_hist_1, index_case_hist_2, pull_case_hist)
case_hist_df <- data.frame(
  case_hist = vapply(case_hist, paste, character(1), collapse = " ")
)

# cbind() the columns; this is also a good place to start debugging.
cases_comp <- cbind(case_nums_df, case_hist_df)
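For the roughly 1000 reports mentioned in the question, it may help to wrap the extraction in a function that returns NA when a marker is missing, so one malformed file does not stop the run. The following is a minimal base R sketch, not part of either answer above: `extract_between` and the `docs` vector are hypothetical names, and the markers are assumed to be literal strings.

```r
# Hypothetical helper: text between two literal markers, or NA if either is absent.
extract_between <- function(text, start = "CASE HISTORY", end = "INVESTIGATOR:") {
  # fixed = TRUE treats the markers as plain strings, not regular expressions.
  start_pos <- regexpr(start, text, fixed = TRUE)
  if (start_pos[1] == -1L) return(NA_character_)

  # Drop everything up to and including the start marker.
  rest <- substring(text, as.integer(start_pos) + attr(start_pos, "match.length"))

  end_pos <- regexpr(end, rest, fixed = TRUE)
  if (end_pos[1] == -1L) return(NA_character_)

  # Keep what precedes the end marker and collapse runs of whitespace.
  trimws(gsub("\\s+", " ", substr(rest, 1, as.integer(end_pos) - 1L)))
}

# Hypothetical stand-ins for the text of two reports.
docs <- c(
  "HEADER\nCASE HISTORY\n first narrative\nsecond line\nINVESTIGATOR: A\n",
  "a report with no markers"
)

narratives <- vapply(docs, extract_between, character(1), USE.NAMES = FALSE)
narratives
```

The NA entries then point to the reports that need a manual look, for example ones where the PDF text extraction mangled a marker.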
Thanks, everyone, for the replies. I hope this solution helps someone in the future. :)