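How to systematically extract data from a textbook
Hello everyone!
I am trying to systematically extract data from a textbook (pdf). Because this task does not translate easily into a reproducible example, I have provided 2 pages from the book as a sample here. These two pages contain a list of species scientific names (genus species) and a series of 2-character codes. I want to extract the scientific names of all the species, together with their codes, from the 2-page sample.
So far I have been able to recover the scientific names quite reliably, but the codes are not being extracted the way I want: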
library(pdftools)
library(tidyverse)
plants <- pdf_text("World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf") %>%
  str_split("\n") # splitting up the document by pages: result is a list of length = # pages (689)
species_full <- list()
taxa_full <- list()
use_full <- list()
for(i in 1:length(plants)){
  # for loop to search for species names across all subsetted pages
  species_full[[i]] <- plants[[i]] %>%
    str_extract("[A-Z]+[a-z]+ [a-z]+\\b") # extracting words with upper and lower case letters between margins and abbr. words
  use_full[[i]] <- plants[[i]] %>%
    str_extract("(?<=\\|).+(?=\\|)") %>% # extracting use codes
    str_split("\n") %>%
    str_extract_all("[A-Z]+[A-Z]")
}
species_full_df <- species_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  drop_na() %>%
  rename(species = ".") %>%
  filter(!species %in% c("Checklist of","Database developed")) # removing artifacts from page headers
use_full_df <- use_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  rename(code = ".") %>%
  filter(!code == "<NA>") %>%
  as.data.frame()
From this code, I get the following in species_full_df:
> head(species_full_df)
species
1 Encephalartos cupidus
2 Encephalartos cycadifolius
3 Encephalartos eugene
4 Encephalartos friderici
5 Encephalartos heenanii
6 Cycas apoa
(Note that the order is not preserved, but most of the species names are there.)
I get these results from use_full_df:
> head(use_full_df)
code
1 RBG
2 EU
3 EU
4 MA
5 ME
6 ME
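The problem: the extraction picks up 3-character codes (I only want the 2-character use codes), and it returns only one code per line (many species have several codes).
Can you suggest how to improve this process? My use of regular expressions is presumably atrocious.
Thanks in advance!
-Alex.
Solution
I would approach it differently.
First, I would rely on the package tabulizer, which does a remarkably good job of parsing the columns of a pdf into a single stream of text lines.
Then I would convert the raw lines into a tibble/data.frame so that the transformations can be vectorized.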
library(tabulizer)
library(splitstackshape)
library(tidyverse)
text_plants <- tabulizer::extract_text(file = "World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf")
df_plants <-
  read.delim(file = textConnection(text_plants), header = FALSE) %>%
  as_tibble() %>% # as_tibble is optional, but helps a lot when exploring the results of read.delim and of the following mutations.
  filter(grepl("^\\s?(World.Checklist.of.Useful.Plant|m.diazgranados@kew.org|Page *\\d+ of \\d+|\\s*$)", V1) == FALSE) %>% # Optional. Removes the header and footer lines, and blank lines.
  mutate(V1 = trimws(V1),
         is_metadata = grepl('^\\s?\\d+.*[|]', V1), # Flags the metadata lines, which always sit below a plant name
         is_plant = lead(is_metadata), # Flags the plant-name lines, which seem to always sit just above a metadata line
         plant_metadata = if_else(is_plant == TRUE, true = trimws(lead(V1)), false = NA_character_)) %>% # Moves the metadata into the same row as the plant, in a separate variable.
  filter(is_plant == TRUE) %>% # Removes all lines not listing a plant.
  rename(plant = V1) %>%
  mutate(usage_codes = str_extract(string = plant_metadata, pattern = "(?<=\\|).+(?=\\|)") %>% trimws()) %>% # Extracts the "usage codes"
  select(plant, usage_codes) %>%
  splitstackshape::cSplit(splitCols = "usage_codes", sep = " ", direction = "long") %>% # Splits the usage codes into a tidy (long) table with plants as the ID
  filter(!is.na(usage_codes)) %>%
  mutate(exists = TRUE) %>%
  pivot_wider(id_cols = plant, names_from = usage_codes, values_from = exists, values_fill = FALSE) # Pivots the tidy table into a wide format.
df_plants
# A tibble: 114 x 10
   plant                      ME    HF    PO    SU    EU    GS    MA    IF    AF
   <chr>                      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Cycas apoa K.D.Hill        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Cycas circinalis L.        TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE
 3 Cycas inermis Lour.        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Cycas media R.Br.          TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Cycas micronesica K.D.Hill TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Cycas pectinata Buch.-Ham. TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 7 Cycas revoluta Thunb.      TRUE  TRUE  FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE
 8 Cycas rumphii Miq.         TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  FALSE
 9 Cycas siamensis Miq.       TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
10 Cycas taiwaniana Carruth.  FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
# … with 104 more rows
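The pairing idea (a plant line sits directly above its metadata line) and the "exactly two capital letters" rule can also be tried on a toy vector, without the pdf. This is only a sketch in base R: the four sample lines, including the record numbers and the trailing RBG token, are made up for illustration and the real layout of the metadata lines may differ. It also swaps in a word-boundary pattern, \\b[A-Z]{2}\\b, in place of splitting on spaces, so that three-letter tokens are skipped and every code on a line is returned.

```r
# Hypothetical sample lines: the numbers and the "RBG" token are invented,
# and the real metadata layout in the PDF may differ.
lines <- c("Cycas apoa K.D.Hill",
           "1234 | ME | RBG",
           "Cycas circinalis L.",
           "5678 | ME HF PO SU | RBG")

# A metadata line starts with digits and contains a pipe.
is_metadata <- grepl("^\\s?\\d+.*[|]", lines, perl = TRUE)

# Base-R stand-in for dplyr::lead(): shift the flags up by one line,
# padding the end with FALSE. A plant line is the one just above metadata.
is_plant <- c(is_metadata[-1], FALSE)

plant    <- lines[is_plant]
metadata <- lines[which(is_plant) + 1]

# Extract every token made of exactly two capital letters; the word
# boundaries exclude three-letter tokens such as "RBG" and the digits.
codes <- regmatches(metadata, gregexpr("\\b[A-Z]{2}\\b", metadata, perl = TRUE))
names(codes) <- plant

print(codes)
```

Because gregexpr() returns all matches per element, each species keeps its full set of codes, and the resulting named list could be reshaped into a long or wide table in the same way as the pipeline above.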