
How do I parse a variable into multiple columns based on multiple conditions in R?


I'm new to R, so please bear with me. I'm looking at incarceration data and have a variable, conviction, which is a messy string that looks like this:

[1] "Ct. 1: Conspiracy to distribute"                                                                         
[2] "Aggravated Assault"                                                                                      
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"                                      
[4] "Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"

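Ideally, I'd like to do two things. First, I want to parse each Ct. into its own column. For the first three rows, the data would look like this: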

     convictions                              conviction_1                      conviction_2                    
[1,] "Ct. 1: Conspiracy to distribute"        "Conspiracy to distribute"        NA                   
[2,] "Aggravated Assault"                     "Aggravated Assault"              NA                   
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"

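But things get tricky when I reach the fourth row, because I want to parse the first part of the string (Ct. 1-6: Human Trafficking) into 6 columns, and then parse Cts. 7,8 Unlawful contact into another 2 columns.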

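Second, I then want to generate a variable convictions_total that holds the highest number found in the Ct. strings of conviction. For the three example entries I've included here, convictions_total would look like: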

[1]  1  2 36

This is the code I've used to parse a more straightforward string variable, but I'm not sure how to adapt it for this one:

cols <- data.frame(str_split_fixed(data$convictions, ",", Inf))
colnames(cols) <- paste0("conviction_", seq_along(cols))
data <- cbind(data, cols)

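Thanks in advance!

Solution

The following works for your example without much regular expression work; it is mostly digit extraction and a few other string detections: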

library(stringr)
library(magrittr)
library(purrr)
library(plyr)
library(dplyr)  # for rename_with() and starts_with(); attach after plyr

convictions_total <- sapply(stringr::str_extract_all(convictions,"\\d+"),function(x) max(as.numeric(x),1))
convictions_split <- strsplit(convictions,";")


reps <- lapply(convictions_split, FUN = function(x) {
    sapply(x, FUN = function(i) {
      # keep only the digits, hyphens and commas found in this count
      num <- paste(stringr::str_extract_all(i, "[\\d+\\-,]")[[1]], collapse = "")
      if (stringr::str_detect(num, "-")) {
        # "-" indicates a range: take largest value
        max(as.numeric(stringr::str_extract_all(num, "\\d+")[[1]]))
      } else if (stringr::str_detect(num, ",")) {
        # "," indicates a sequence: get length of sequence
        stringr::str_count(num, ",") + 1
      } else {
        # otherwise return 1
        1
      }
    })
  })

convictions_str <- lapply(convictions_split,function(x) gsub(".*\\d:?\\s(.*)$","\\1",x))

df <- purrr::map2(convictions_str,reps,rep) %>% 
  plyr::ldply(rbind) %>% 
  cbind(convictions_total,.) %>% 
  data.frame() %>% 
  dplyr::rename_with(~ gsub("X","conviction_",.x),starts_with("X"))

Output

  convictions_total                    conviction_1        conviction_2      conviction_3
1                 1        Conspiracy to distribute                <NA>              <NA>
2                 1              Aggravated Assault                <NA>              <NA>
3                 2 Possession of prohibited object criminal forfeiture              <NA>
4                36               Human Trafficking   Human Trafficking Human Trafficking
       conviction_4      conviction_5      conviction_6     conviction_7     conviction_8
1              <NA>              <NA>              <NA>             <NA>             <NA>
2              <NA>              <NA>              <NA>             <NA>             <NA>
3              <NA>              <NA>              <NA>             <NA>             <NA>
4 Human Trafficking Human Trafficking Human Trafficking Unlawful contact Unlawful contact
           conviction_9 conviction_10
1                  <NA>          <NA>
2                  <NA>          <NA>
3                  <NA>          <NA>
4 Involuntary Servitude     Smuggling

Data

convictions <- c("Ct. 1: Conspiracy to distribute","Aggravated Assault","Ct. 1: Possession of prohibited object; Ct.: 2 criminal forfeiture","Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling")

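How it works

  1. convictions_total is easy to obtain: stringr::str_extract_all pulls every number out of each element of convictions and returns a list of vectors. sapply then takes the maximum of each vector in the list (or 1 when there is no number) and returns a vector.
  2. reps is a list whose elements correspond to the elements of convictions. Each element stores a numeric vector giving the number of times each conviction count is repeated.

The code first splits convictions into a list of vectors and, for each count, extracts the following characters: digits (\\d+), hyphens (\\-) and commas (,). The logic then works by searching through these extractions:

  • First, if it finds a "-" in the conviction count, that indicates a range, and it again takes the maximum value. For example, "Ct. 1-6: Human Trafficking" returns 6 (this equals the number of counts only when the range starts at 1).
  • Next, if it finds no "-" but does find a ",", that indicates a count separator, so it counts the commas and adds one. For example, "Cts. 7,8 Unlawful contact" returns 2.
  • Everything else is assumed to be repeated only once, because it is neither a sequence nor a range.

reps
[[1]]
Ct. 1: Conspiracy to distribute 
                              1 

[[2]]
Aggravated Assault 
                 1 

[[3]]
Ct. 1: Possession of prohibited object             Ct.: 2 criminal forfeiture 
                                     1                                      1 

[[4]]
    Ct. 1-6: Human Trafficking      Cts. 7,8 Unlawful contact  Ct. 11: Involuntary Servitude 
                             6                              2                              1 
             Ct. 36: Smuggling 
                             1 

  3. convictions_str just extracts the actual conviction text. For example, the code extracts "Conspiracy to distribute" from "Ct. 1: Conspiracy to distribute", and so on for all convictions.

convictions_str
[[1]]
[1] "Conspiracy to distribute"

[[2]]
[1] "Aggravated Assault"

[[3]]
[1] "Possession of prohibited object" "criminal forfeiture"

[[4]]
[1] "Human Trafficking"     "Unlawful contact"      "Involuntary Servitude"
[4] "Smuggling"

At this point convictions_str and reps have a related structure:

  • convictions_str[[i]][1] should be repeated reps[[i]][1] times
  • convictions_str[[i]][2] should be repeated reps[[i]][2] times

  4. Using this structure, purrr::map2 calls rep to repeat the elements of convictions_str by the values stored in reps and outputs a list. The plyr::ldply line binds this list into rows and pads it with NA, because not everyone has the same number of convictions. cbind adds the convictions_total column, and rename_with changes the column names.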
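The range rule described above returns the upper bound of the range, which matches the number of counts only when the range starts at 1. As a minimal sketch of one alternative, assuming you want the width of the range instead, the helper below uses base R only. The name count_reps and its regular expressions are hypothetical and are not part of the answer's code.

```r
# Hypothetical helper: how many counts does one label stand for?
# Only the leading "Ct." / "Cts." prefix is inspected, so numbers that
# appear later in the conviction text are ignored.
count_reps <- function(label) {
  m <- regexpr("^\\s*Cts?\\.:?\\s*[0-9][0-9,\\s-]*", label, perl = TRUE)
  if (m == -1) return(1)  # no count prefix: treat as a single conviction
  prefix <- regmatches(label, m)
  # Each comma-separated piece is either a single count or a range
  parts <- strsplit(prefix, ",", fixed = TRUE)[[1]]
  sizes <- vapply(parts, function(p) {
    nums <- as.numeric(regmatches(p, gregexpr("[0-9]+", p))[[1]])
    if (length(nums) == 0) {
      0
    } else if (length(nums) == 1) {
      1
    } else {
      max(nums) - min(nums) + 1  # width of the range, e.g. 2-7 covers 6 counts
    }
  }, numeric(1))
  sum(sizes)
}

labels <- c("Ct. 1-6: Human Trafficking", " Cts. 7,8 Unlawful contact",
            "Cts. 2-7: Wire Fraud", "Aggravated Assault")
vapply(labels, count_reps, numeric(1), USE.NAMES = FALSE)
# 6 2 6 1
```

Applying count_reps inside the inner sapply would be one way to swap it in for the if/else chain, at the cost of a stricter assumption about how the prefixes are written.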

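After two days down the rabbit hole, I found a tidy version of @LMc's code that ended up working better, because calling plyr messed up other code I had written: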

library(dplyr)
library(tidyr)
library(stringr)

test_data <- 
  tibble(id = 1:5,convictions = c("Ct. 1: Conspiracy to distribute","Aggravated Assault","Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture","Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams","Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering"))
test_data <- test_data %>% 
  mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it

test_data <- test_data %>%
  separate_rows(c2,sep = ";") %>%
  mutate(c2 = str_remove(c2,"Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
  group_by(id) %>%
  mutate(conviction_number = paste0("c_",row_number())) %>%
  pivot_wider(values_from = c2,names_from = conviction_number) 


test_data <- test_data %>% 
  mutate(c2 = convictions) #again,just preserving the original variable

test_data <- test_data %>%
  separate_rows(c2,sep = ";") %>% 
  mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2,"((?<=\\-)\\d+)")),str_extract(c2,"\\d+"),str_extract(c2,"((?<=\\-)\\d+)")))) %>% 
  mutate(total_counts = ifelse(is.na(total_counts),1,total_counts)) %>% 
  group_by(id) %>% 
  slice_max(total_counts) 

This produces the following data frame:

     id convictions                                                  c_1                c_2           c_3            c_4          c2                 total_counts
  <int> <chr>                                                        <chr>              <chr>         <chr>          <chr>        <chr>                     <dbl>
1     1 Ct. 1: Conspiracy to distribute                              Conspiracy to dis~  NA            NA             NA          "Ct. 1: Conspirac~            1
2     2 Aggravated Assault                                           Aggravated Assault  NA            NA             NA          "Aggravated Assau~            1
3     3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~  NA             NA          " Ct. 2: criminal~            2
4     4 Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct.~ Human Trafficking  " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~           36
5     5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money ~ Conspiracy         " Wire Fraud" " Money Laund~  NA          " Cts. 8-28:  Mon~           28

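The first block of code parses the counts into separate rows and then pivots them back into the c_ columns. The second block does the same parsing, but then looks through each entry to pull out the numbers rather than the words.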

\\d+ finds any number, but it turned out I had data that looked like Cts. 2-7, where I wanted the value 7, not 2.

(?<=\\-)\\d+ looks for a hyphen and then extracts the number that follows it. If there is no hyphen, the code falls back to \\d+.

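Finally, slice_max collapses the data to one entry per id by keeping the row with the largest total_counts. Ties are kept by default, so pass with_ties = FALSE if you need exactly one row per id.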
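The lookbehind-with-fallback rule can be tried on its own. The following is a minimal sketch in base R rather than stringr; the names last_count and max_count are hypothetical and only illustrate the logic described above.

```r
# Hypothetical helper: the number that a single count label contributes.
# Prefer the number right after a hyphen; otherwise use the first number.
last_count <- function(label) {
  after_hyphen <- regmatches(label, regexpr("(?<=-)[0-9]+", label, perl = TRUE))
  if (length(after_hyphen) == 1) return(as.numeric(after_hyphen))
  first_number <- regmatches(label, regexpr("[0-9]+", label))
  if (length(first_number) == 1) return(as.numeric(first_number))
  1  # no number at all, e.g. "Aggravated Assault"
}

# Hypothetical helper: highest count across one ";"-separated conviction string
max_count <- function(conviction) {
  labels <- strsplit(conviction, ";", fixed = TRUE)[[1]]
  max(vapply(labels, last_count, numeric(1)))
}

max_count("Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering")
# 28
```

Because only the first number is used when there is no hyphen, a label such as "Cts. 7,8 Unlawful contact" contributes 7 rather than 8, which mirrors the behavior of the tidyverse code above.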
