How do I parse a variable into multiple columns based on multiple conditions in R?


I'm new to R, so please bear with me. I'm looking at incarceration data and have a variable, conviction, that is a messy string that looks like this:

[1] "Ct. 1: Conspiracy to distribute"                                                                         
[2] "Aggravated Assault"                                                                                      
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"                                      
[4] "Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"

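Ideally, I'd like to do two things. First, I want to parse each Ct. into its own column. For the first three rows, the data would look like this: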

     convictions                              conviction_1                      conviction_2                    
[1,] "Ct. 1: Conspiracy to distribute"        "Conspiracy to distribute"        NA                   
[2,] "Aggravated Assault"                     "Aggravated Assault"              NA                   
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"

But things get tricky when I reach the fourth row, because I want to parse the first part of the string (Ct. 1-6: Human Trafficking) into 6 columns, and then Cts. 7,8 Unlawful contact into another 2 columns.

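Second, I'd like to generate a variable convictions_total that finds the highest number in the Ct. labels within conviction. For the four example entries I've included here, convictions_total would look like:

[1]  1  1  2 36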

This is the code I used to parse a more straightforward string variable, but I'm not sure how to adapt it for this one:

cols <- data.frame(str_split_fixed(data$convictions, ",", Inf))
colnames(cols) <- paste0("conviction_",rep(1:length(cols)))
data <- cbind(data,cols)

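Thanks in advance!

Solution

The following works for your examples without much regex, relying mostly on number extraction and simple string detection: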

library(stringr)
library(magrittr)
library(purrr)
library(plyr)

convictions_total <- sapply(stringr::str_extract_all(convictions,"\\d+"),function(x) max(as.numeric(x),1))
convictions_split <- strsplit(convictions,";")


reps <- lapply(convictions_split,FUN = function(x) {
    sapply(x,FUN = function(i) {
      num <- paste(stringr::str_extract_all(i,"[\\d+\\-,]")[[1]],collapse = "")
      # "-" indicates a range: take largest value
      if (stringr::str_detect(num,"-")){
        stringr::str_extract_all(num,"\\d+") %>% 
          unlist() %>% 
          as.numeric() %>%
          max() %>%  
          return()
      # "," indicates a sequence: get length of sequence
      } else if(stringr::str_detect(num,",")){
        stringr::str_count(num, ",") + 1
      # otherwise return 1
      } else {
        return(1)
      }
    })
  })

convictions_str <- lapply(convictions_split,function(x) gsub(".*\\d:?\\s(.*)$","\\1",x))

df <- purrr::map2(convictions_str,reps,rep) %>% 
  plyr::ldply(rbind) %>% 
  cbind(convictions_total,.) %>% 
  data.frame() %>% 
  dplyr::rename_with(~ gsub("X","conviction_",.x),starts_with("X"))

Output

  convictions_total                    conviction_1        conviction_2      conviction_3
1                 1        Conspiracy to distribute                <NA>              <NA>
2                 1              Aggravated Assault                <NA>              <NA>
3                 2 Possession of prohibited object criminal forfeiture              <NA>
4                36               Human Trafficking   Human Trafficking Human Trafficking
       conviction_4      conviction_5      conviction_6     conviction_7     conviction_8
1              <NA>              <NA>              <NA>             <NA>             <NA>
2              <NA>              <NA>              <NA>             <NA>             <NA>
3              <NA>              <NA>              <NA>             <NA>             <NA>
4 Human Trafficking Human Trafficking Human Trafficking Unlawful contact Unlawful contact
           conviction_9 conviction_10
1                  <NA>          <NA>
2                  <NA>          <NA>
3                  <NA>          <NA>
4 Involuntary Servitude     Smuggling

Data

convictions <- c("Ct. 1: Conspiracy to distribute","Aggravated Assault","Ct. 1: Possession of prohibited object; Ct.: 2 criminal forfeiture","Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling")

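How it works

convictions_total is easy to get: stringr::str_extract_all extracts all the numbers from each element of convictions, which returns a list of vectors. sapply then takes the maximum of each vector in the list (or 1 if there are no numbers) and returns a vector.

reps is a list whose elements correspond to the elements of convictions. Each element is a numeric vector giving the number of times each conviction count should be repeated.

The code first splits convictions on ";" into a list of vectors (convictions_split), and from each piece extracts the following characters: digits (\\d+), dashes (\\-), and commas (,). The logic then works on these extracted strings:

First, if it finds "-" in the conviction count, that indicates a range, and it takes the largest number. For example, "Ct. 1-6: Human Trafficking" returns 6.
Next, if it does not find "-" but does find ",", that indicates a count separator. It counts the comma separators and adds one. For example, "Cts. 7,8 Unlawful contact" returns 2.
Everything else is assumed to be repeated only once, since it is neither a sequence nor a range.

reps
[[1]]
Ct. 1: Conspiracy to distribute
                              1

[[2]]
Aggravated Assault
                 1

[[3]]
Ct. 1: Possession of prohibited object             Ct.: 2 criminal forfeiture
                                     1                                      1

[[4]]
   Ct. 1-6: Human Trafficking     Cts. 7,8 Unlawful contact Ct. 11: Involuntary Servitude
                            6                             2                             1
            Ct. 36: Smuggling
                            1

convictions_str just extracts the actual conviction text. For example, the code extracts "Conspiracy to distribute" from "Ct. 1: Conspiracy to distribute", and so on for every conviction.

convictions_str
[[1]]
[1] "Conspiracy to distribute"

[[2]]
[1] "Aggravated Assault"

[[3]]
[1] "Possession of prohibited object" "criminal forfeiture"

[[4]]
[1] "Human Trafficking"     "Unlawful contact"      "Involuntary Servitude"
[4] "Smuggling"

At this point convictions_str and reps have a related structure:

convictions_str[[1]][1] should be repeated reps[[1]][1] times
convictions_str[[1]][2] should be repeated reps[[1]][2] times

Leveraging this structure, purrr::map2 uses the rep function to repeat the elements of convictions_str by the values stored in reps and outputs a list. The plyr::ldply line row-binds this list, padding with NA because not everyone has the same number of convictions. cbind adds the convictions_total column, and rename_with changes the column names.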
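One thing to watch with the range rule: taking the largest number of a range equals the number of counts only when the range starts at 1. "Cts. 2-7", for instance, covers six counts, not seven. Below is a minimal base R sketch of a width-based alternative. The helper name count_reps is hypothetical (it is not part of the answer above), and it assumes every label starts with "Ct." or "Cts." and comes before the description.

```r
# Hypothetical helper: how many counts one piece such as
# "Cts. 2-7: Wire Fraud" covers. Assumes labels start with "Ct." or "Cts.".
count_reps <- function(piece) {
  # Keep only the label before the description, e.g. "Cts. 2-7: ".
  label <- regmatches(piece, regexpr("Cts?\\.[^A-Za-z]*", piece))
  if (length(label) == 0) return(1L)  # no "Ct." label: a single conviction
  total <- 0L
  # Split on commas so "7,8" and mixed labels like "1-3,5" are both handled.
  for (part in strsplit(label, ",")[[1]]) {
    nums <- as.integer(regmatches(part, gregexpr("[0-9]+", part))[[1]])
    if (length(nums) == 0) next
    # A range "a-b" covers b - a + 1 counts; a single number covers 1.
    total <- total + (max(nums) - min(nums) + 1L)
  }
  max(total, 1L)
}

pieces <- c("Ct. 1: Conspiracy", "Cts. 2-7: Wire Fraud", "Cts. 7,8 Unlawful contact")
times  <- vapply(pieces, count_reps, integer(1), USE.NAMES = FALSE)

# rep() accepts a vector of repetition counts, one per element.
repeated <- rep(c("Conspiracy", "Wire Fraud", "Unlawful contact"), times = times)
```

Here times holds one repetition count per piece, so it could stand in for one element of reps when the ranges in the data do not all start at 1.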

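After two days down a rabbit hole, I arrived at a tidyverse version of @LMc's code, which ended up working better for me because loading plyr interfered with other code I had written: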

library(dplyr)
library(tidyr)
library(stringr)

test_data <- 
  tibble(id = 1:5,
         convictions = c("Ct. 1: Conspiracy to distribute",
                         "Aggravated Assault",
                         "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture",
                         "Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
                         "Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering"))
test_data <- test_data %>% 
  mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it

test_data <- test_data %>%
  separate_rows(c2,sep = ";") %>%
  mutate(c2 = str_remove(c2,"Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
  group_by(id) %>%
  mutate(conviction_number = paste0("c_",row_number())) %>%
  pivot_wider(values_from = c2,names_from = conviction_number) 


test_data <- test_data %>% 
  mutate(c2 = convictions) #again,just preserving the original variable

test_data <- test_data %>%
  separate_rows(c2,sep = ";") %>% 
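  # use the number after a hyphen if there is one, otherwise the first number
  mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2,"((?<=\\-)\\d+)")),
                                          str_extract(c2,"\\d+"),
                                          str_extract(c2,"((?<=\\-)\\d+)")))) %>% 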
  mutate(total_counts = ifelse(is.na(total_counts),1,total_counts)) %>% 
  group_by(id) %>% 
  slice_max(total_counts)

This produces the following data frame:

     id convictions                                                  c_1                c_2           c_3            c_4          c2                 total_counts
  <int> <chr>                                                        <chr>              <chr>         <chr>          <chr>        <chr>                     <dbl>
1     1 Ct. 1: Conspiracy to distribute                              Conspiracy to dis~  NA            NA             NA          "Ct. 1: Conspirac~            1
2     2 Aggravated Assault                                           Aggravated Assault  NA            NA             NA          "Aggravated Assau~            1
3     3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~  NA             NA          " Ct. 2: criminal~            2
4     4 Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct.~ Human Trafficking  " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~           36
5     5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money ~ Conspiracy         " Wire Fraud" " Money Laund~  NA          " Cts. 8-28:  Mon~           28

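The first chunk of code parses the counts into separate rows and then pivots them back into the c_ columns. The second chunk does the same row split, but then looks at each entry to parse out the numbers rather than the words.

\\d+ finds any run of digits, but it turned out I had data that looked like Cts. 2-7, where I wanted the value 7, not 2.

((?<=\\-)\\d+) looks for a hyphen and then parses the number that follows it. If there is no hyphen, the code falls back to \\d+, and entries with no number at all get a total of 1.

Finally, slice_max collapses the data to 1 entry per ID, keeping the row with the largest total_counts.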
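A caveat when pulling numbers out of these strings: digits inside the description itself, such as the "50" in "Smuggling 50 grams", are also numbers, so extracting every number in the string and taking the maximum would give 50 rather than 36 for that entry. Below is a minimal base R sketch that restricts the search to the count labels. The helper name highest_count is hypothetical, and it assumes each label starts with "Ct." or "Cts." and ends where the description's first letter begins.

```r
# Hypothetical helper: highest count number mentioned in the "Ct."/"Cts." labels
# of one conviction string. Digits in the description text are ignored.
highest_count <- function(conviction) {
  # Grab each label up to (not including) the first letter or ";" after it.
  labels <- regmatches(conviction, gregexpr("Cts?\\.[^A-Za-z;]*", conviction))[[1]]
  if (length(labels) == 0) return(1L)  # unlabeled entry: treat as one count
  nums <- as.integer(unlist(regmatches(labels, gregexpr("[0-9]+", labels))))
  if (length(nums) == 0) 1L else max(nums)
}

convictions <- c(
  "Ct. 1: Conspiracy to distribute",
  "Aggravated Assault",
  "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture",
  "Ct. 1-6: Human Trafficking; Cts. 7,8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
  "Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering"
)
convictions_total <- vapply(convictions, highest_count, integer(1), USE.NAMES = FALSE)
```

Because the helper works on the whole string, it needs no row splitting or slice_max step. Whether that is preferable depends on whether the per-count rows are needed for something else.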