R: How to split a character vector of i comma-separated IDs into i separate columns of a data frame
The data frame df contains two character vectors. Here are the first 10 rows:
rowid codes_raw
a 15-1132,15-1133
b 21-1091,21-1094,21-1099
c 25-9011,25-9021,25-9031,25-9099
d 31-9093,31-9099
e 33-9092,33-9099
f 37-2011,37-2019
g 39-4011,39-4021
h 47-5051,47-5099
i 49-2094,49-2095
j 49-9041
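For anyone who wants to run the code in this thread, the ten rows shown above can be recreated with a sketch like the one below. The stringsAsFactors = FALSE argument is an assumption added here so that codes_raw stays a character vector on older R versions; the helper variable n_codes is only illustrative.

```r
# Recreate the ten sample rows shown above
df <- data.frame(
  rowid = letters[1:10],
  codes_raw = c(
    "15-1132,15-1133",
    "21-1091,21-1094,21-1099",
    "25-9011,25-9021,25-9031,25-9099",
    "31-9093,31-9099",
    "33-9092,33-9099",
    "37-2011,37-2019",
    "39-4011,39-4021",
    "47-5051,47-5099",
    "49-2094,49-2095",
    "49-9041"
  ),
  stringsAsFactors = FALSE
)

# Number of codes per row (illustrative helper, not part of the question)
n_codes <- lengths(strsplit(df$codes_raw, ",", fixed = TRUE))
```

The largest value in n_codes is the number of code_ columns that any solution has to create.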
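df$codes_raw contains between 1 and i discrete identifiers for a given row. These identifiers need to be spread across i new vectors in the same data frame. The result should look like this: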
rowid codes_raw code_1 code_2 code_3 code_4
a 15-1132,15-1133 15-1132 15-1133
b 21-1091,21-1094,21-1099 21-1091 21-1094 21-1099
c 25-9011,25-9021,25-9031,25-9099 25-9011 25-9021 25-9031 25-9099
d 31-9093,31-9099 31-9093 31-9099
e 33-9092,33-9099 33-9092 33-9099
f 37-2011,37-2019 37-2011 37-2019
g 39-4011,39-4021 39-4011 39-4021
h 47-5051,47-5099 47-5051 47-5099
i 49-2094,49-2095 49-2094 49-2095
j 49-9041 49-9041
My current solution uses a separate if_else() call for each part of the string, which is clumsy. For example:
df$code_2 <- if_else(
  grepl(',', df$codes_raw),
  sub('.*,\\s*', '', df$codes_raw),
  ' ')
I would also like the solution to work when df$codes_raw contains up to 20 commas. I am looking for a more elegant, more dynamic alternative.
Solutions
Use separate():
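library(tidyr)
# Largest number of codes found in any row
n_codes <- max(lengths(strsplit(df$codes_raw, split = ",")))
# Column names code_1 ... code_n (avoids masking base::names and base::lengths)
code_names <- paste0("code_", seq_len(n_codes))
# fill = "right" pads shorter rows with NA without a warning
df %>%
  separate(codes_raw, into = code_names, sep = ",", remove = FALSE, fill = "right")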
rowid codes_raw code_1 code_2 code_3 code_4
1 a 15-1132,15-1133 15-1132 15-1133 <NA> <NA>
2 b 21-1091,21-1094,21-1099 21-1091 21-1094 21-1099 <NA>
3 c 25-9011,25-9021,25-9031,25-9099 25-9011 25-9021 25-9031 25-9099
4 d 31-9093,31-9099 31-9093 31-9099 <NA> <NA>
5 e 33-9092,33-9099 33-9092 33-9099 <NA> <NA>
6 f 37-2011,37-2019 37-2011 37-2019 <NA> <NA>
7 g 39-4011,39-4021 39-4011 39-4021 <NA> <NA>
8 h 47-5051,47-5099 47-5051 47-5099 <NA> <NA>
9 i 49-2094,49-2095 49-2094 49-2095 <NA> <NA>
10 j 49-9041 49-9041 <NA> <NA> <NA>
To generate the column names automatically, I suggest this:
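library(tidyverse)
df %>%
  # One row per code
  separate_rows(codes_raw, sep = ",") %>%
  group_by(rowid) %>%
  # Position of each code within its original row
  mutate(code_idx = row_number()) %>%
  # id_cols is passed by name; recent tidyr versions reject it positionally
  pivot_wider(id_cols = rowid, names_from = code_idx, values_from = codes_raw, names_prefix = "code_") %>%
  ungroup()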
# A tibble: 10 x 5
rowid code_1 code_2 code_3 code_4
<chr> <chr> <chr> <chr> <chr>
1 a 15-1132 15-1133 NA NA
2 b 21-1091 21-1094 21-1099 NA
3 c 25-9011 25-9021 25-9031 25-9099
4 d 31-9093 31-9099 NA NA
5 e 33-9092 33-9099 NA NA
6 f 37-2011 37-2019 NA NA
7 g 39-4011 39-4021 NA NA
8 h 47-5051 47-5099 NA NA
9 i 49-2094 49-2095 NA NA
10 j 49-9041 NA NA NA
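Or:
# Number of columns = largest number of commas in any row, plus one
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))
# fill = "right" pads shorter rows with NA without a warning
df %>%
  separate(codes_raw, into = nm, sep = ",", fill = "right")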
You said the maximum number of columns is 20, so one way to do this is with a regular expression containing capture groups (using library(namedCapture)), for example:
rowid <- c("a", "b", "c", "d", "e")
codes_raw <- c("15-1132,15-1133", "21-1091,21-1094,21-1099", "25-9011,25-9021,25-9031,25-9099", "31-9093,31-9099", "49-9041")
# Keep codes_raw as character (it would become a factor by default before R 4.0)
df <- data.frame(rowid, codes_raw, stringsAsFactors = FALSE)
library(namedCapture)
n <- 20 # Max number of columns
pattern <- "^(?P<code_1>\\d+-\\d+)" # Pattern start
for (x in 2:n) { # Add more optional columns
  pattern <- paste0(pattern, "(?:\\s*,\\s*(?P<code_", x, ">\\d+-\\d+))?")
}
pattern <- paste0(pattern, "$") # End of string anchor added
df1 <- str_match_named(df$codes_raw, pattern) # Extract column data
df1 <- df1[, colSums(df1 != "") != 0] # Remove columns that are empty in every row
df1 <- cbind(rowid, df1) # Put back the rowid column
Output:
> df1
rowid code_1 code_2 code_3 code_4
[1,] "a" "15-1132" "15-1133" "" ""
[2,] "b" "21-1091" "21-1094" "21-1099" ""
[3,] "c" "25-9011" "25-9021" "25-9031" "25-9099"
[4,] "d" "31-9093" "31-9099" "" ""
[5,] "e" "49-9041" "" "" ""
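Pattern details:
- ^ - start of string
- (?P<code_1>\d+-\d+) - a named capture group, code_1, that matches one or more digits, a hyphen, and one or more digits
- (?:\s*,\s*(?P<code_2>\d+-\d+))? - an optional non-capturing group: a comma surrounded by zero or more whitespace characters, followed by the named group code_2, which matches 1+ digits, a hyphen, and 1+ digits
- the same optional group repeats for code_3 through code_20, and $ anchors the end of the string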
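To see how the pieces of the pattern fit together without installing namedCapture, here is a minimal base R sketch. It assumes a hypothetical maximum of 4 codes instead of 20, and the test strings in x are made up for illustration. grepl() checks whether a string fits the pattern, and regexec() with regmatches() pulls out the capture groups.

```r
# Build the pattern for a small, hypothetical maximum of 4 codes
n <- 4
pattern <- "^(?P<code_1>\\d+-\\d+)"
for (x in 2:n) {
  pattern <- paste0(pattern, "(?:\\s*,\\s*(?P<code_", x, ">\\d+-\\d+))?")
}
pattern <- paste0(pattern, "$")

# Illustrative inputs: two valid strings and one that should not match
x <- c("15-1132,15-1133", "49-9041", "not a code")

# perl = TRUE is required for the (?P<name>...) named-group syntax
is_valid <- grepl(pattern, x, perl = TRUE)

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
first_row <- m[[1]]
```

Here first_row[1] holds the whole matched string, and the following elements hold code_1, code_2 and so on in order.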
Create the column names dynamically rather than typing them out. This works for any number of strings joined together.
Created on 2021-05-25 by the reprex package (v2.0.0)
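You can use str_split() from the stringr library to split the codes into a list, convert that list of vectors (of unequal length) into a matrix, and then use mutate() to join it to your original data frame. Here is an example: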
# Your example data
df <- data.frame(
  rowid = c("a", "e", "f", "g", "h", "i", "j"),
  codes_raw = c("15-1132,15-1133", "33-9092,33-9099", "37-2011,37-2019", "39-4011,39-4021", "47-5051,47-5099", "49-2094,49-2095", "49-9041")
)
library(stringr)
library(dplyr)
# Split codes_raw by comma
l <- str_split(df$codes_raw, ",")
# Get the number of codes in each row
n.codes <- sapply(l, length)
# Find the largest number of codes, and make a sequence from 1 to that number
seq.max <- seq_len(max(n.codes))
# Indexing past the end of a vector yields NA, which fills the blanks in the matrix. Convert to a data frame.
codes_in_columns <- t(sapply(l, "[", i = seq.max)) %>%
  data.frame(.)
# Set the desired column names
names(codes_in_columns) <- paste0("code_", seq.max)
# Combine the original data with the separated codes
df <- df %>% mutate(codes_in_columns)
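The same split, pad, and bind idea can be sketched in base R alone, which may help if stringr and dplyr are not available. The three-row data frame below is a hypothetical subset of the question's data, and the variable names (parts, max_codes, padded, codes, result) are illustrative.

```r
# Hypothetical sample data (a subset of the rows in the question)
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132,15-1133", "21-1091,21-1094,21-1099", "49-9041"),
  stringsAsFactors = FALSE
)

# Split each string on commas; the result is a list of unequal-length vectors
parts <- strsplit(df$codes_raw, ",", fixed = TRUE)
max_codes <- max(lengths(parts))

# Pad every vector with NA up to max_codes, then stack them as matrix rows
padded <- lapply(parts, function(p) p[seq_len(max_codes)])
codes <- do.call(rbind, padded)
colnames(codes) <- paste0("code_", seq_len(max_codes))

# Attach the new columns to the original data frame
result <- cbind(df, as.data.frame(codes, stringsAsFactors = FALSE))
```

Stacking with do.call(rbind, ...) keeps the result a matrix even when every row has a single code, a case where t(sapply(...)) would need extra care.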