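How to categorize data in a new column based on text values
I'm trying to take an existing data frame that has a state column and add a new column called Region based on each row's state. For example, any row with "CA" should be classified as "West", and any row with "IL" should be classified as "Midwest". There are four regions: West, South, Midwest, and Northeast.
I tried doing this separately for each region, in four code blocks like the following: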
south <- c("FL","KY","GA","TX","MS","SC","NC","AL","LA","AR","TN","VA","DC","MD","DE","WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
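But this seems repetitive and not the most efficient approach. I'd also like to be able to group by Year and Region so that I can compare across regions.
I'm having trouble implementing this. The first thing that came to mind was some kind of if/else loop with filter, but I know loops aren't really the R way.
The original data looks like this: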
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14,1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14,1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
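I want to add a new column called "Region" that goes through each row, looks at the state, and then adds a value to Region.
Any advice on the correct syntax for this kind of operation would be greatly appreciated. Thanks so much!
Solution
Here is a summary of the solution suggested in Gregor's comment.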
library(tidyverse)
orig_data <-
  tribble(~ID, ~state, 1, "CA", 2, "FL", 3, "DE")
region_lookup <-
  tribble(~state, ~region, "CA", "west", "FL", "south", "DE", "south")
left_join(orig_data,region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
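The lookup table does not have to be typed row by row. As a minimal sketch, assuming the per-region state vectors from the question are collected in a named list, `stack()` can turn that list into the lookup, and a single `count()` then gives totals by Year and Region. The `regions` list and the `mdata` rows below are made up for illustration and cover only a few states.

```r
library(dplyr)

# Hypothetical region vectors; only a few states are listed for brevity
regions <- list(
  South = c("FL", "DE", "TX"),
  West  = c("CA", "WA")
)

# stack() turns the named list into a two-column data frame (values, ind)
region_lookup <- stack(regions) %>%
  transmute(state = values, Region = as.character(ind))

# Made-up rows standing in for the real data
mdata <- tibble(
  state = c("DE", "DE", "CA", "FL", "CA"),
  Year  = c(1976, 1976, 1976, 1977, 1977)
)

# One join plus one count replaces the separate per-region blocks
by_year_region <- mdata %>%
  left_join(region_lookup, by = "state") %>%
  count(Year, Region, name = "total")
```

Because every region lives in the same table, comparing regions within a year only needs a filter or a plot on `by_year_region`.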
The simplest solution is a join. For that, you need a data.frame/tibble that maps every state to its region. Fortunately, that data already ships with base R:
library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb,state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region,by = c("state" = "state.abb"))
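You should now have a new column, "state.region", that you can group by. Note that the state abbreviations must be uppercase for the join to match.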
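Two details of the base R data are worth handling before relying on the join. `state.region` labels the Midwest as "North Central", and `state.abb` covers only the 50 states, so "DC" rows would come back as NA. The sketch below is one possible way to deal with both. Assigning DC to the South simply mirrors the `south` vector in the question, and the `mdata` rows are made up for illustration.

```r
library(dplyr)

# Base R lookup, with "North Central" relabeled to match the question's "Midwest"
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
) %>%
  mutate(Region = if_else(Region == "North Central", "Midwest", Region)) %>%
  # state.abb has no DC entry; placing it in the South is an assumption
  # taken from the question's own grouping
  bind_rows(tibble(state = "DC", Region = "South"))

# Made-up rows, including a lowercase abbreviation
mdata <- tibble(
  state = c("de", "IL", "DC", "CA"),
  Year  = c(1976, 1976, 1977, 1977)
)

# Normalize case first so the join keys match, then attach the region
result <- mdata %>%
  mutate(state = toupper(state)) %>%
  left_join(state_region, by = "state")
```

Checking `filter(result, is.na(Region))` afterwards is a quick way to spot any abbreviations that still failed to match.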