Categorizing data in a new column based on text values

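I am trying to take an existing data frame that has a state column and add a new column called Region based on each row's state. For example, any row with "CA" should be classified as "West", and any row with "IL" should be classified as "Midwest". There are four regions: West, South, Midwest, and Northeast.

I tried doing this separately for each region, in four code blocks like the following: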

south <- c("FL","KY","GA","TX","MS","SC","NC","AL","LA","AR","TN","VA","DC","MD","DE","WV") #16 states
south.mdata <- mdata %>% filter(state %in% south)       #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())

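But this seems repetitive and not the most efficient approach. I would also like to be able to group by Year and Region so that I can compare across regions.

I am having trouble implementing this. The first thing that came to mind was some kind of if/else loop using filter, but I know loops are not really the R way.

The original data looks like this: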

 Field.1    ID              title description                  streetaddress           city state
1      74 DE074    Cork 'n' Bottle             Route 14,1 mile south of town Rehoboth Beach    DE
2      75 DE075    Cork 'n' Bottle             Route 14,1 mile south of town Rehoboth Beach    DE
3      23 DE023          Dog House                           1200 DuPont Hwy.     Wilmington    DE
4      19 DE019          Dog House                            1200 DuPont Hwy     Wilmington    DE
5      26 DE026          Dog House                                1200 Dupont     Wilmington    DE
6      65 DE065 Henlopen Hotel Bar                           Boardwalk & Surf Rehoboth Beach    DE
  amenityfeatures             type Year notes       lon      lat
1         (M),(R)       Restaurant 1977  <NA> -75.07601 38.72095
2         (M),(R)       Restaurant 1976  <NA> -75.07601 38.72095
3         (M),(R)       Restaurant 1975  <NA> -75.58243 39.68839
4         (M),(R)       Restaurant 1976  <NA> -75.58243 39.68839
5         (M),(R)       Restaurant 1974  <NA> -75.58723 39.76705
6             (M) Bars/Clubs,Hotel 1972  <NA> -75.07712 38.72280
                                                                      status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3                                                   Google Verified Location
4                                                   Google Verified Location
5                                                   Google Verified Location
6                                                          Verified Location

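I would like to add a new column called "Region" that looks at the state in each row and assigns the matching value to Region.

Any advice on the correct syntax for doing this would be much appreciated. Thanks very much!

Solution

This is a summary of the solution suggested in Gregor's comment.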

library(tidyverse)

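# toy version of the original data
orig_data <- 
  tribble(
    ~ID, ~state,
    1,   "CA",
    2,   "FL",
    3,   "DE"
  )

# lookup table: one row per state, with the region it belongs to
region_lookup <- 
  tribble(
    ~state, ~region,
    "CA",   "west",
    "FL",   "south",
    "DE",   "south"
  )

left_join(orig_data, region_lookup)
#> Joining, by = "state"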
#> # A tibble: 3 x 3
#>      ID state region
#>   <dbl> <chr> <chr> 
#> 1     1 CA    west  
#> 2     2 FL    south 
#> 3     3 DE    south

Created on 2020-11-02 by the reprex package (v0.3.0)
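If the regions already exist as character vectors (like the `south` vector in the question), the lookup table does not have to be typed by hand. Below is a minimal sketch, assuming a named list of such vectors; the `regions` list, `toy_data`, and the few states in each region are made up for illustration and are not the asker's real data. It also shows the Year-by-Region count that the question asks for.

```r
library(dplyr)

# Hypothetical region vectors; only a few states per region, for illustration
regions <- list(
  West      = c("CA", "WA"),
  South     = c("FL", "DE", "TX"),
  Midwest   = c("IL", "OH"),
  Northeast = c("NY", "MA")
)

# stack() turns the named list into a two-column data frame (values, ind)
region_lookup <- stack(regions) %>%
  transmute(state = values, Region = as.character(ind))

# Toy stand-in for the real data frame
toy_data <- tibble(
  ID    = 1:5,
  state = c("CA", "FL", "DE", "IL", "DE"),
  Year  = c(1975, 1975, 1976, 1976, 1976)
)

# Add the Region column with a single join
toy_with_region <- left_join(toy_data, region_lookup, by = "state")

# One row per Year/Region pair, instead of one code block per region
yearly_totals <- toy_with_region %>%
  count(Year, Region, name = "yearly.total")
```

A state that is missing from the lookup gets `NA` in Region after the join, so it is worth checking for `NA` values before summarizing.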

The simplest solution is a join. For that you need a data.frame/tibble that maps every state to its region. Fortunately, that data already ships with base R:

library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb,state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>% 
  dplyr::left_join(state_region,by = c("state" = "state.abb"))

You should now have a new column "state.region" that you can group by. Note that the state abbreviations must be uppercase.
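Two details may matter for the question's data: base R's `state.abb` covers the 50 states only, so "DC" has no match, and `state.region` labels the Midwest as "North Central". Here is a rough sketch of one way to handle both and then count by Year and region. The `toy_data` tibble is made up, and placing DC in the South simply follows the `south` vector in the question.

```r
library(dplyr)

# Base R lookup: 50 states, DC is not included
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
) %>%
  # Assumption: add DC by hand and place it in the South, as in the question
  bind_rows(tibble(state = "DC", Region = "South")) %>%
  # Assumption: relabel "North Central" to match the question's "Midwest"
  mutate(Region = if_else(Region == "North Central", "Midwest", Region))

# Toy stand-in for the real data; note the lowercase "de"
toy_data <- tibble(
  state = c("CA", "IL", "DE", "DC", "de"),
  Year  = c(1975, 1975, 1976, 1976, 1976)
)

region_counts <- toy_data %>%
  mutate(state = toupper(state)) %>%        # normalize case before joining
  left_join(state_region, by = "state") %>%
  count(Year, Region, name = "yearly.total")
```

Naming the lookup columns `state` and `Region` up front keeps the `by` argument simple and gives the new column the name the question asked for.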
