R: How do I aggregate a data frame using an index column?

I have a data frame that looks like this:

head(test_df, n = 15)
# print the first 15 rows of the data frame

               value frequency                        index
1  -2.90267705917358         1                            1
2  -2.90254878997803         1                            1
3  -2.90252590179443         1                            1
4  -2.90219354629517         1                            1
5  -2.90201354026794         1                            1
6   -2.9016375541687         1                            1
7  -2.90107154846191         1                            1
8  -2.90089440345764         1                            1
9  -2.89996957778931         1                            1
10 -2.89970088005066         1                            1
11 -2.89928865432739         1                            2
12 -2.89920520782471         1                            2
13 -2.89907360076904         1                            2
14 -2.89888191223145         1                            2
15  -2.8988630771637         1                            2

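The data frame has 3 columns and 61819 rows. To aggregate it, I want the mean of the 'value' and 'frequency' columns over all rows that share the same 'index'.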

I have already found some useful links, see:

https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/

Which is the simplest way to aggregate rows (sum) by columns values the following type of data frame on R?

However, I have not been able to solve the problem yet.

test_df_ag <- stats::aggregate(test_df[1:2],by = test_df[3],FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)

   index value frequency
1      1    NA         1
2      2    NA         1
3      3    NA         1
4      4    NA         1
5      5    NA         1
6      6    NA         1
7      7    NA         1
8      8    NA         1
9      9    NA         1
10    10    NA         1
11    11    NA         1
12    12    NA         1
13    13    NA         1
14    14    NA         1
15    15    NA         1

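Since I only get NA for the 'value' column, I wonder whether this might simply be a data type problem. However, my attempts to convert the data type failed as well...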

base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"

Solution

1. Here is a base R solution.

aggregate(cbind(value,frequency) ~ index,data = test_df,FUN = mean)
#  index     value frequency
#1     1 -2.901523         1
#2     2 -2.899062         1

2. And here is a simple dplyr solution.

library(dplyr)

test_df %>%
  group_by(index) %>%
  summarize(across(1:2,mean))
## A tibble: 2 x 3
#  index value frequency
#* <int> <dbl>     <dbl>
#1     1 -2.90         1
#2     2 -2.90         1
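
Neither solution above explains the NA values in the question. One possibility, consistent with typeof() reporting "integer" for a column of decimal values, is that 'value' is stored as a factor: mean() returns NA (with a warning) for a factor. The sketch below reproduces that situation on a small hypothetical data frame (demo_df and its values are made up for illustration) and converts the column before aggregating. Whether this matches the asker's real data is an assumption.

```r
# Hypothetical reproduction: 'value' stored as a factor. This is an assumption
# about the asker's data, not something the question confirms.
demo_df <- data.frame(
  value     = factor(c("-2.9026", "-2.9025", "-2.8992", "-2.8991")),
  frequency = c(1L, 1L, 1L, 1L),
  index     = c(1L, 1L, 2L, 2L)
)

typeof(demo_df$value)  # "integer": factors are stored as integer codes
class(demo_df$value)   # "factor"

# Convert via character so that the labels, not the internal codes,
# become the numeric values.
demo_df$value <- as.numeric(as.character(demo_df$value))

# The original aggregate() call now has a numeric column to average.
demo_ag <- aggregate(demo_df[1:2], by = demo_df[3], FUN = mean)
print(demo_ag)
```

Checking class(test_df$value) on the real data would show whether this applies. If the column turns out to be character rather than factor, as.numeric() alone is enough.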

Data

test_df <-
structure(list(value = c(-2.90267705917358,-2.90254878997803,-2.90252590179443,-2.90219354629517,-2.90201354026794,-2.9016375541687,-2.90107154846191,-2.90089440345764,-2.89996957778931,-2.89970088005066,-2.89928865432739,-2.89920520782471,-2.89907360076904,-2.89888191223145,-2.8988630771637),
# frequency is 1 in every row; index is 1 for rows 1-10 and 2 for rows 11-15
frequency = rep(1L,15),index = rep(1:2,c(10,5))),class = "data.frame",row.names = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15"))

Using data.table

library(data.table)
setDT(test_df)[,lapply(.SD,mean),by = index,.SDcols = 1:2]

Try the tidyverse:

library(dplyr)

test_summary <- test_df %>%
  group_by(index) %>%
  summarise(n = n(),
            mean_value = mean(value, na.rm = TRUE),
            mean_frequency = mean(frequency, na.rm = TRUE))

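Oh, and of course, you should make sure you have checked the quality of your data and understand the assumptions behind, and the reasons for, any NA values in the data set.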
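
As a rough illustration of such a check, the base R sketch below counts missing values per column and per 'index' group before any means are computed. The data frame demo_df and its single NA are hypothetical, used here only so the example runs on its own.

```r
# Hypothetical data with one missing value, for illustration only.
demo_df <- data.frame(
  value     = c(-2.9026, NA, -2.8992, -2.8991),
  frequency = c(1L, 1L, 1L, 1L),
  index     = c(1L, 1L, 2L, 2L)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(demo_df))
print(na_per_column)

# Number of NA values in 'value' within each index group.
na_per_group <- tapply(is.na(demo_df$value), demo_df$index, sum)
print(na_per_group)

# A group containing an NA yields NA unless na.rm = TRUE is set.
mean_with_na    <- mean(demo_df$value[demo_df$index == 1])
mean_without_na <- mean(demo_df$value[demo_df$index == 1], na.rm = TRUE)
```

Dropping NA values with na.rm = TRUE changes how many observations each group mean is based on, which is why the n column in the summary above is worth keeping next to the means.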