Random sample from multiple columns

How to draw a random sample from multiple columns

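I have a dataset with multiple columns, where each row represents a product and each column contains one comment on that product. For each product we observe multiple comments, each stored in its own column.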

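Now I want to create two new datasets: (1) a dataset with only one column, containing a random sample of x comments drawn from the multiple comment columns; (2) the same as (1), but drawing the same number of comments from each column (for example, 2 comments from "comment1" and 2 comments from "comment2").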

Example data:
> commentda = data.frame(product_id = c(1,2,3,4),comment1 = c("Very good","Bad","Would buy it","Zero stars"),comment2 = c("Bad reputation","Good seller","Great service","I will buy it again"))
> commentda
  product_id     comment1            comment2
1          1    Very good      Bad reputation
2          2          Bad         Good seller
3          3 Would buy it       Great service
4          4   Zero stars I will buy it again

Solution

You can reshape the data into long format, which makes this kind of operation straightforward.

library(dplyr)
n <- 2

long_data <- commentda %>%  tidyr::pivot_longer(cols = starts_with('comment'))
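To see what the reshaped data looks like, here is a small self-contained sketch. It rebuilds the example data from the question; `name` and `value` are the default column names that `pivot_longer()` creates when `names_to` and `values_to` are not given.

```r
library(dplyr)
library(tidyr)

# Example data from the question
commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again")
)

# One row per (product, comment column) pair
long_data <- commentda %>%
  pivot_longer(cols = starts_with("comment"))

names(long_data)  # "product_id" "name" "value"
nrow(long_data)   # 8: 4 products x 2 comment columns
```

Each comment now sits in its own row of `value`, with `name` recording which column it came from, so sampling comments becomes sampling rows.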

Sample n random comments across all columns:

long_data %>% slice_sample(n = n)

Sample n random comments from each column:

long_data %>%  group_by(name) %>%  slice_sample(n = n)
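
Since the question asks for datasets with only one column, the two samples can be reduced to just the comment text. The following is a minimal sketch under a few assumptions: the output column name `comment`, the object names `sample_any` and `sample_per_column`, and the `set.seed()` call (added only to make the draw reproducible) are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
commentda <- data.frame(
  product_id = c(1, 2, 3, 4),
  comment1 = c("Very good", "Bad", "Would buy it", "Zero stars"),
  comment2 = c("Bad reputation", "Good seller", "Great service", "I will buy it again")
)

n <- 2
set.seed(42)  # assumption: fixed seed so the random draw is reproducible

long_data <- commentda %>%
  pivot_longer(cols = starts_with("comment"))

# (1) n comments drawn from all comment columns pooled together
sample_any <- long_data %>%
  slice_sample(n = n) %>%
  select(comment = value)

# (2) n comments drawn from each comment column separately
sample_per_column <- long_data %>%
  group_by(name) %>%
  slice_sample(n = n) %>%
  ungroup() %>%              # drop grouping so select() does not keep `name`
  select(comment = value)
```

With the example data, `sample_any` has 2 rows and `sample_per_column` has 4 rows (2 from each of the two comment columns). Note that `slice_sample()` samples without replacement by default, so n should not exceed the number of rows available in each group.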