How can I find customers who placed their next order before a previous order was delivered/received? In R

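I have a large database with two dates. As an example, take the "Orders" sheet of the Superstore data (http://www.tableau.com/sites/default/files/training/global_superstore.zip).

One date is the order date and the other is the shipping/delivery date (assume it is the delivery date). I want the details of all orders from customers who placed their next order without waiting for any of their previous orders to be shipped/delivered.

For example, the customer with ID "ZC-21910" placed the order with ID CA-2014-133928 on 12 June 2014, and that order shipped on 18 June 2014. However, the same customer placed the next order, with ID "IT-2014-3511710", on 13 June 2014, i.e. before 18 June 2014 (the ship date of one of the previous orders).

Ideally, all such orders (order IDs) would be filtered into a separate vector.

How can I do this in R? Or in Tableau?

Sample dataset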

> dput(df)
structure(list(customer_id = c("A","A","B","C","C"),order_id = structure(1:7,.Label = c("1","2","3","4","5","6","7"),class = "factor"),order_date = structure(c(17897,17901,17912,17902,17903,17905),class = "Date"),ship_date = structure(c(17926,17906,17914,17904,17906),class = "Date")),row.names = c(NA,-7L),class = c("tbl_df","tbl","data.frame"))

Solution

This is how I would build this workflow in R. Note that replicating this functionality in Tableau would be very difficult.

# Install packages if they are not already installed: necessary_packages => vector
necessary_packages <- c("readxl")

# Create a vector containing the names of any packages needing installation:
# new_packages => vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[,"Package"])]

# If the vector has more than 0 values, install the new packages
# (and their associated dependencies):
if(length(new_packages) > 0){install.packages(new_packages,dependencies = TRUE)}

# Initialise the packages in the session:
lapply(necessary_packages,require,character.only = TRUE)

# Store a scalar of the link to the data: durl => character scalar
durl <- "http://www.tableau.com/sites/default/files/training/global_superstore.zip"

# Store the path to the temporary directory: tmpdir_path => character scalar
tmpdir_path <- tempdir()

# Store a character scalar denoting the link to the zipped directory
# that is to be created: zip_path => character scalar
zip_path <- paste0(tmpdir_path,"/tableau.zip")

# Store a character scalar denoting the link to the unzipped directory
# that is to be created: unzip_path => character scalar
unzip_path <- paste0(tmpdir_path,"/global_superstore")

# Download the zip file: global_superstore.zip => stdout (zip_path)
download.file(durl,zip_path)

# Unzip the file into the unzip directory: tableau.zip => stdout (global_superstore)
unzip(zipfile = zip_path,exdir = unzip_path)

# Read in the excel file: df => data.frame
df <- read_xls(normalizePath(list.files(unzip_path,full.names = TRUE)))

# Regex the vector names to fit with R convention: names(df) => character vector 
names(df) <- gsub("\\W+","_",tolower(trimws(names(df),"both")))

# Split the data.frame into a list with one element per customer_id
# (split() builds the list itself, so no pre-allocation is needed):
# df_list => list
df_list <- split(df,df$customer_id)

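# Collapse to one row per order, sort by order date and test whether each
# customer ordered again before an earlier order had shipped:
# orders_prior_to_delivery => data.frame
orders_prior_to_delivery <- data.frame(do.call("rbind",Map(function(x){
  # One row per order (the raw data has one row per item): y => data.frame
  y <- unique(x[,c("customer_id","order_id","order_date","ship_date")])
  # Order the data.frame by order date: y => data.frame
  y <- y[order(y$order_date),]
  # Latest ship date among all earlier orders; comparing against later
  # orders (or the order itself) would flag nearly everything:
  # prev_ship => numeric vector
  prev_ship <- c(-Inf,cummax(as.numeric(y$ship_date))[-nrow(y)])
  # Return only the orders placed before an earlier order had shipped
  y[as.numeric(y$order_date) < prev_ship,]
},df_list)),row.names = NULL,stringsAsFactors = FALSE)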

# Unique customers and orders that were ordered prior to shipping the 
# previous order: cust_orders_prior_to_delivery => data.frame
cust_orders_prior_to_delivery <- 
  unique(orders_prior_to_delivery[,c("order_id","customer_id")])
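
To see the check in isolation, here is a minimal base R sketch on a small made-up data frame with one row per order. The first two rows reuse the IDs and dates from the question; the ship date of the second order and the whole of customer "AB-10000" are hypothetical, and the names `orders`, `prev_ship` and `early_order_ids` are only illustrative.

```r
# Toy data, one row per order. Only the first two order dates and the first
# ship date come from the question; everything else is made up.
orders <- data.frame(
  customer_id = c("ZC-21910", "ZC-21910", "AB-10000", "AB-10000"),
  order_id    = c("CA-2014-133928", "IT-2014-3511710", "X-1", "X-2"),
  order_date  = as.Date(c("2014-06-12", "2014-06-13", "2014-01-01", "2014-01-10")),
  ship_date   = as.Date(c("2014-06-18", "2014-06-17", "2014-01-05", "2014-01-12")),
  stringsAsFactors = FALSE
)

# Sort by customer, then by order date
orders <- orders[order(orders$customer_id, orders$order_date), ]

# Per customer: the latest ship date among all earlier orders
# (-Inf for a customer's first order, so it can never be flagged)
prev_ship <- ave(
  as.numeric(orders$ship_date), orders$customer_id,
  FUN = function(s) c(-Inf, cummax(s)[-length(s)])
)

# Order IDs placed before an earlier order of the same customer had shipped
early_order_ids <- orders$order_id[as.numeric(orders$order_date) < prev_ship]
early_order_ids
```

This gives the plain vector of order IDs that the question asks for. Whether an order placed on the very day an earlier one ships should count is a judgment call; the strict `<` here does not count it.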

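Edit: my earlier answer did not correctly handle the case where order date == ship date.

I assume you have already loaded the data into an object named df. You can use the first part of @hello_friend's code for that, stopping before the column names are rewritten, since the code below uses the original names such as `Customer ID`.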

library(tidyverse)
df %>% 
  distinct(`Customer ID`,`Order ID`,`Order Date`,`Ship Date`) %>% 
  arrange(`Customer ID`,`Ship Date`) %>% 
  mutate(sort_key = row_number()) %>% 
  pivot_longer(c(`Order Date`,`Ship Date`),names_to = "Activity",names_pattern = "(.*) Date",values_to = "Date") %>% 
  mutate(Activity = factor(Activity,ordered = TRUE,levels = c("Order","Ship")),Open = if_else(Activity == "Order",1,-1)) %>% 
  group_by(`Customer ID`) %>% 
  arrange(Date,sort_key,Activity,.by_group = TRUE) %>% 
  mutate(Open = cumsum(Open)) %>% 
  ungroup %>% 
  filter(Open > 1,Activity == "Order") %>% 
  select(`Customer ID`,`Order ID`)

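First, take only the distinct orders and customer IDs; otherwise multiple items from the same order cause confusion and lead to wrong results. Then pivot the data so that each order becomes two rows, each representing a different activity: ordering or shipping. Next, create a running total of the number of open orders. You are looking for the points where it reaches two or more.

I use an ordered factor for the activity to ensure that an order is always opened before it is closed. This matters when the order date and the ship date are the same.

I use a special sort_key column to make sure that I close old orders before opening new ones when an earlier order ships on the same day the customer places a new one. You may want the opposite logic.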
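
The running-total idea can also be sketched in base R for a single customer. This is only an illustration under assumptions: the data frame, its dates and the names `events`, `change` and `overlapping` are all made up, and the tie-breaking mirrors the description above (ship-date rank first, then "Order" before "Ship").

```r
# Hypothetical orders for one customer (all dates are made up)
orders <- data.frame(
  order_id   = c("o1", "o2", "o3"),
  order_date = as.Date(c("2020-01-01", "2020-01-03", "2020-01-10")),
  ship_date  = as.Date(c("2020-01-05", "2020-01-10", "2020-01-10")),
  stringsAsFactors = FALSE
)

# Tie-breaker: position of each order when sorted by ship date
sort_key <- rank(as.numeric(orders$ship_date), ties.method = "first")

# One event per activity: +1 when an order is placed, -1 when it ships
events <- data.frame(
  order_id = rep(orders$order_id, times = 2),
  activity = rep(c("Order", "Ship"), each = nrow(orders)),
  date     = c(orders$order_date, orders$ship_date),
  change   = rep(c(1, -1), each = nrow(orders)),
  sort_key = rep(sort_key, times = 2),
  stringsAsFactors = FALSE
)

# Sort by date, then by ship-date rank, then "Order" before "Ship"
events <- events[order(events$date, events$sort_key, events$activity), ]

# Running number of open orders after each event
events$open <- cumsum(events$change)

# Orders that were placed while another order was still open
overlapping <- events$order_id[events$open > 1 & events$activity == "Order"]
overlapping
```

Here o2 is flagged because o1 is still open on 3 January, while o3 is not flagged: o2 ships on the same day o3 is placed and is closed first.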

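All of this assumes that a given Customer ID and Order ID combination appears only once in the data, i.e. with a single ship date. That is actually not true in your data set, as you can see: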

df %>% group_by(`Customer ID`,`Order ID`) %>% filter(n_distinct(`Ship Date`)> 1) %>% select(1:9)
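
The same check can be sketched in base R. The toy item-level data below is hypothetical (it is not taken from the Superstore file), and `items`, `key` and `n_ship` are illustrative names.

```r
# Hypothetical item-level rows: order "1" of customer "A" has two ship dates
items <- data.frame(
  customer_id = c("A", "A", "A", "B"),
  order_id    = c("1", "1", "2", "3"),
  ship_date   = as.Date(c("2020-01-05", "2020-01-07", "2020-01-09", "2020-01-04")),
  stringsAsFactors = FALSE
)

# Count the distinct ship dates per customer/order combination
key    <- paste(items$customer_id, items$order_id, sep = " / ")
n_ship <- tapply(as.numeric(items$ship_date), key, function(d) length(unique(d)))

# Combinations with more than one ship date
multi_ship <- names(n_ship)[n_ship > 1]
multi_ship
```

How to treat such orders (for example, using the latest ship date per order) depends on what "delivered" should mean for your analysis.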
