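How to find customers who placed their next order before an earlier order was delivered/received? In R
I have a large database with two dates. As an example, take the "Orders" table of the Superstore data (http://www.tableau.com/sites/default/files/training/global_superstore.zip).
One date is the order date and the other is the ship/delivery date (assume it is the delivery date). I want the details of all orders from customers who placed their next order without waiting for any previous order to be shipped/delivered.
For example, the customer with ID "ZC-21910" placed the order with ID CA-2014-133928 on 12 June 2014, and that order shipped on 18 June 2014. However, the same customer placed the next order, with ID "IT-2014-3511710", on 13 June 2014, i.e. before 18 June 2014 (the ship date of one of the previous orders).
Ideally, all such orders (order IDs) would be filtered into a separate vector.
How can I do this in R? Or in Tableau?
Sample dataset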
> dput(df)
structure(list(customer_id = c("A","A","B","C","C"),order_id = structure(1:7,.Label = c("1","2","3","4","5","6","7"),class = "factor"),order_date = structure(c(17897,17901,17912,17902,17903,17905),class = "Date"),ship_date = structure(c(17926,17906,17914,17904,17906),class = "Date")),row.names = c(NA,-7L),class = c("tbl_df","tbl","data.frame"))
Solution
Here is how I would build this workflow in R. Note that replicating this functionality in Tableau would be very difficult.
# Install packages if they are not already installed: necessary_packages => vector
necessary_packages <- c("readxl")
# Create a vector containing the names of any packages needing installation:
# new_packages => vector
new_packages <- necessary_packages[!(necessary_packages %in%
  installed.packages()[,"Package"])]
# If the vector has more than 0 values, install the new packages
# (and their associated dependencies):
if(length(new_packages) > 0){install.packages(new_packages,dependencies = TRUE)}
# Initialise the packages in the session:
lapply(necessary_packages,require,character.only = TRUE)
# Store a scalar of the link to the data: durl => character scalar
durl <- "http://www.tableau.com/sites/default/files/training/global_superstore.zip"
# Store the path to the temporary directory: tmpdir_path => character scalar
tmpdir_path <- tempdir()
# Store a character scalar denoting the link to the zipped directory
# that is to be created: zip_path => character scalar
zip_path <- paste0(tmpdir_path,"/tableau.zip")
# Store a character scalar denoting the link to the unzipped directory
# that is to be created: unzip_path => character scalar
unzip_path <- paste0(tmpdir_path,"/global_superstore")
# Download the zip file: global_superstore.zip => stdout (zip_path)
download.file(durl,zip_path)
# Unzip the file into the unzip directory: tableau.zip => stdout (global_superstore)
unzip(zipfile = zip_path,exdir = unzip_path)
# Read in the excel file: df => data.frame
df <- read_xls(normalizePath(list.files(unzip_path,full.names = TRUE)))
# Regex the vector names to fit with R convention: names(df) => character vector
names(df) <- gsub("\\W+","_",tolower(trimws(names(df),"both")))
# Keep one row per order: the raw data has one row per order line item, and
# rows of the same order would otherwise be compared with each other:
# orders => data.frame
orders <- unique(df[,c("customer_id","order_id","order_date","ship_date")])
# Split the data.frame into a list by the customer_id: df_list => list
df_list <- split(orders,orders$customer_id)
# Sort the data (by date) and test whether or not each customer waited for their
# order before ordering again: orders_prior_to_delivery => data.frame
orders_prior_to_delivery <- data.frame(do.call("rbind",lapply(df_list,function(x){
  # Order the data.frame: y => data.frame
  y <- x[order(x$order_date),]
  # Latest ship date among all earlier orders: prev_ship => numeric vector
  prev_ship <- cummax(as.numeric(y$ship_date))[-nrow(y)]
  # Return only the observations where the customer didn't wait, i.e. the
  # order was placed before an earlier order had shipped:
  y[c(FALSE,as.numeric(y$order_date[-1]) < prev_ship),]
})),row.names = NULL,stringsAsFactors = FALSE)
# Unique customers and orders that were ordered prior to shipping the
# previous order: cust_orders_prior_to_delivery => data.frame
cust_orders_prior_to_delivery <-
  unique(orders_prior_to_delivery[,c("order_id","customer_id")])
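To make the comparison easier to check by hand, here is a minimal sketch of the same idea on made-up data: each order date is compared with the latest ship date among that customer's earlier orders, and the matching order IDs are collected in a vector, as the question asks. The data frame `toy`, its values, and the names `prev_ship` and `early_order_ids` are hypothetical and are not taken from the Superstore file.

```r
# Hypothetical toy data: one row per order (not from the Superstore file).
toy <- data.frame(
  customer_id = c("A", "A", "A", "B", "B"),
  order_id    = c("o1", "o2", "o3", "o4", "o5"),
  order_date  = as.Date(c("2014-06-12", "2014-06-13", "2014-06-20",
                          "2014-07-01", "2014-07-10")),
  ship_date   = as.Date(c("2014-06-18", "2014-06-15", "2014-06-25",
                          "2014-07-05", "2014-07-12")),
  stringsAsFactors = FALSE
)

# Sort by customer, then by order date.
toy <- toy[order(toy$customer_id, toy$order_date), ]

# Latest ship date among each customer's earlier orders
# (NA for a customer's first order, which has nothing to overlap with).
prev_ship <- ave(as.numeric(toy$ship_date), toy$customer_id,
                 FUN = function(s) c(NA, cummax(s)[-length(s)]))

# An order overlaps when it was placed before that date.
overlap <- !is.na(prev_ship) & as.numeric(toy$order_date) < prev_ship

# Character vector of the order IDs placed before an earlier order shipped.
early_order_ids <- toy$order_id[overlap]
early_order_ids
```

In this toy data only order "o2" qualifies: it was placed on 13 June while "o1" did not ship until 18 June.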
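Edit: my previous answer did not correctly handle the case where the order date equals the ship date.
I assume you have already loaded the data into an object called df. You can use the first part of @hello_friend's code to do that.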
library(tidyverse)
df %>%
  distinct(`Customer ID`,`Order ID`,`Order Date`,`Ship Date`) %>%
  arrange(`Customer ID`,`Ship Date`) %>%
  mutate(sort_key = row_number()) %>%
  pivot_longer(c(`Order Date`,`Ship Date`),names_to = "Activity",names_pattern = "(.*) Date",values_to = "Date") %>%
  mutate(Activity = factor(Activity,ordered = TRUE,levels = c("Order","Ship")),Open = if_else(Activity == "Order",1,-1)) %>%
  group_by(`Customer ID`) %>%
  arrange(Date,sort_key,Activity,.by_group = TRUE) %>%
  mutate(Open = cumsum(Open)) %>%
  ungroup() %>%
  filter(Open > 1,Activity == "Order") %>%
  select(`Customer ID`,`Order ID`)
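First, take only the distinct orders and customer IDs; otherwise multiple items from the same order cause confusion and lead to wrong results. Then pivot the data so that each order becomes two rows, each representing a different activity: ordering or shipping. We then create a running total of the number of open orders. You are looking for the points where it reaches two or more.
I use an ordered factor for the activity to make sure an order is always opened before it is closed. This matters when the order date and the ship date are the same.
I use the special sort_key column to make sure that I close the old order before opening the new one, in case a customer orders on the same day an earlier order ships. You may want the opposite logic.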
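The running count and its two tie-breaking rules can be traced by hand on a small example. The following is a base R sketch for a single made-up customer; the data frame `orders`, its values, and the names `events`, `change`, `open` and `flagged` are hypothetical, and the sketch only mirrors the logic described above rather than reproducing the pipeline exactly.

```r
# Hypothetical toy orders for a single customer (one row per order).
orders <- data.frame(
  order_id   = c("o1", "o2", "o3"),
  order_date = as.Date(c("2014-06-12", "2014-06-13", "2014-06-18")),
  ship_date  = as.Date(c("2014-06-18", "2014-06-13", "2014-06-20")),
  stringsAsFactors = FALSE
)

# sort_key: position of each order when sorted by ship date.
orders <- orders[order(orders$ship_date), ]
orders$sort_key <- seq_len(nrow(orders))

# Two events per order: +1 when it is placed, -1 when it ships.
events <- data.frame(
  order_id = rep(orders$order_id, 2),
  sort_key = rep(orders$sort_key, 2),
  date     = c(orders$order_date, orders$ship_date),
  change   = rep(c(1, -1), each = nrow(orders)),
  stringsAsFactors = FALSE
)

# Sort by date, then sort_key (earlier-shipping orders first),
# then "order" events (+1) before "ship" events (-1).
events <- events[order(events$date, events$sort_key, -events$change), ]

# Running number of open orders after each event.
events$open <- cumsum(events$change)

# Orders placed while at least one other order was still open.
flagged <- events$order_id[events$change == 1 & events$open > 1]
flagged
```

Here "o2" is flagged because "o1" was still open when it was placed. "o3" is not flagged, even though it was placed on the day "o1" shipped, because the sort_key ordering closes "o1" first.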
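All of this assumes that a given combination of Customer ID and Order ID appears only once in the data, with a single ship date. That is in fact not true for your dataset, as you can see: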
df %>% group_by(`Customer ID`,`Order ID`) %>% filter(n_distinct(`Ship Date`)> 1) %>% select(1:9)
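One possible way to deal with such orders is to collapse them to a single row each before counting. The sketch below assumes, and this is only an assumption rather than something the data dictates, that an order counts as shipped once its last part has shipped, so it keeps the latest ship date per customer and order. The data frame `items` and its values are made up for illustration.

```r
# Hypothetical toy rows: order "o1" was shipped in two parts.
items <- data.frame(
  customer_id = c("A", "A", "A"),
  order_id    = c("o1", "o1", "o2"),
  order_date  = as.Date(c("2014-06-12", "2014-06-12", "2014-06-16")),
  ship_date   = as.Date(c("2014-06-14", "2014-06-18", "2014-06-20")),
  stringsAsFactors = FALSE
)

# Assumption: an order is only "shipped" once its last part has shipped.
# Sort so that the latest ship date of each order comes first ...
items <- items[order(items$customer_id, items$order_id,
                     -as.numeric(items$ship_date)), ]

# ... then keep only the first row per customer/order pair.
per_order <- items[!duplicated(items[, c("customer_id", "order_id")]), ]
per_order
```

If the first shipment should count instead, dropping the minus sign in the sort keeps the earliest ship date.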