是否有加速 sparklyr 随机样本和连接的示例？

如何解决是否有加速 sparklyr 随机样本和连接的示例？

我正在寻求帮助，试图了解如何让 sparklyr 以更智能、更快速的方式处理建模数据集的日常工作。

在 Hive 中使用结构化数据，我启用了相同的执行程序和 RAM 配置，这样我就可以在不到 15 分钟的时间内在 pyspark 中完成相同的流程……但是使用 sparklyr，大约需要 90分钟。

这个过程是我需要从一个巨大的数据集采样到数千个数据集......假设我选择伯努利目标样本的大小为 10,000 TRUE 和 20,000 FALSE，忽略分桶或暂时对这些表进行分区。

使用 sparklyr，从 Hive 查询，我是这样实现的：

referenceData<-"datamart.archivedFile" # About 200 million rows and 1000 columns
nonTargetCriteria<-"score>0 AND MonthsSinceActivity>12"  # Drops to about 80% of the file
targetCriteria<-"MonthsSinceActivity>0 AND MonthsSinceActivity<4" # Drops to about 1% of the file
WHERE.NT<-paste("WHERE ",nonTargetCriteria)
WHERE.T<-paste("WHERE ",targetCriteria)
nT<- 10000 # sample size of target behavior (let's say response is coded as target=1)
nNT<- 20000 # sample size of non-targets (let's say non-response is coded as target=0)
join_vars <- c('id','address')
names(join_vars) <- c('id','address')

## First I count my responders available
n.targetSDF <- hive_context(sc) %>% invoke('sql',paste("select count(1) AS N,",paste("SUM(CASE WHEN ",targetCriteria," THEN 1 ELSE 0 END)  nTar")," from ",targetData)) %>% sdf_collect()

## Then I count the full available population in the reference data set subject to criteria
n.referenceSDF <- hive_context(sc) %>% invoke('sql',nonTargetCriteria," THEN 1 ELSE 0 END) nNonTar"),referenceData)) %>% sdf_collect()

# Now I pull the target=1 group and join them to the earlier archived data
targetSDF.S <- hive_context(sc) %>% invoke('sql',paste("select ",paste0(join_vars,collapse=","),targetData,WHERE.T)) %>% 
  sdf_sample(fraction = 2*nT/n.targetSDF[2]) ## Bring in the count from above to define the fraction.

join.S <- tbl(sc,referenceData) %>% inner_join(targetSDF.S,by=join_vars) %>% mutate(target=1)

# Now I initialize the response=0 data 
referenceSDF.S <- hive_context(sc) %>% invoke('sql',paste("select *,0 target from ",referenceData,WHERE.NT)) 

# Bring in the count from above to define the fraction.
N.NT<-n.referenceSDF[2]

# Union the two samples and collect into R.
df <- sdf_bind_rows(join.S,referenceSDF.S %>% sdf_sample(fraction = 1.2*nNT/N.NT)) %>% sdf_collect()