微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

是否有加速 sparklyr 随机样本和连接的示例?

如何解决是否有加速 sparklyr 随机样本和连接的示例?

我正在寻求帮助,试图了解如何让 sparklyr 以更智能、更快速的方式处理建模数据集的日常工作。

在 Hive 中使用结构化数据,我启用了相同的执行程序和 RAM 配置,这样我就可以在不到 15 分钟的时间内在 pyspark 中完成相同的流程……但是使用 sparklyr,大约需要 90分钟。

这个过程是我需要从一个巨大的数据集采样到数千个数据集......假设我选择伯努利目标样本的大小为 10,000 TRUE 和 20,000 FALSE,忽略分桶或暂时对这些表进行分区。

使用 sparklyr,从 Hive 查询,我是这样实现的:

referenceData<-"datamart.archivedFile" # About 200 million rows and 1000 columns
nonTargetCriteria<-"score>0 AND MonthsSinceActivity>12"  # Drops to about 80% of the file
targetCriteria<-"MonthsSinceActivity>0 AND MonthsSinceActivity<4" # Drops to about 1% of the file
WHERE.NT<-paste("WHERE ",nonTargetCriteria)
WHERE.T<-paste("WHERE ",targetCriteria)
nT<- 10000 # sample size of target behavior (let's say response is coded as target=1)
nNT<- 20000 # sample size of non-targets (let's say non-response is coded as target=0)
join_vars <- c('id','address')
names(join_vars) <- c('id','address')

## First I count my responders available
n.targetSDF <- hive_context(sc) %>% invoke('sql',paste("select count(1) AS N,",paste("SUM(CASE WHEN ",targetCriteria," THEN 1 ELSE 0 END)  nTar")," from ",targetData)) %>% sdf_collect()

## Then I count the full available population in the reference data set subject to criteria
n.referenceSDF <- hive_context(sc) %>% invoke('sql',nonTargetCriteria," THEN 1 ELSE 0 END) nNonTar"),referenceData)) %>% sdf_collect()

# Now I pull the target=1 group and join them to the earlier archived data
targetSDF.S <- hive_context(sc) %>% invoke('sql',paste("select ",paste0(join_vars,collapse=","),targetData,WHERE.T)) %>% 
  sdf_sample(fraction = 2*nT/n.targetSDF[2]) ## Bring in the count from above to define the fraction.

join.S <- tbl(sc,referenceData) %>% inner_join(targetSDF.S,by=join_vars) %>% mutate(target=1)

# Now I initialize the response=0 data 
referenceSDF.S <- hive_context(sc) %>% invoke('sql',paste("select *,0 target from ",referenceData,WHERE.NT)) 

# Bring in the count from above to define the fraction.
N.NT<-n.referenceSDF[2]

# Union the two samples and collect into R.
df <- sdf_bind_rows(join.S,referenceSDF.S %>% sdf_sample(fraction = 1.2*nNT/N.NT)) %>% sdf_collect() 

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。