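Databricks Spark cluster MLlib linear regression performance same as Spark MLlib on local laptop?
Using the same code, I ran a toy linear regression on 10,000 rows on both my laptop (16 GB RAM, 8 cores) and an Azure Databricks 7.3 LTS cluster with 8 workers (16 GB RAM and 8 cores each). Both finish the regression in about the same time, roughly 240 seconds. Is this a configuration problem, or am I wrong to expect this to be faster on the cluster?
Code
Setup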
sc <- sparklyr::spark_connect(master = "local")
library(magrittr)
n = 10000
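##### note: the 0 bounds/means and the sd of num2 below are assumed; this copy of the post truncated those arguments
df <-
  data.frame(
    num1 = runif(n, 0, 10),
    num2 = rnorm(n, 100, 10),
    fac1num = sample(1:50, n, replace = TRUE),
    fac2num1 = sample(1:50, n, replace = TRUE),
    fac2num2 = sample(1:50, n, replace = TRUE),
    noise = rnorm(n, 0, 10)
  )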
##### my real problem uses factors with ~50 and ~2,000 levels each
df$fac1 = state.name[df$fac1num]
df$fac2 = paste0(state.name[df$fac2num1],state.name[df$fac2num2])
df$y = df$num1 + df$num2 * 12 - log(df$fac1num) + df$fac2num1*df$fac2num2/1000 + df$noise
##### copy from R to Spark memory
df_tbl <- sparklyr::copy_to(sc,df,"df_spark")
Benchmark
### ~240 seconds (same as local laptop!)
system.time(
mdl_df <- sparklyr::ml_linear_regression(df_tbl,formula = y ~ num1 * fac1 * fac2 + num2)
)
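For a sense of scale, the formula in the benchmark expands into a very wide design matrix. The following is a rough base-R sketch of the coefficient count, assuming R-style treatment contrasts (Spark's own formula encoding may differ in detail). `count_terms` is a hypothetical helper, the seed is arbitrary, and the factors are regenerated here only so the snippet runs on its own without Spark.

```r
# Rough count of the coefficients implied by y ~ num1 * fac1 * fac2 + num2,
# assuming R-style treatment contrasts (one level dropped per factor).
# count_terms is a hypothetical helper, not part of sparklyr.
count_terms <- function(levels_fac1, levels_fac2) {
  # intercept + fac1 + fac2 + fac1:fac2 adds up to levels_fac1 * levels_fac2 columns
  factor_columns <- levels_fac1 * levels_fac2
  # crossing with num1 doubles that (one num1 slope per column); num2 adds one more
  2 * factor_columns + 1
}

set.seed(1)  # arbitrary seed, for reproducibility only
n <- 10000
fac1 <- state.name[sample(1:50, n, replace = TRUE)]
fac2 <- paste0(state.name[sample(1:50, n, replace = TRUE)],
               state.name[sample(1:50, n, replace = TRUE)])

# number of coefficients for the toy data, to compare against the n rows
n_terms <- count_terms(length(unique(fac1)), length(unique(fac2)))
n_terms
```

Under these assumptions the count is far larger than the 10,000 rows being fit, which may be worth keeping in mind when comparing the laptop and cluster timings.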
Alternatives tried
##### copy explicitly to Spark memory (I don't think this does anything here since it is already in memory,
##### but I do this in my real problem after SQL server has done some work)
system.time(
df_tbl_cached <- df_tbl %>% dplyr::compute()
)
##### ~230 seconds (I think same as above?)
system.time(
mdl_df_cached <- sparklyr::ml_linear_regression(df_tbl_cached,formula = y ~ num1 * fac1 * fac2 + num2)
)
##### just playing around wondering if I needed to do something different
system.time(
df_tbl_persist <- sparklyr::sdf_persist(df_tbl,storage.level = "MEMORY_ONLY",name = "df_tbl_persist_spark")
)
##### ~230 seconds
system.time(
mdl_df_persist <- sparklyr::ml_linear_regression(df_tbl_persist,formula = y ~ num1 * fac1 * fac2 + num2)
)
##### playing around again with Spark memory
system.time({
sparklyr::sdf_register(df_tbl,"df_tbl_register")
sparklyr::tbl_cache(sc,"df_tbl_register")
})
##### ~230 seconds
system.time(
mdl_df_register <- sparklyr::ml_linear_regression(df_tbl,formula = y ~ num1 * fac1 * fac2 + num2)
)