微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

scala – Spark随机森林二进制分类器指标

在Spark Mllib(F score,AUROC,AUPRC等)中训练随机森林二元分类器模型时,我们如何获得模型指标?

问题是BinaryClassificationMetrics采用概率,而RandomForest分类器的预测方法返回离散值0或1.

见:https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

RandomForest.trainClassifier没有任何clearThreshold方法,这将使其返回概率而不是离散的0或1标签.

解决方法

我们需要使用基于新的ml DataFrames的API来获取概率,而不是基于RDD的mllib API.

更新

以下是Spark文档的更新示例,以使用BinaryClassificationEvaluator并显示指标:接收器操作特性下的区域(AUROC)和精确调用曲线下的区域(AUPRC).

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString,StringIndexer,VectorIndexer}

// Load and parse the data file,converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")

// Index labels,adding Metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setoutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features,and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setoutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData,testData) = data.randomSplit(Array(0.7,0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setoutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer,featureIndexer,rf,labelConverter))

// Train model.  This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions
  .select("indexedLabel","rawPrediction","prediction")
  .show()

val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")

def printlnMetric(metricName: String): Unit = {
  println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}

printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。

相关推荐