How to fix a TypeError in MLlib's gradient-boosted trees

I tried to run the gradient-boosted trees algorithm on data with mixed column types:

[('feature1', 'bigint'), ('feature2', 'int'), ('label', 'double')]

I tried the following:
from pyspark.sql import functions as F
from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

vectorAssembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
data_assembled = vectorAssembler.transform(data)
data_assembled = data_assembled.select(F.col("features"), F.col("label")) \
    .rdd \
    .map(lambda row: LabeledPoint(MLLibVectors.fromML(row.label), MLLibVectors.fromML(row.features)))
(trainingData, testData) = data_assembled.randomSplit([0.9, 0.1])
model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo={}, numIterations=100)
But I get the following error:

TypeError: Unsupported vector type <class 'float'>

None of my column types are actually float, though. Also, in case it's relevant, feature2 is binary.
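The error message has nothing to do with the column types in the DataFrame; it is raised by Vectors.fromML itself, which dispatches on the type of its argument and rejects anything that is not an ML vector. The sketch below is a hypothetical mimic of that dispatch (DenseVector and from_ml here are stand-ins defined for illustration, not the real pyspark classes) to show why passing the label, a plain Python float, produces exactly this message:

```python
class DenseVector:
    """Stand-in for pyspark.ml.linalg.DenseVector (illustration only)."""
    def __init__(self, values):
        self.values = list(values)

def from_ml(vec):
    # Mimics pyspark.mllib.linalg.Vectors.fromML: only ML vector
    # types are accepted; anything else raises a TypeError.
    if isinstance(vec, DenseVector):
        return DenseVector(vec.values)
    raise TypeError("Unsupported vector type %s" % type(vec))

# A vector converts fine:
from_ml(DenseVector([1.0, 2.0]))

# But passing the label (a float) instead of the features fails:
try:
    from_ml(0.5)
except TypeError as e:
    print(e)  # Unsupported vector type <class 'float'>
```

In the question's code, `MLLibVectors.fromML(row.label)` hits exactly this branch, because row.label is a float, not a vector.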
Solution

I ended up avoiding the mllib implementation and using Spark ML instead:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

vectorAssembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
data_assembled = vectorAssembler.transform(data)
data_assembled = data_assembled.select(F.col("label"), F.col("features"))
(trainingData, testData) = data_assembled.randomSplit([0.7, 0.3])
gbt_model = GBTRegressor(featuresCol="features", maxIter=10).fit(trainingData)
Python has no separate double type (everything is a float), so I assumed the mapping from pyspark was converting my labels to float. In hindsight, the real problem is in the LabeledPoint mapping itself: MLLibVectors.fromML(row.label) applies a vector conversion to the label, which is a plain float, and fromML only accepts ML vectors. Writing LabeledPoint(row.label, MLLibVectors.fromML(row.features)) would avoid the TypeError, but switching to Spark ML sidesteps the conversion entirely.