如何解决Scala 中的字符串匹配 Spark ML 算法
我是 Scala 和 Spark ML 的新手。尝试根据 PySpark String Matching 的推荐创建字符串匹配算法。基于它,到目前为止我能够在下面实现
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF,MinHashLSH,NGram,RegexTokenizer}
import spark.implicits._
val vendorData = spark.read.option("header","true").option("inferSchema","true").json(path = "Data/*.json").as[vendorData]
// Load IMDB file into an Dataset
val imdbData = spark.read.option("header","True").option("inferSchema","True").option("sep","\t").csv(path = "Data/title.basics.tsv").as[imdbData]
// Remove Special chaacters
val newVendorData = vendorData.withColumn("newtitle",functions.regexp_replace(vendorData.col("title"),"[^A-Za-z0-9_]",""))
val newImdbData = imdbData.withColumn("newprimaryTitle",functions.regexp_replace(imdbData.col("primaryTitle"),""))
//Algo to find match percentage
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
val pipeline = new Pipeline().setStages(Array(tokenizer,ngram,vectorizer,lsh))
val model = pipeline.fit(newVendorData.select("newtitle"))
val vendorHashed = model.transform(newVendorData.select("newtitle"))
val imdbHashed = model.transform(newImdbData.select("newprimaryTitle"))
model.stages.last.asInstanceOf[ml.feature.MinHashLSHModel].approxSimilarityJoin(vendorHashed,imdbHashed,.85).show()
运行时出现以下错误。在进一步调查中,我发现问题出在第 1 行:
val model = pipeline.fit(newVendorData.select("newtitle"))
但看不到它是什么。
Exception in thread "main" java.lang.IllegalArgumentException: text does not exist. Available: newtitle
at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:109)
at org.apache.spark.ml.Pipeline.$anonfun$transformSchema$4(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at MatchingJob$.$anonfun$main$1(MatchingJob.scala:84)
at MatchingJob$.$anonfun$main$1$adapted(MatchingJob.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at MatchingJob$.main(MatchingJob.scala:43)
at MatchingJob.main(MatchingJob.scala)
不确定我在做什么。
我的输入如下:
+------------------+
| newtitle|
+------------------+
| BhaagMilkhaBhaag|
| Fukrey|
| DilTohBacchaHaiJi|
|IndiasJungleHeroes|
| HrudayaGeethe|
**newprimaryTitle**
BhaagMilkhaBhaag
Fukrey
Carmencita
Leclownetseschiens
PauvrePierrot
Unbonbock
BlacksmithScene
ChineseOpiumDen
DilTohBacchaHaiJi
IndiasJungleHeroes
CorbettandCourtne
EdisonKinetoscopi
MissJerry
LeavingtheFactory
AkrobatischesPotp
TheArrivalofaTrain
ThePhotographical
TheWatererWatered
Autourdunecabine
Barquesortantduport
ItalienischerBaue
DasboxendeKnguruh
TheClownBarber
TheDerby1895
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。