I am running into a problem when loading data into Spark from a Cassandra table that contains a tuple type. My system specifications are as follows.
Code snippet:
val myDataFrame = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "test3", "keyspace" -> "pa", "cluster" -> "ClusterOne"))
  .load()
  .select($"id")
"test3"是在cassandra中的键空间"pa"下创建的表名
"test3"的表结构
CREATE TABLE pa.test3 (
    id int,
    m1 tuple<text, int>,
    PRIMARY KEY (id)
);
I get the following error:
java.util.NoSuchElementException: key not found: TupleType(Vector(TupleFieldDef(0,VarCharType), TupleFieldDef(1,IntType)))
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.sql.cassandra.DataTypeConverter$.catalystDataType(DataTypeConverter.scala:55)
at org.apache.spark.sql.cassandra.DataTypeConverter$.toStructField(DataTypeConverter.scala:61)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$$anonfun$schema$1$$anonfun$apply$1.apply(CassandraSourceRelation.scala:64)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$$anonfun$schema$1$$anonfun$apply$1.apply(CassandraSourceRelation.scala:64)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$$anonfun$schema$1.apply(CassandraSourceRelation.scala:64)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$$anonfun$schema$1.apply(CassandraSourceRelation.scala:64)
at scala.Option.getOrElse(Option.scala:120)
…
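The stack trace points at DataTypeConverter.catalystDataType, which suggests the connector version in use has no Catalyst (DataFrame) mapping for Cassandra's TupleType; the schema of every column is converted even though only id is selected. Newer spark-cassandra-connector releases reportedly map tuples in the SQL layer, so upgrading is worth trying first. Failing that, a minimal workaround sketch using the connector's lower-level RDD API, which bypasses the DataFrame type converter entirely (assumes the standard connector import and an available SparkContext sc):

import com.datastax.spark.connector._

// Sketch, not a confirmed fix: read through the RDD API and project only the
// columns the DataFrame converter can handle, skipping the tuple column m1.
val idRdd = sc.cassandraTable("pa", "test3")
  .select("id")                    // column projection happens server-side
  .map(row => row.getInt("id"))

idRdd.take(10).foreach(println)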
I am trying to plot the ROC curve and the Precision-Recall curve on a graph. The points are generated from Spark MLlib's BinaryClassificationMetrics, following the Spark guide at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html:

[(1.0,1.0), (0.0,0.4444444444444444)] - Precision
[(1.0,1.0), (0.0,1.0)] - Recall
[(1.0,1.0), (0.0,0.6153846153846153)] - F1 Measure
[(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)] - Precision-Recall curve
[(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)] - ROC curve
I am confused about the input to BinaryClassificationMetrics (MLlib). According to the Apache Spark 1.6.0 documentation, we need to pass predictionAndLabels of type RDD[(Double, Double)], converted from a DataFrame that carries the prediction probabilities as probability (Vector) and rawPrediction (Vector) columns.
I have created the RDD[(Double, Double)] from the prediction and label columns. After running BinaryClassificationMetrics over the NaiveBayesModel output I can retrieve the ROC, PR, and so on, but the values are too few to plot a curve from: the ROC contains only 4 points and the PR only 3.
Is this the right way to prepare predictionAndLabels, or do I need to use the rawPrediction column or the probability column instead of the prediction column?
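For reference, a minimal sketch of the usage I believe is intended: BinaryClassificationMetrics expects (score, label) pairs, where the score is a continuous confidence such as the positive-class probability, not the hard 0/1 prediction. With hard predictions there are only two distinct thresholds, which would explain why the ROC collapses to a handful of points. This assumes a hypothetical predictions DataFrame produced by NaiveBayesModel.transform with the default spark.ml column names:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector

// `predictions` is assumed to come from model.transform(testData) and to carry
// the default spark.ml columns "probability" (Vector) and "label" (Double).
val scoreAndLabels = predictions
  .select("probability", "label")
  .rdd
  .map { row =>
    val prob = row.getAs[Vector]("probability")
    (prob(1), row.getDouble(1)) // score = P(class = 1), paired with the true label
  }

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val rocPoints = metrics.roc().collect() // one (FPR, TPR) point per distinct score
val prPoints  = metrics.pr().collect()  // (recall, precision) points for the PR curve

With probabilities as scores, roc() yields one point per distinct score value rather than per hard class, which should give enough resolution to plot a usable curve.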
scala machine-learning apache-spark apache-spark-ml apache-spark-mllib