I'm trying to build a very simple standalone Scala application using MLlib, but I get the following error when I try to build it:
Object Mllib is not a member of package org.apache.spark
I then realized that I have to add MLlib as a dependency, as follows:
version := "1"
scalaVersion :="2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.0",
"org.apache.spark" %% "spark-mllib" % "1.1.0"
)
However, I then get this error:
unresolved dependency spark-core_2.10.4;1.1.1 : not found
So I had to change it to
"org.apache.spark" % "spark-core_2.10" % "1.1.1",
but there is still an error saying:
unresolved dependency spark-mllib;1.1.1 : not found
Does anyone know how to add the MLlib dependency in the .sbt file?
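For reference, a build.sbt along these lines should resolve, sketched under the assumption of Scala 2.10 and Spark 1.1.1: use %% (which appends the Scala binary version _2.10, not the full 2.10.4) for both artifacts, or spell out the _2.10 suffix explicitly with %, and keep spark-core and spark-mllib on the same version:

// minimal build.sbt sketch -- %% appends the Scala binary version (_2.10) for you
version := "1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1",
  "org.apache.spark" %% "spark-mllib" % "1.1.1"
)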
I'm trying to use the new TF-IDF algorithm that ships with Spark 1.1.0. I'm writing my MLlib job in Java, but I can't figure out how to make the TF-IDF implementation work. For some reason, IDFModel only accepts a JavaRDD as input to the transform method, not a simple Vector. How can I use the given classes to build TF-IDF vectors for my LabeledPoints?
Note: the document lines are in the format [label; text]
My code so far:
// 1.) Load the documents
JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new");
// 2.) Hash all documents
final HashingTF tf = new HashingTF();
JavaRDD<Tuple2<Double, Vector>> tupleData = data.map(new Function<String, Tuple2<Double, Vector>>() {
    @Override
    public Tuple2<Double, Vector> call(String v1) throws Exception {
        String[] data = v1.split(";");
        List<String> myList = Arrays.asList(data[1].split(" "));
        return new Tuple2<Double, Vector>(Double.parseDouble(data[0]), tf.transform(myList));
    }
});
tupleData.cache();
// 3.) Create a flat RDD with all vectors
JavaRDD<Vector> hashedData = tupleData.map(new Function<Tuple2<Double,Vector>, Vector>() …
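For what it's worth, the flow that seems intended in 1.1.0 is: build the flat RDD of term-frequency vectors, fit the IDF on it, transform that same RDD, and zip the result back with the labels. A minimal Scala sketch of that pattern (the Java version follows the same steps; documents is an assumed RDD of (label, tokens) pairs, not a name from the code above):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// documents: an assumed RDD of (label, tokens) pairs
def toTfIdf(documents: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
  val tf = new HashingTF()
  val labeledTf = documents.map { case (label, tokens) => (label, tf.transform(tokens)) }.cache()

  val termFreqs: RDD[Vector] = labeledTf.map(_._2)  // flat RDD[Vector], the shape IDF.fit expects
  val idfModel = new IDF().fit(termFreqs)           // document frequencies over the whole corpus
  val tfidf = idfModel.transform(termFreqs)         // RDD[Vector] of TF-IDF vectors

  // both sides derive from the cached labeledTf via map-style ops, so zip keeps them aligned
  labeledTf.map(_._1).zip(tfidf).map { case (label, vec) => LabeledPoint(label, vec) }
}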
I'm trying to extract the class probabilities of a random forest that I trained using PySpark. However, I don't see an example of this anywhere in the documentation, nor is it a method of RandomForestModel.
How can I extract class probabilities from a RandomForestModel classifier in PySpark?
Here is the sample code provided in the documentation, which only gives the final class (and not the probability):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, …
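As far as I can tell, the MLlib Python API does not expose the individual trees of the forest, but the Scala API does (RandomForestModel.trees), so the class votes can be averaged by hand. A rough Scala sketch for a binary forest, assuming model and features are already in scope:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// fraction of trees voting for class 1.0 -- a crude estimate of P(class = 1)
def classOneProbability(model: RandomForestModel, features: Vector): Double = {
  val votes = model.trees.map(_.predict(features))  // each tree predicts 0.0 or 1.0
  votes.sum / votes.length
}

The newer DataFrame-based spark.ml RandomForestClassifier exposes a probability column directly, which may be the simpler route if switching APIs is an option.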
It throws an ExecutorLostFailure every time I try to run it on this folder.
Hi, I'm a beginner with Spark. I'm trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB of memory and 3.2 GB of disk. I'm launching the Spark task from one of the slave nodes (out of the 8), so only about 4.8 GB is available on each node with a storage fraction of 0.7, and I'm using Mesos as the cluster manager. I'm using this configuration:
spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true
I'm trying to run the Spark MLlib Naive Bayes algorithm on a 14 GB folder of data. (There is no problem when I run the job on a 6 GB folder.) I read this folder from Google Storage as an RDD, passing 32 as the partition parameter (I have also tried increasing the number of partitions). Then I create feature vectors with TF and predict based on them. But when I try to run it on this folder, it throws an ExecutorLostFailure every time. I have tried different configurations, but nothing helps. I may be missing something very basic, but I can't figure it out. Any help or suggestion would be very valuable.
The log is:
15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: …
I am relatively new to Spark and Scala.
I start from the following DataFrame (a single column consisting of dense vectors of doubles):
scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]
scala> scaledDataOnly_pruned.show(5)
+--------------------+
| features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+
Converting it directly to an RDD yields an instance of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:
scala> val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd
scaledDataOnly_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[32] at rdd at <console>:66
Does anyone know how to convert this DataFrame to an instance of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]? So far, my various attempts have been unsuccessful.
Thanks in advance for any pointers!
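One approach that usually works, sketched under the assumption that the features column really holds org.apache.spark.mllib.linalg.Vector values (which is what the pre-2.0 feature transformers produce), is to pull the vector out of each Row with getAs:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// map each Row to the value of its "features" column
val scaledVectors: RDD[Vector] =
  scaledDataOnly_pruned.rdd.map(row => row.getAs[Vector]("features"))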
To build a multiclass NaiveBayes classifier, I'm using a CrossValidator to select the best parameters for my pipeline:
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(10)

val cvModel = cv.fit(trainingSet)
The pipeline contains the usual transformers and estimators, in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes.
Is it possible to access the metrics computed for the best model?
Ideally, I would like to access the metrics of all the models, to see how changing the parameters changes the quality of the classification. But for the moment, the best model is good enough.
FYI, I'm using Spark 1.6.0.
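For what it's worth, CrossValidatorModel exposes avgMetrics: one value per entry of the parameter grid, averaged over the folds (not per-fold values, and not a re-evaluation of the refit best model). Zipping it with the paramGrid above gives a rough picture of how each parameter combination scored; a small sketch:

// avgMetrics(i) is the cross-validated metric for paramGrid(i)
paramGrid.zip(cvModel.avgMetrics).foreach { case (params, metric) =>
  println(s"$params -> $metric")
}

// for evaluators where larger is better (e.g. the default F1), this is the winning combination
val (bestParams, bestMetric) = paramGrid.zip(cvModel.avgMetrics).maxBy(_._2)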
I have an RDD of DenseVector like this:
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert it into a DataFrame. I tried it like this:
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, …
I'm predicting ratings in between batch trainings of the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?
! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user = int(p[0]),
        product = int(p[1]),
        rating = float(p[2]),
    )).cache()
from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)
Then I extract the product features...
import json
import numpy as np
pf = model.productFeatures()
pf_vals = pf.sortByKey().values().collect()
pf_keys …
Is there a way to train an LDA model in an online-learning fashion, i.e. load a previously trained model and update it with new documents?
According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), the p-values are only available for the "normal" solver.
This value is only available when using the "normal" solver.
What exactly is the "normal" solver?
I'm doing this:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
.
.
.
val (trainingData, testData): (DataFrame, DataFrame) =
  com.acme.pta.accuracy.Util.splitData(output, testProportion)
.
.
.
val lr =
  new org.apache.spark.ml.regression.LinearRegression()
    .setSolver("normal").setMaxIter(maxIter)

val pipeline = new Pipeline()
  .setStages(Array(lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
  .addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds) // Use 3+ …
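Since the pipeline above fixes setSolver("normal"), i.e. the solver that works through the normal equations (weighted least squares) rather than the iterative "l-bfgs" path, the fitted model's training summary should carry the p-values. A rough sketch of pulling them out of the cross-validated result, assuming the LinearRegression is the last pipeline stage and that the CrossValidator is eventually fit as, say, val cvModel = cv.fit(trainingData):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.LinearRegressionModel

// assuming: val cvModel = cv.fit(trainingData)
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel = bestPipeline.stages.last.asInstanceOf[LinearRegressionModel]

// pValues (along with coefficientStandardErrors and tValues) are only populated
// when the model was fit with the "normal" solver
println(lrModel.summary.pValues.mkString(", "))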