r apache-spark apache-spark-ml sparklyr
Consider the following example:
dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
Here I have the classic Naive Bayes example, where class identifies documents belonging to the China category. I can run a Naive Bayes classifier with sparklyr as follows:
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4))
which outputs:
NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e>
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes: int 2
  num_features: int 6
  pi: num [1:2] -1.179 -0.368
  theta: num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ...
  thresholds: num [1:2] 0.2 0.4
However, I have two main questions:

How can I evaluate the performance of this classifier in-sample? Where are the accuracy metrics?

More importantly, how can I use this trained model to predict new values, for example, in the following spark test dataframe?
Test data:
dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)
> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text
  <chr>
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff
Thanks!
How can I evaluate the performance of this classifier in-sample? Where are the accuracy metrics?

In general (there are some models which provide some form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely with the native Pipeline API.
Background:
Spark ML Pipelines are built primarily from two types of objects:

Transformers - objects which provide a transform method, mapping a DataFrame to an updated DataFrame.

You can transform with a Transformer using the ml_transform method.

Estimators - objects which provide a fit method, mapping a DataFrame to a Transformer. By convention, corresponding Estimator / Transformer pairs are called Foo / FooModel.

You can fit an Estimator in sparklyr using ml_fit.

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which can be used to compute different metrics on the transformed data, based on columns generated by the model (usually a probability column or a raw prediction).

You can apply an Evaluator using the ml_evaluate method.

Related components also include cross-validators and train-validation splits, which can be used for parameter tuning.
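To make the Estimator / Transformer distinction concrete, here is a minimal sketch (assuming an active connection sc and the dtrain_spark table from the question) using the CountVectorizer / CountVectorizerModel pair:

```r
# An Estimator: created against the connection, not yet fitted to data
cv_estimator <- ft_count_vectorizer(sc, input_col = "tokens",
                                    output_col = "features")

# Tokenize eagerly so we have a DataFrame with a "tokens" column
tokenized <- ft_tokenizer(dtrain_spark, input_col = "text",
                          output_col = "tokens")

# fit: Estimator -> Transformer (a CountVectorizerModel)
cv_model <- ml_fit(cv_estimator, tokenized)

# transform: DataFrame -> updated DataFrame with a "features" column
ml_transform(cv_model, tokenized)
```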
Example:
sparklyr PipelineStages can be evaluated eagerly (as in your own code), by passing data directly, or lazily, by passing a spark_connection instance and calling the methods described above (ml_fit, ml_transform, etc.).

This means you can define a Pipeline as follows:
pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc, label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4),
                 uid = "nb")
)
Fit the PipelineModel:
model <- ml_fit(pipeline, dtrain_spark)
Transform it, and apply one of the Evaluators:
ml_transform(model, dtrain_spark) %>%
  ml_binary_classification_evaluator(
    label_col = "class", raw_prediction_col = "rpcol",
    metric_name = "areaUnderROC")
[1] 1
or
evaluator <- ml_multiclass_classification_evaluator(
  sc,
  label_col = "class", prediction_col = "pcol",
  metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
[1] 1
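Since the question asks about accuracy specifically: the same multiclass evaluator also supports metric_name = "accuracy" (as well as "weightedPrecision" and "weightedRecall"). A sketch, assuming the fitted model and column names from above:

```r
# In-sample accuracy on the training data; swap in any supported metric name
ml_transform(model, dtrain_spark) %>%
  ml_multiclass_classification_evaluator(
    label_col = "class", prediction_col = "pcol",
    metric_name = "accuracy")
```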
More importantly, how can I use this trained model to predict new values, for example, in the following spark test dataframe?

Use ml_transform or ml_predict (the latter is a wrapper which applies further transformations to the output):
ml_transform(model, dtest_spark)
# Source: table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
text tokens myvocab rpcol prcol pcol
<chr> <list> <list> <list> <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl … 0
2 random stuff <list [2]> <dbl [6]> <dbl [… <dbl … 1
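For comparison, the ml_predict wrapper mentioned above can be called the same way; for classification models it additionally post-processes the output (for example, exposing per-class probabilities as separate columns), which is often more convenient for downstream dplyr work:

```r
# Same model, same test data; extra convenience columns in the result
ml_predict(model, dtest_spark)
```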
Cross validation:

There is not enough data in the example, but you can cross-validate and tune hyperparameters like this:
# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),  # Note that the name matches the UID
  evaluator = evaluator)
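Once you have enough data for the folds, the fitted cross-validator can be inspected; a sketch, assuming the pipeline and evaluator defined above:

```r
# dontrun
cv_model <- ml_cross_validator(
  dtrain_spark,
  pipeline,
  list(nb = list(smoothing = list(0.8, 1.0))),
  evaluator = evaluator)

# Average metric for each parameter combination tried
ml_validation_metrics(cv_model)

# The pipeline refitted on the full data with the best parameters
cv_model$best_model
```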
Notes:

If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using the standard (default) column names:

label for the dependent variable.
features for the assembled independent variables.
rawPrediction, prediction, probability for the raw prediction, prediction and probability columns respectively.
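With the defaults in place, the stages need almost no column arguments; a hypothetical sketch (it assumes you first rename class to the default label):

```r
dtrain_default <- dtrain_spark %>% dplyr::rename(label = class)

pipeline_default <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  ml_naive_bayes(sc)  # picks up "features" and "label" by default
)

model_default <- ml_fit(pipeline_default, dtrain_default)
```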