How to train an ML model in sparklyr and predict new values on another dataframe?

ℕʘʘ*_*ḆḽḘ · 6 · tags: r, apache-spark, apache-spark-ml, sparklyr

Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here we have the classic Naive Bayes example, where class identifies documents belonging to the China category.

I can run a Naive Bayes classifier in sparklyr by doing the following:

dtrain_spark %>% 
  ft_tokenizer(input_col = "text", output_col = "tokens") %>% 
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>% 
  select(myvocab, class) %>%  
  ml_naive_bayes(label_col = "class", 
                 features_col = "myvocab", 
                 prediction_col = "pcol",
                 probability_col = "prcol", 
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial", 
                 smoothing = 0.6, 
                 thresholds = c(0.2, 0.4))

which outputs:

NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e> 
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes:  int 2 
  num_features:  int 6 
  pi:  num [1:2] -1.179 -0.368 
  theta:  num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ... 
  thresholds:  num [1:2] 0.2 0.4 

However, I have two main questions:

  1. How do I evaluate the in-sample performance of this classifier? Where are the accuracy metrics?

  2. Even more importantly, how can I use this trained model to predict new values, for example, in the following spark test dataframe?

Test data:

dtest <- data_frame(text = c("Chinese Chinese Chinese Tokyo Japan",
                             "random stuff"))

dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)

> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text                               
  <chr>                              
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff 

Thanks!

use*_*411 8

How do I evaluate the in-sample performance of this classifier? Where are the accuracy metrics?

In general (there are some models which provide some form of summary), evaluation on the training dataset is a separate step in Apache Spark. This fits nicely into the native Pipeline API.

Background:

Spark ML Pipelines are primarily built from two types of objects:

  • Transformers - objects which provide a transform method, mapping a DataFrame to an updated DataFrame.

    You can apply a Transformer with the ml_transform method.

  • Estimators - objects which provide a fit method, mapping a DataFrame to a Transformer. By convention, the corresponding Estimator / Transformer pair is called Foo / FooModel.

    You can fit an Estimator in sparklyr using ml_fit.
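The Transformer/Estimator distinction can be sketched with the stages from the question (a minimal sketch, assuming an active spark_connection sc and the dtrain_spark table defined above):

```r
library(sparklyr)

# ft_tokenizer(sc, ...) returns a Transformer:
# ml_transform applies it directly, no fitting required.
tokenizer <- ft_tokenizer(sc, input_col = "text", output_col = "tokens")
tokenized <- ml_transform(tokenizer, dtrain_spark)

# ft_count_vectorizer(sc, ...) returns an Estimator (CountVectorizer):
# ml_fit learns the vocabulary and yields a CountVectorizerModel,
# which is itself a Transformer.
cv <- ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab")
cv_model <- ml_fit(cv, tokenized)

ml_transform(cv_model, tokenized)
```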

Additionally, ML Pipelines can be combined with Evaluators (see the ml_*_evaluator and ml_*_eval methods), which can be used to compute different metrics on the transformed data, based on columns generated by the model (usually a probability column or raw prediction).

You can apply an Evaluator using the ml_evaluate method.

Related components include cross validators and train-validation splits, which can be used for parameter tuning.

Example:

sparklyr PipelineStages can be evaluated eagerly, by passing data directly (as in your own code), or lazily, by passing a spark_connection instance and calling the aforementioned methods (ml_fit, ml_transform, etc.).

This means you can define a Pipeline as follows:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_naive_bayes(sc, label_col = "class", 
                 features_col = "myvocab", 
                 prediction_col = "pcol",
                 probability_col = "prcol", 
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial", 
                 smoothing = 0.6, 
                 thresholds = c(0.2, 0.4),
                 uid = "nb")
)

Fit the PipelineModel:

model <- ml_fit(pipeline, dtrain_spark)

Transform it, and apply one of the Evaluators:

ml_transform(model, dtrain_spark) %>% 
  ml_binary_classification_evaluator(
    label_col="class", raw_prediction_col= "rpcol", 
    metric_name = "areaUnderROC")
[1] 1

or

evaluator <- ml_multiclass_classification_evaluator(
    sc,
    label_col="class", prediction_col= "pcol", 
    metric_name = "f1")

ml_evaluate(evaluator, ml_transform(model, dtrain_spark))
[1] 1

Even more importantly, how can I use this trained model to predict new values, for example, in the following spark test dataframe?

Use ml_transform or ml_predict (the latter is a wrapper which applies further transformations to the output):

ml_transform(model, dtest_spark)
# Source:   table<sparklyr_tmp_cc651477ec7> [?? x 6]
# Database: spark_connection
  text                                tokens     myvocab   rpcol   prcol   pcol
  <chr>                               <list>     <list>    <list>  <list> <dbl>
1 Chinese Chinese Chinese Tokyo Japan <list [5]> <dbl [6]> <dbl [… <dbl …     0
2 random stuff                        <list [2]> <dbl [6]> <dbl [… <dbl …     1
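ml_predict can be used the same way; a short sketch, assuming the fitted model and dtest_spark from above:

```r
# ml_predict wraps ml_transform and post-processes the output columns.
ml_predict(model, dtest_spark)

# To pull just the text and the predicted class back into R:
ml_predict(model, dtest_spark) %>%
  select(text, pcol) %>%
  collect()
```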

Cross validation:

There is not enough data in the example, but you can cross validate and tune hyperparameters as shown below:

# dontrun
ml_cross_validator(
  dtrain_spark,
  pipeline, 
  list(nb=list(smoothing=list(0.8, 1.0))),  # Note that name matches UID
  evaluator=evaluator)
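Going one step further, the cross validator is itself an Estimator, so it can be fitted and its per-parameter metrics inspected. A sketch (same toy-data caveat; assumes sc, pipeline, and evaluator from above, and sparklyr's ml_validation_metrics helper):

```r
# Build the cross validator lazily against the connection...
cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = list(nb = list(smoothing = list(0.8, 1.0))),
  evaluator = evaluator,
  num_folds = 2
)

# ...fit it like any other Estimator, then inspect one metric value
# per hyperparameter combination.
cv_model <- ml_fit(cv, dtrain_spark)
ml_validation_metrics(cv_model)
```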

Notes:

  • Keep in mind that Spark's multinomial Naive Bayes implementation considers only binary features (0 or not 0).
  • If you use Pipelines with Vector columns (not formula-based calls), I strongly recommend using the standardized (default) column names:

    • label for the dependent variable.
    • features for the assembled independent variables.
    • rawPrediction, prediction, probability for the raw prediction, prediction, and probability columns, respectively.
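With the default names, downstream Evaluators need no column-name overrides at all. A sketch, assuming sc and the dtrain_spark table from the question (renaming class to label up front so every stage can rely on the defaults):

```r
dtrain_default <- dtrain_spark %>%
  rename(label = class)

# ml_naive_bayes defaults to label/features in and
# prediction/rawPrediction/probability out.
pipeline_default <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "features"),
  ml_naive_bayes(sc, model_type = "multinomial")
)

model_default <- ml_fit(pipeline_default, dtrain_default)

# The evaluator's defaults line up with the model's outputs:
ml_transform(model_default, dtrain_default) %>%
  ml_binary_classification_evaluator(metric_name = "areaUnderROC")
```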