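I'm new to Spark SQL DataFrames and ML (PySpark). How can I create a custom tokenizer that, for example, removes stop words and uses some libraries from nltk? Can I extend the default one?
Thanks.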
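One way to approach this is to keep the per-row logic in a plain Python function first and only afterwards wrap it for Spark. The sketch below is an illustration under assumptions, not the built-in tokenizer: `tokenize_without_stop_words` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for a fuller list such as the one in nltk's stopwords corpus.

```python
import re

# Hypothetical, deliberately tiny stop-word set; swap in a fuller list
# (for example the one from nltk's stopwords corpus) for real use.
SAMPLE_STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})

_WORD_PATTERN = re.compile(r"\w+")


def tokenize_without_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    if text is None:
        # Null cells are common in DataFrames; return an empty token list.
        return []
    tokens = _WORD_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]


print(tokenize_without_stop_words("The quick brown fox is a friend of the dog"))
```

In PySpark a function like this can be wrapped with `pyspark.sql.functions.udf` using an `ArrayType(StringType())` return type and applied to a string column. It may also be worth checking whether the built-in `RegexTokenizer` followed by `StopWordsRemover`, which accepts a custom `stopWords` list, already covers the case without custom code.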
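I want to build a simple custom Estimator in PySpark MLlib. I have seen here that it is possible to write a custom Transformer, but I'm not sure how to do the same for an Estimator. I also don't understand what @keyword_only does or why I need so many setters and getters. Scikit-learn seems to have proper documentation for custom models (see here), but PySpark doesn't.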
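On the `@keyword_only` part: the decorator is small. It rejects positional arguments and stores the keyword arguments that were actually passed on the instance as `_input_kwargs`, so that `__init__` and `setParams` can forward only the explicitly supplied values. The following is a simplified plain-Python re-implementation for illustration, not PySpark's source, and `ThresholdHolder` is a hypothetical class.

```python
from functools import wraps


def keyword_only(func):
    """Simplified stand-in for pyspark.keyword_only (illustration only)."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError(f"Method {func.__name__} forces keyword arguments.")
        # Remember only what the caller passed explicitly.
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper


class ThresholdHolder:
    """Hypothetical class showing how the stored kwargs are consumed."""

    @keyword_only
    def __init__(self, threshold=3.0, inputCol="value"):
        # Copy of the explicitly supplied values, without the defaults.
        self.explicit = dict(self._input_kwargs)
        self.threshold = threshold
        self.inputCol = inputCol


holder = ThresholdHolder(threshold=2.5)
print(holder.explicit)  # only the explicitly passed keyword appears

try:
    ThresholdHolder(2.5)  # positional argument is rejected
except TypeError as error:
    print(error)
```

Knowing which values were set explicitly lets a stage tell user-supplied settings apart from defaults. That is also part of the reason for the getters and setters: parameter values live in a param map rather than in plain attributes, so they can be copied, overridden, and persisted in a uniform way.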
Pseudocode for an example model:
class NormalDeviation():
    def __init__(self, threshold=3):
        self.threshold = threshold
    def fit(self, x, y=None):
        # learn the statistics used later by predict()
        self.model = {'mean': x.mean(), 'std': x.std()}
        return self
    def predict(self, x):
        return (x - self.model['mean']) > self.threshold * self.model['std']
    def decision_function(self, x):  # does ml-lib support this?
        pass
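The main structural difference from the scikit-learn style is that Spark ML splits the work across two classes: an Estimator whose `fit` returns a separate Model, and the Model does the transforming. Below is a plain-Python sketch of that split with no Spark dependency, assuming list input instead of a DataFrame. In PySpark the two classes would subclass `pyspark.ml.Estimator` and `pyspark.ml.Model` and implement `_fit` and `_transform` respectively; the `NormalDeviationModel` name is an assumption made for this sketch.

```python
from statistics import mean, stdev


class NormalDeviationModel:
    """Fitted state: the role Spark ML gives to a Model (a Transformer)."""

    def __init__(self, mean_value, std_value, threshold):
        self.mean_value = mean_value
        self.std_value = std_value
        self.threshold = threshold

    def transform(self, values):
        # Flag values lying more than `threshold` standard deviations
        # above the mean (one-sided, as in the pseudocode).
        limit = self.threshold * self.std_value
        return [(v - self.mean_value) > limit for v in values]


class NormalDeviation:
    """Unfitted configuration: the role Spark ML gives to an Estimator."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, values):
        # Learn the statistics once and hand them to a separate model object.
        return NormalDeviationModel(mean(values), stdev(values), self.threshold)


data = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 50.0]
model = NormalDeviation(threshold=2.0).fit(data)
print(model.transform(data))
```

As far as I know, Spark ML models have no `decision_function` method; a score of that kind would normally be written by `transform` into an extra output column next to the prediction.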
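When I started using the pyspark.ml pipeline API, I found myself writing custom transformers for typical preprocessing tasks so that I could use them in a pipeline. Example: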
from pyspark.ml import Pipeline, Transformer

class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()

class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns
    - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)

class ColumnRenamer(CustomTransformer):
    """Transformer renames one column"""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        (colNameBefore, colNameAfter) = self.rename
        return data.withColumnRenamed(colNameBefore, …
When I implement this Python code in Azure Databricks:
class customTransformations(Transformer):
    <code>

custom_transformer = customTransformations()
....
pipeline = Pipeline(stages=[custom_transformer, assembler, scaler, rf])
pipeline_model = pipeline.fit(sample_data)
pipeline_model.save(<your path>)
When I try to save the pipeline, I get this:
AttributeError: 'customTransformations' object has no attribute '_to_java'
Is there a workaround?