Nik*_*iko 19 python nltk apache-spark pyspark apache-spark-ml
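I'm new to Spark SQL DataFrames and ML (PySpark). How can I create a custom tokenizer which, for example, removes stop words and uses some libraries from nltk? Can I extend the default one?
Thanks.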
zer*_*323 32
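Can I extend the default one?
Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators in pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly.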
import nltk
from pyspark import keyword_only  ## < 2.0 -> pyspark.ml.util.keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
# Available in PySpark >= 2.3.0 
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable  
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
class NLTKWordPunctTokenizer(
        Transformer, HasInputCol, HasOutputCol,
        # Credits https://stackoverflow.com/a/52467470
        # by https://stackoverflow.com/users/234944/benjamin-manns
        DefaultParamsReadable, DefaultParamsWritable):
    stopwords = Param(Params._dummy(), "stopwords", "stopwords",
                      typeConverter=TypeConverters.toListString)
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(NLTKWordPunctTokenizer, self).__init__()
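        # Params.__init__ copies the class-level `stopwords` Param onto the instance,
        # so it is not redefined here (doing so would drop its typeConverter).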
        self._setDefault(stopwords=[])
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, stopwords=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)
    def setStopwords(self, value):
        return self._set(stopwords=list(value))
    def getStopwords(self):
        return self.getOrDefault(self.stopwords)
    # Required in Spark >= 3.0
    def setInputCol(self, value):
        """
        Sets the value of :py:attr:`inputCol`.
        """
        return self._set(inputCol=value)
    # Required in Spark >= 3.0
    def setOutputCol(self, value):
        """
        Sets the value of :py:attr:`outputCol`.
        """
        return self._set(outputCol=value)
    def _transform(self, dataset):
        stopwords = set(self.getStopwords())
        def f(s):
            tokens = nltk.tokenize.wordpunct_tokenize(s)
            return [t for t in tokens if t.lower() not in stopwords]
        t = ArrayType(StringType())
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))
sentenceDataFrame = spark.createDataFrame([
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = NLTKWordPunctTokenizer(
    inputCol="sentence", outputCol="words",  
    stopwords=nltk.corpus.stopwords.words('english'))
tokenizer.transform(sentenceDataFrame).show()
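The per-row logic that _transform wraps in a udf can be checked as plain Python, without a Spark session. The following is a minimal sketch under two assumptions: the regex `\w+|[^\w\s]+` (the pattern NLTK's WordPunctTokenizer is built on) stands in for nltk.tokenize.wordpunct_tokenize so that NLTK is not needed, and the helper name `make_tokenize` and the three-word stopword list are hypothetical, chosen only for illustration.

```python
import re

# Assumption: this regex mirrors the pattern behind NLTK's WordPunctTokenizer;
# it stands in for nltk.tokenize.wordpunct_tokenize so the sketch needs no NLTK.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def make_tokenize(stopwords):
    """Build the per-row function that _transform would wrap in a udf."""
    stopword_set = set(stopwords)

    def tokenize(s):
        # Split into runs of word characters and runs of punctuation.
        tokens = WORDPUNCT.findall(s)
        # Drop tokens whose lowercase form is a stopword, as in the answer's f(s).
        return [t for t in tokens if t.lower() not in stopword_set]

    return tokenize


# Hypothetical, tiny stopword list for illustration only.
tokenize = make_tokenize(["i", "about", "are"])
print(tokenize("Hi I heard about Spark"))               # ['Hi', 'heard', 'Spark']
print(tokenize("Logistic regression models are neat"))  # ['Logistic', 'regression', 'models', 'neat']
```

Because tokens are lowercased before the lookup while the stopword list is used as given, the stopwords should be supplied in lowercase, which is the case for nltk.corpus.stopwords.words('english').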
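For a custom Python Estimator, see "How to Roll a Custom Estimator in PySpark mllib".
⚠ This answer depends on internal API and is compatible with Spark 2.0.3, 2.1.1, 2.2.0 or later (SPARK-19348). For code compatible with previous Spark versions, see revision 8.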