Posts by Som*_*Som

Unable to execute user-defined function RegexTokenizer in PySpark

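I am trying to do text classification in PySpark using the text features in my data. Below is my text pre-processing code, which fails to execute the user-defined function RegexTokenizer.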

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer, RegexTokenizer, StopWordsRemover, StringIndexer

    # split the text column on non-word characters
    tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")
    add_stopwords = StopWordsRemover.loadDefaultStopWords("english")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
    label_stringIdx = StringIndexer(inputCol="label", outputCol="target")
    countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000, minDF=5)
    # pipeline for text pre-processing
    pipeline = Pipeline(stages=[tokenizer, remover, countVectors, label_stringIdx])

    # fit the pipeline to the data, then transform it
    pipelineFit = pipeline.fit(dataset)
    dataset = pipelineFit.transform(dataset)
    dataset.show()
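
For readers following along, here is a rough pure-Python sketch of what the tokenizer stage above is expected to do to each row. It assumes the RegexTokenizer defaults that the code does not override (`gaps=True`, `toLowercase=True`, `minTokenLength=1`). The names `regex_tokenize` and `min_token_length` are hypothetical and are not part of PySpark. The traceback in the question is truncated, so the cause of the failure is not known; the `None` check only illustrates one possibility worth ruling out, namely null values in the `text` column.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Approximate RegexTokenizer with gaps=True: lowercase, split on the pattern, drop short tokens."""
    if text is None:
        # Assumption: a null input row is one possible reason the tokenizer stage fails.
        raise ValueError("text is None; drop or fill null rows before tokenizing")
    lowered = text.lower()
    # re.ASCII keeps \W close to Java's default (ASCII-only) meaning of \W.
    pieces = re.split(pattern, lowered, flags=re.ASCII)
    # Splitting on single non-word characters leaves empty strings; filter them out.
    return [piece for piece in pieces if len(piece) >= min_token_length]


tokens = regex_tokenize("Hello, World! It's 2020.")
print(tokens)
```

If null rows do turn out to be the problem, one option would be to drop them before fitting, for example with `dataset.na.drop(subset=["text"])`. That is a guess to verify against the full stack trace, not a confirmed fix.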

The error is:

/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> …

text apache-spark-sql pyspark apache-spark-ml apache-spark-mllib

1 vote
1 answer
2522 views