I am trying to perform text classification with PySpark, using the text features in my data. Below is my text pre-processing code, which fails while executing the user-defined function inside RegexTokenizer.
from pyspark.ml import Pipeline
from pyspark.ml.feature import (CountVectorizer, RegexTokenizer,
                                StopWordsRemover, StringIndexer)

tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")
add_stopwords = StopWordsRemover.loadDefaultStopWords("english")
remover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
label_stringIdx = StringIndexer(inputCol="label", outputCol="target")
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000, minDF=5)
# pipeline for text pre-processing
pipeline = Pipeline(stages=[tokenizer, remover, countVectors, label_stringIdx])
# fit the pipeline on the data
pipelineFit = pipeline.fit(dataset)
dataset = pipelineFit.transform(dataset)
dataset.show()
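For reference, the first two stages of the pipeline have a simple plain-Python equivalent: with `pattern="\\W"` the RegexTokenizer splits on non-word characters and lowercases by default, and the StopWordsRemover then filters the resulting tokens. The sketch below mirrors that behavior outside Spark (the helper names and the tiny stopword set are illustrative, not part of Spark's API), which can help verify what the stages should produce on a sample string:

```python
import re

# Tiny stand-in for StopWordsRemover.loadDefaultStopWords("english")
STOPWORDS = {"i", "am", "the", "a", "an", "to", "using"}

def tokenize(text):
    # Mirrors RegexTokenizer(pattern="\\W"): split on non-word characters;
    # RegexTokenizer also lowercases by default (toLowercase=True)
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stopwords(words):
    # Mirrors StopWordsRemover: drop tokens found in the stopword list
    return [w for w in words if w not in STOPWORDS]

words = tokenize("I am trying text classification using PySpark!")
filtered = remove_stopwords(words)
print(filtered)  # e.g. ['trying', 'text', 'classification', 'pyspark']
```

If this plain-Python version behaves as expected on your text, the failure is more likely in the Spark/Py4J environment than in the tokenization logic itself.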
The error is:
/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> …