
MatchError when accessing a vector column in Spark 2.0

I am trying to build an LDA model from a JSON file.

Create the SparkSession and read the JSON file:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

Displaying df should show the DataFrame:

display(df)

Tokenize the text:

import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
                .setPattern("[\\W_]+")
                .setMinTokenLength(4) // Filter away tokens with length < 4
                .setInputCol("text")
                .setOutputCol("tokens")

// Tokenize document
val tokenized_df = tokenizer.transform(df)

This should display the tokenized_df:

display(tokenized_df)

Get the stopwords:

%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords

Optional: copy the stopwords to the tmp folder:

%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords

Collect all of the stopwords:

val stopwords = sc.textFile("/tmp/stopwords").collect()

Filter out the stopwords:

import org.apache.spark.ml.feature.StopWordsRemover …
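The elided step presumably applies `StopWordsRemover` to the tokenized DataFrame. A minimal sketch, assuming the standard `spark.ml` API, the `stopwords` array collected above, and the `tokens` column produced by the `RegexTokenizer` (column names are my assumption):

```scala
import org.apache.spark.ml.feature.StopWordsRemover

// Configure the remover with the custom stop-word list collected above.
// Input is the "tokens" array column; output is a new "filtered" column
// containing the same tokens minus the stop words.
val remover = new StopWordsRemover()
  .setStopWords(stopwords)   // custom list loaded from /tmp/stopwords
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Drop the stop words from every document's token array.
val filtered_df = remover.transform(tokenized_df)
```

By default `StopWordsRemover` uses its built-in English list; calling `setStopWords` replaces it with the downloaded list, which is why the stopwords were collected to the driver first.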

scala apache-spark apache-spark-sql apache-spark-ml apache-spark-mllib

3 votes · 1 answer · 3926 views