I am trying to create an LDA model on a JSON file.
Create a Spark session and read the JSON file:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder
.master("local")
.appName("my-spark-app")
.config("spark.some.config.option", "config-value")
.getOrCreate()
val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")
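The tokenizer further down reads from a column named text, so each line of sampleDoc.txt is assumed to be a standalone JSON object carrying a text field. A hypothetical sample line (not taken from the actual file):
{"id": 0, "text": "latent dirichlet allocation is a generative statistical model"}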
Displaying df should show the DataFrame:
display(df)
Tokenize the text:
import org.apache.spark.ml.feature.RegexTokenizer
// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+")
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("text")
.setOutputCol("tokens")
// Tokenize document
val tokenized_df = tokenizer.transform(df)
This should display tokenized_df:
display(tokenized_df)
Get the stopwords:
%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords
Optional: copy the stopwords file to the DBFS tmp folder:
%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords
Collect all the stopwords:
val stopwords = sc.textFile("/tmp/stopwords").collect()
Filter out the stopwords:
import org.apache.spark.ml.feature.StopWordsRemover …
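The snippet is cut off at this point. For reference, a minimal sketch of how the remaining steps toward the LDA model might continue; the column names filtered and features, the vocabulary size, and k = 10 topics are my own placeholder assumptions, not taken from the post above:
import org.apache.spark.ml.feature.{StopWordsRemover, CountVectorizer}
import org.apache.spark.ml.clustering.LDA
// Remove the collected stopwords from the token column
val remover = new StopWordsRemover()
  .setStopWords(stopwords) // the Array[String] collected above
  .setInputCol("tokens")
  .setOutputCol("filtered")
val filtered_df = remover.transform(tokenized_df)
// Convert the filtered tokens into term-count vectors for LDA
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features") // placeholder column name
  .setVocabSize(10000)      // placeholder vocabulary size
val countVectors = vectorizer.fit(filtered_df).transform(filtered_df)
// Fit the LDA model; k is the number of topics (placeholder value)
val lda = new LDA()
  .setK(10)
  .setMaxIter(50)
val ldaModel = lda.fit(countVectors)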