Posted by raf*_*o92

Latent Dirichlet Allocation (LDA) in Spark: replicating a model

I want to save an LDA model from the pyspark ml.clustering package and apply it to the training and test data sets after saving. However, the results differ even though I set a seed. My code is as follows:

1) Import the packages

from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id

2) Prepare the data set

countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())    
corpus = result_tfidf.select("id", "features")

3) Train the LDA model

lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)  
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)

4) Replicate the model

#Prepare the data set …

lda apache-spark pyspark

5 votes · 1 answer · 251 views
