The last part of the code:
lda = LdaModel(corpus=corpus,id2word=dictionary, num_topics=2)
print lda
Bash output:
INFO : adding document #0 to Dictionary(0 unique tokens)
INFO : built Dictionary(18 unique tokens) from 5 documents (total 20 corpus positions)
INFO : using serial LDA version on this node
INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 5 documents, updating model once every 5 documents
WARNING : too few updates, training might not converge; consider increasing the number of passes to improve accuracy …

So my input data has two fields/columns, id1 and id2, and my code is as follows:
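TextLine(args("input"))
  .read
  // split each tab-separated line into the two fields id1 and id2
  .mapTo('line -> ('id1, 'id2)) { line: String =>
    val fields = line.split("\t")
    (fields(0), fields(1))
  }
  // count the rows in each id2 group; the result goes into a 'size field
  .groupBy('id2) { _.size }
  .write(Tsv(args("output")))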
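The output has (I assume) two fields: id2 and size. I'm a bit stuck working out whether it is possible to keep the id1 values, grouped together with id2, and add them as another field?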
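To make the result I am after concrete, here is a minimal sketch in plain Python (not Scalding, and not a claim about how Scalding should do it). The function name `group_by_id2` and the sample rows are made up for illustration; the only assumption taken from the question is that each input line is `id1<TAB>id2`.

```python
from collections import defaultdict


def group_by_id2(lines):
    """Group tab-separated 'id1<TAB>id2' lines by id2, keeping the id1 values."""
    groups = defaultdict(list)
    for line in lines:
        # Only the first two tab-separated fields are used, as in the question.
        id1, id2 = line.rstrip("\n").split("\t")[:2]
        groups[id2].append(id1)
    # One row per id2: (id2, group size, list of id1 values in input order).
    return [(id2, len(id1s), id1s) for id2, id1s in groups.items()]


# Hypothetical sample input, for illustration only.
rows = group_by_id2(["a\tx", "b\tx", "c\ty"])
for id2, size, id1s in rows:
    print(id2, size, ",".join(id1s))
```

Each output row carries id2, the group size, and the id1 values that fell into that group, which is the extra field the plain size aggregation drops.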