Posted by Yer*_*yev

How does Spark HashingTF work?

I am new to Spark 2. I tried the Spark tf-idf example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.getOrCreate()

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)

It outputs:

Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
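What HashingTF actually produces are raw term counts, not normalized frequencies: each token is hashed into one of `numFeatures` buckets and that bucket's count is incremented. With only 32 buckets, several of the five words collide in bucket 1, which is why the output shows `1: 3.0`. A minimal pure-Python sketch of this idea (Spark itself uses MurmurHash3, not Python's built-in `hash`, so the bucket indices will differ):

```python
def hashing_tf(words, num_features=32):
    """Hash each word into a bucket and count occurrences (raw counts, unnormalized)."""
    vec = {}
    for w in words:
        idx = hash(w) % num_features  # stand-in for Spark's MurmurHash3
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

print(hashing_tf(["hi", "i", "heard", "about", "spark"]))
```

The sum of the values always equals the number of tokens, exactly as in the `SparseVector` above (3.0 + 1.0 + 1.0 = 5 words).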

I expected rawFeatures to contain term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because the term frequency is:

tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)
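The normalized frequency described by that formula would be computed as below (a sketch of the expectation, not what HashingTF does; Spark leaves the counts unnormalized and applies IDF weighting as a separate step):

```python
def term_frequency(words):
    """Relative term frequency: occurrences of each word / total words."""
    n = len(words)
    return {w: words.count(w) / n for w in words}

print(term_frequency(["hi", "i", "heard", "about", "spark"]))
# each of the five distinct words appears once out of five tokens -> 0.2
```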

tf-idf apache-spark pyspark apache-spark-ml apache-spark-mllib

5 votes · 1 answer · 1124 views