How does Spark's HashingTF work

Yer*_*yev 5 tf-idf apache-spark pyspark apache-spark-ml apache-spark-mllib

I'm new to Spark 2. I tried this Spark TF-IDF example:

from pyspark.ml.feature import Tokenizer, HashingTF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

# Split the sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash each word into one of 32 buckets and count occurrences per bucket
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)

It outputs:

Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))

I expected rawFeatures to contain term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, since every word appears exactly once in the document. If we assume the output rawFeatures dictionary holds word indices as keys and the number of times each word occurs in the document as values, why is key 1 equal to 3.0? No word appears three times in the document. This confuses me. What am I missing?
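To make that expectation concrete, here is a minimal plain-Python sketch of the formula above:

from collections import Counter

words = ["hi", "i", "heard", "about", "spark"]

# relative term frequency: occurrences / total number of words
counts = Counter(words)
tf = {i: counts[w] / len(words) for i, w in enumerate(words)}
print(tf)  # {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}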

zer*_*323 4

TL;DR: This is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket and, with a very low number of buckets as used here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
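For illustration, a minimal PySpark sketch of the CountVectorizer alternative (reusing wordsData from the question) could look like this:

from pyspark.ml.feature import CountVectorizer

# CountVectorizer builds an explicit vocabulary, so every distinct
# word gets its own index and collisions cannot occur
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
cvModel = cv.fit(wordsData)
cvModel.transform(wordsData).show(truncate=False)
# every word occurs exactly once, so each vocabulary index maps to 1.0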

In detail: HashingTF uses MurmurHash by default. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source, you will see that the implementation is equivalent to:

import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 => 
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))
Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
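The same hashes can be reproduced from Python. This is only a sketch, under the assumption that the third-party mmh3 package implements the same MurmurHash3 x86_32 variant with seed 42 that Spark uses:

import mmh3  # third-party package, not part of Spark: pip install mmh3

words = ["hi", "i", "heard", "about", "spark"]

# Spark hashes the UTF-8 bytes of each term with seed 42
hashes = [mmh3.hash(w.encode("utf-8"), 42) for w in words]
print(hashes)
# expected to match the Scala output above:
# [-537608040, -1265344671, 266149357, 146891777, 2101843105]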

When you take the non-negative modulo of these values, you get [24, 1, 13, 1, 1]:

// nonNegativeMod is the helper HashingTF uses internally
// (defined in org.apache.spark.util.Utils)
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))
List[Int] = List(24, 1, 13, 1, 1)
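The same bucketing, plus the final per-bucket counting that produces the SparseVector values, can be sketched in plain Python:

from collections import Counter

hashes = [-537608040, -1265344671, 266149357, 146891777, 2101843105]
numFeatures = 32

# Python's % already returns a non-negative result for a positive
# modulus, which matches Spark's nonNegativeMod
buckets = [h % numFeatures for h in hashes]
print(buckets)                 # [24, 1, 13, 1, 1]
print(dict(Counter(buckets)))  # {24: 1, 1: 3, 13: 1}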

Three words in the list (i, about and spark) hash to the same bucket. Each occurs once, hence the result you see.
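As a sanity check, newer Spark releases (3.0+) expose an indexOf method on pyspark.ml.feature.HashingTF, so you can inspect the bucket of each word directly:

from pyspark.ml.feature import HashingTF

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
for w in ["hi", "i", "heard", "about", "spark"]:
    print(w, hashingTF.indexOf(w))
# 'i', 'about' and 'spark' should all land in bucket 1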
