Hash function in Spark

Question

Via*_*mov 6 hash scala apache-spark apache-spark-sql

I am trying to add a column to a DataFrame that contains the hash of another column.

I found the following documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
and tried this:

import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"..." column syntax (already imported in spark-shell)
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

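But which hash function does hash() actually use? Is it murmur, sha, md5, or something else?

The values I get in this column are integers, so the range is presumably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for this?
Can I use a custom hash function?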

Answer 1

Gal*_*ses 7

If you want a long hash, Spark 3 has the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64

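You may want only non-negative numbers. In that case you can take hash, cast it to a long, and shift it upward. Adding Int.MaxValue alone maps Int.MinValue to -1, so add Int.MaxValue + 1 (computed as a Long) to land in [0 ... 2^32 - 1]:

import org.apache.spark.sql.types.LongType
// Shift the 32-bit hash by 2^31 so the smallest possible value becomes 0
df.withColumn("hashID", hash($"value").cast(LongType) + (Int.MaxValue.toLong + 1L)).show()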
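
To make the long and string options concrete, here is a minimal sketch, assuming Spark 3.0 or later (for xxhash64) and a local session. The object name HashColumns, the helper withHashes, and the column name "value" are made up for illustration; hash, xxhash64, md5 and sha2 are the built-in functions from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hash, md5, sha2, xxhash64}

object HashColumns {
  // Adds several hash columns derived from a string column named "value" (hypothetical name).
  def withHashes(df: DataFrame): DataFrame =
    df.withColumn("hash32", hash(col("value")))           // IntegerType
      .withColumn("hash64", xxhash64(col("value")))       // LongType, Spark 3.0+
      .withColumn("md5_hex", md5(col("value")))           // 32-character hex string
      .withColumn("sha256_hex", sha2(col("value"), 256))  // 64-character hex string

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hash-columns")
      .getOrCreate()
    import spark.implicits._

    val hashed = withHashes(Seq("a", "b", "c").toDF("value"))
    hashed.printSchema()
    hashed.show(false)  // false = do not truncate the long hex strings

    spark.stop()
  }
}
```

The digest functions return hex strings rather than numbers, which is probably what is wanted when a string hash is asked for; xxhash64 is the closest built-in when a 64-bit numeric value is needed.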

@shasu, sorry, but what you are asking is unrelated to the question on this page. Please post a new Stack Overflow question. (2 upvotes)

Answer 2

Wil*_*ill 5

Based on the source code, it is Murmur (the function wraps Murmur3Hash):

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
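
Since hash() is wired directly to Murmur3Hash, using a different algorithm means calling another function or wrapping your own in a UDF. The following is a rough sketch of the UDF route, not something taken from Spark itself: the object name CustomHash, the choice of FNV-1a (64-bit), and the column name "value" are all assumptions made for illustration. Only udf and col are real Spark APIs here.

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CustomHash {
  // FNV-1a, 64-bit, over the UTF-8 bytes of the string.
  // Chosen only as a small example of a custom hash; swap in any function String => Long.
  def fnv1a64(s: String): Long = {
    var h = 0xcbf29ce484222325L            // FNV offset basis
    for (b <- s.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff).toLong               // mix in one unsigned byte
      h *= 0x100000001b3L                  // multiply by the FNV prime (wraps mod 2^64)
    }
    h
  }

  // Returning Option makes a null input produce a null output instead of failing.
  val fnv1a64Udf = udf((s: String) => Option(s).map(fnv1a64))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("custom-hash")
      .getOrCreate()
    import spark.implicits._

    Seq("a", "b", null).toDF("value")
      .withColumn("custom_hash", fnv1a64Udf(col("value")))  // LongType column
      .show()

    spark.stop()
  }
}
```

A UDF is opaque to the optimizer and usually slower than the built-in expressions, so it is probably worth reaching for only when none of the built-in hash functions fit.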

Archived:	7 years, 2 months ago
Views:	3526
Last activity:	6 years, 8 months ago