How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?

Question

Wil*_*man 5 python apache-spark apache-spark-sql pyspark

We are using the PySpark library to interface with Spark 1.3.1.

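We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to join the two dataframes and return a resulting dataframe of {document_id, keyword} pairs, using the criterion that keywords_df.keyword appears in the documents_df.document_text string.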

In PostgreSQL, for example, we could achieve this using an ON clause of the form:

documents_df.document_text ilike '%' || keywords_df.keyword || '%'

In PySpark, however, I cannot get any form of join syntax to work. Has anybody achieved something like this before?

Kind regards,

Will

Answer 1

zer*_*323 16

It is possible in two different ways, but generally speaking it is not recommended. First, let's create some dummy data:

from pyspark.sql import Row

document_row = Row("document_id", "document_text")
keyword_row = Row("keyword") 

documents_df = sc.parallelize([
    document_row(1, "apache spark is the best"),
    document_row(2, "erlang rocks"),
    document_row(3, "but haskell is better")
]).toDF()

keywords_df = sc.parallelize([
    keyword_row("erlang"),
    keyword_row("haskell"),
    keyword_row("spark")
]).toDF()

Hive UDF

documents_df.registerTempTable("documents")
keywords_df.registerTempTable("keywords")

query = """SELECT document_id, keyword
    FROM documents JOIN keywords
    ON document_text LIKE CONCAT('%', keyword, '%')"""

like_with_hive_udf = sqlContext.sql(query)
like_with_hive_udf.show()

## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## |          1|  spark|
## |          2| erlang|
## |          3|haskell|
## +-----------+-------+
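
One caveat with the LIKE approach: the keyword is spliced into the pattern, so any `%` or `_` inside a keyword acts as a wildcard rather than a literal character, and the match is case-sensitive, unlike the ILIKE from the question. The following is a plain-Python sketch (not Spark code) that models standard LIKE semantics to show the effect. The helper names `like_to_regex`, `sql_like` and `like_contains` are hypothetical and exist only for this illustration:

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a regular expression.

    '%' matches any run of characters, '_' matches exactly one character,
    everything else is matched literally. Escape characters are ignored
    in this simplified model.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def sql_like(text, pattern):
    # LIKE has to match the whole string, hence fullmatch.
    return like_to_regex(pattern).fullmatch(text) is not None

def like_contains(document_text, keyword):
    # Mirrors: document_text LIKE CONCAT('%', keyword, '%')
    return sql_like(document_text, "%" + keyword + "%")

print(like_contains("apache spark is the best", "spark"))  # True
print(like_contains("data frame", "data_frame"))           # True: '_' matched the space
print(like_contains("Apache Spark", "spark"))              # False: case-sensitive
```

If the keywords may contain `%` or `_`, or if case should be ignored, the query would need to escape or lowercase both sides accordingly.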

Python UDF

from pyspark.sql.functions import udf, col 
from pyspark.sql.types import BooleanType

# Or you can replace `in` with a regular expression
contains = udf(lambda s, q: q in s, BooleanType())

like_with_python_udf = (documents_df.join(keywords_df)
    .where(contains(col("document_text"), col("keyword")))
    .select(col("document_id"), col("keyword")))
like_with_python_udf.show()

## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## |          1|  spark|
## |          2| erlang|
## |          3|haskell|
## +-----------+-------+
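
The lambda above is case-sensitive and raises a TypeError if either column holds a missing value. A minimal sketch of a predicate that is closer to the ILIKE from the question, assuming missing values should simply not match, could look like this. The names `contains_ci` and `contains_regex` are made up for the example:

```python
import re

def contains_ci(text, keyword):
    """Case-insensitive substring test, closer to PostgreSQL's ILIKE.

    Returns False for missing values instead of raising TypeError.
    """
    if text is None or keyword is None:
        return False
    return keyword.lower() in text.lower()

def contains_regex(text, keyword):
    """Same test with a regular expression. The keyword is escaped so that
    characters such as '.' or '+' are matched literally."""
    if text is None or keyword is None:
        return False
    return re.search(re.escape(keyword), text, re.IGNORECASE) is not None

print(contains_ci("Apache Spark is the best", "spark"))  # True
print(contains_ci(None, "spark"))                        # False
print(contains_regex("c++ rocks", "C++"))                # True
```

Either function can be wrapped the same way as the lambda, for example `udf(contains_ci, BooleanType())`. The Cartesian product discussed next is unaffected by this change.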

Why is it not recommended? Because in both cases it requires a Cartesian product:

like_with_hive_udf.explain()

## TungstenProject [document_id#2L,keyword#4]
##  Filter document_text#3 LIKE concat(%,keyword#4,%)
##   CartesianProduct
##    Scan PhysicalRDD[document_id#2L,document_text#3]
##    Scan PhysicalRDD[keyword#4]

like_with_python_udf.explain()

## TungstenProject [document_id#2L,keyword#4]
##  Filter pythonUDF#13
##   !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3,keyword#4), ...
##    CartesianProduct
##     Scan PhysicalRDD[document_id#2L,document_text#3]
##     Scan PhysicalRDD[keyword#4]
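
To make the cost concrete, here is a plain-Python model (not Spark code) of what both plans do with the dummy data: every document is paired with every keyword, and only then is the filter applied. The variable names are chosen for this illustration only:

```python
from itertools import product

documents = [
    (1, "apache spark is the best"),
    (2, "erlang rocks"),
    (3, "but haskell is better"),
]
keywords = ["erlang", "haskell", "spark"]

comparisons = 0
matches = []
# Every document is paired with every keyword before the filter runs.
for (document_id, text), keyword in product(documents, keywords):
    comparisons += 1
    if keyword in text:
        matches.append((document_id, keyword))

print(comparisons)  # 9, i.e. len(documents) * len(keywords)
print(matches)      # [(1, 'spark'), (2, 'erlang'), (3, 'haskell')]
```

Three documents and three keywords need only 9 comparisons, but the count grows with the product of the two sizes: a million documents against ten thousand keywords would mean 10^10 evaluated pairs.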

There are other ways to achieve a similar effect without a full Cartesian product.

Join on tokenized document - useful if the keywords list is too large to be handled in the memory of a single machine

from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, explode

tokenizer = Tokenizer(inputCol="document_text", outputCol="words")

tokenized = (tokenizer.transform(documents_df)
    .select(col("document_id"), explode(col("words")).alias("token")))

like_with_tokenizer = (tokenized
    .join(keywords_df, col("token") == col("keyword"))
    .drop("token"))

like_with_tokenizer.show()

## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## |          3|haskell|
## |          1|  spark|
## |          2| erlang|
## +-----------+-------+
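
Note that an equality join on tokens is not equivalent to LIKE: it matches whole tokens only, so a keyword embedded in a longer word or followed by punctuation is missed. The plain-Python sketch below illustrates the difference. Its `tokenize` function only approximates `Tokenizer` (lowercase, then split on whitespace), and all function names are assumptions made for the example:

```python
import re

def tokenize(text):
    # Approximation of pyspark.ml.feature.Tokenizer:
    # lowercase the input, then split it on whitespace.
    return [token for token in re.split(r"\s+", text.lower()) if token]

def substring_match(text, keyword):
    # What LIKE '%keyword%' or the `in` based UDF would report.
    return keyword in text

def token_match(text, keyword):
    # What the equality join on exploded tokens would report.
    return keyword in tokenize(text)

for text in ["apache spark is the best", "sparkling water", "spark, of course"]:
    print(text, substring_match(text, "spark"), token_match(text, "spark"))
# apache spark is the best True True
# sparkling water True False
# spark, of course True False
```

If punctuation matters for your data, a tokenizer with a custom split pattern (for example `RegexTokenizer`) may be a better fit, and multi-word keywords cannot be matched this way at all.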

It requires a shuffle but not a Cartesian product:

like_with_tokenizer.explain()

## TungstenProject [document_id#2L,keyword#4]
##  SortMergeJoin [token#29], [keyword#4]
##   TungstenSort [token#29 ASC], false, 0
##    TungstenExchange hashpartitioning(token#29)
##     TungstenProject [document_id#2L,token#29]
##      !Generate explode(words#27), true, false, [document_id#2L, ...
##       ConvertToSafe
##        TungstenProject [document_id#2L,UDF(document_text#3) AS words#27]
##         Scan PhysicalRDD[document_id#2L,document_text#3]
##   TungstenSort [keyword#4 ASC], false, 0
##    TungstenExchange hashpartitioning(keyword#4)
##     ConvertToUnsafe
##      Scan PhysicalRDD[keyword#4]

Python UDF and broadcast variable - if the keywords list is relatively small

from pyspark.sql.types import ArrayType, StringType

# Go through .rdd: DataFrame.map is not available from Spark 2.0 onwards
keywords = sc.broadcast(set(
    keywords_df.rdd.map(lambda row: row[0]).collect()))

bd_contains = udf(
    lambda s: list(set(s.split()) & keywords.value), 
    ArrayType(StringType()))


like_with_bd = (documents_df.select(
    col("document_id"), 
    explode(bd_contains(col("document_text"))).alias("keyword")))

like_with_bd.show()

## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## |          1|  spark|
## |          2| erlang|
## |          3|haskell|
## +-----------+-------+
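
The logic inside `bd_contains` is an ordinary set intersection, so it can be written and tested without Spark. Below is a minimal sketch under two assumptions that go beyond the code above: the text is lowercased before splitting, and a missing text yields an empty list. `find_keywords` is a hypothetical name:

```python
def find_keywords(text, keywords):
    """Return the keywords that occur as whole tokens in `text`.

    `keywords` is expected to be a set of lowercase strings. A missing
    text yields an empty list instead of raising AttributeError.
    """
    if text is None:
        return []
    # Sorted only to make the output order deterministic.
    return sorted(set(text.lower().split()) & keywords)

keyword_set = {"erlang", "haskell", "spark"}

print(find_keywords("Apache Spark is the best", keyword_set))  # ['spark']
print(find_keywords("erlang and haskell", keyword_set))        # ['erlang', 'haskell']
print(find_keywords(None, keyword_set))                        # []
```

To use it with the broadcast variable, the UDF would become something like `udf(lambda s: find_keywords(s, keywords.value), ArrayType(StringType()))`. As with the tokenized join, this matches whole tokens rather than arbitrary substrings.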

It requires neither a shuffle nor a Cartesian product, but you still have to transfer the broadcast variable to each worker node.

like_with_bd.explain()

## TungstenProject [document_id#2L,keyword#46]
##  !Generate explode(pythonUDF#47), true, false, ...
##   ConvertToSafe
##    TungstenProject [document_id#2L,pythonUDF#47]
##     !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3), ...
##      Scan PhysicalRDD[document_id#2L,document_text#3]

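Since Spark 1.6.0 you can mark a small data frame using sql.functions.broadcast to get a similar effect as above without using UDFs and explicit broadcast variables. Reusing the tokenized data: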

from pyspark.sql.functions import broadcast

like_with_tokenizer_and_bd = (broadcast(tokenized)
    .join(keywords_df, col("token") == col("keyword"))
    .drop("token"))

like_with_tokenizer_and_bd.explain()

## TungstenProject [document_id#3L,keyword#5]
##  BroadcastHashJoin [token#10], [keyword#5], BuildLeft
##   TungstenProject [document_id#3L,token#10]
##    !Generate explode(words#8), true, false, ...
##     ConvertToSafe
##      TungstenProject [document_id#3L,UDF(document_text#4) AS words#8]
##       Scan PhysicalRDD[document_id#3L,document_text#4]
##   ConvertToUnsafe
##    Scan PhysicalRDD[keyword#5]
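
Conceptually, a broadcast hash join builds a hash table from one side and streams the other side past it. The plan above builds from the left (tokenized) side because that is the frame wrapped in `broadcast`. The plain-Python model below (not Spark code) instead builds the table from the keywords, on the assumption that they are the smaller side. `broadcast_hash_join` is a made-up name for this sketch:

```python
def broadcast_hash_join(large_rows, small_keys):
    """Model of a broadcast hash join on token == keyword.

    The small side is turned into a hash set once (in Spark it would be
    shipped to every executor). The large side is streamed past it, so
    neither a shuffle of the large side nor a pairwise comparison is needed.
    """
    lookup = set(small_keys)
    return [(document_id, token)
            for document_id, token in large_rows
            if token in lookup]

tokenized_rows = [
    (1, "apache"), (1, "spark"), (1, "is"), (1, "the"), (1, "best"),
    (2, "erlang"), (2, "rocks"),
    (3, "but"), (3, "haskell"), (3, "is"), (3, "better"),
]

print(broadcast_hash_join(tokenized_rows, ["erlang", "haskell", "spark"]))
# [(1, 'spark'), (2, 'erlang'), (3, 'haskell')]
```

In PySpark, hinting the keywords side instead would presumably look like `tokenized.join(broadcast(keywords_df), col("token") == col("keyword"))`. Check the plan with `explain()` to see which side is actually built.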
