Spark DataFrame range join is slow

Question

Sch*_*äbo 5 java apache-spark apache-spark-sql spark-dataframe

I have the following input data (in Parquet) for a Spark job:

Person (millions of rows)
+---------+----------+---------------+---------------+
|  name   | location |     start     |      end      |
+---------+----------+---------------+---------------+
| Person1 |     1230 | 1478630000001 | 1478630000010 |
| Person2 |     1230 | 1478630000002 | 1478630000012 |
| Person2 |     1230 | 1478630000013 | 1478630000020 |
| Person3 |     3450 | 1478630000001 | 1478630000015 |
+---------+----------+---------------+---------------+


Event (millions of rows)
+----------+----------+---------------+
|  event   | location |  start_time   |
+----------+----------+---------------+
| Biking   |     1230 | 1478630000005 |
| Skating  |     1230 | 1478630000014 |
| Baseball |     3450 | 1478630000015 |
+----------+----------+---------------+

I need to transform it into the following expected result:

[{
    "name" : "Biking",
    "persons" : ["Person1", "Person2"]
},
{
    "name" : "Skating",
    "persons" : ["Person2"]
},
{
    "name" : "Baseball",
    "persons" : ["Person3"]
}]

In words: the result is a list of events, and each event contains the list of persons who participated in it.

A person counts as a participant if:

Person.start <= Event.start_time
&& Person.end >= Event.start_time
&& Person.location == Event.location
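
As a quick sanity check, the condition can be written as a small predicate and run against the sample rows. This is only a sketch: the class name `ParticipationCheck` and the method `participates` are made up for illustration. It uses inclusive bounds, which matches the `geq`/`leq` calls in the join code further down and the expected output, where Baseball starts exactly at Person3's end.

```java
class ParticipationCheck {

    // Hypothetical helper: same location, and the event's start_time
    // falls inside the person's [start, end] window (inclusive on both ends).
    static boolean participates(int personLocation, long personStart, long personEnd,
                                int eventLocation, long eventStartTime) {
        return personLocation == eventLocation
            && personStart <= eventStartTime
            && eventStartTime <= personEnd;
    }

    public static void main(String[] args) {
        // Person3 vs Baseball: the event starts exactly at the person's end.
        System.out.println(participates(3450, 1478630000001L, 1478630000015L,
                                        3450, 1478630000015L)); // prints true
        // Person2 (first row) vs Skating: start_time is past the person's end.
        System.out.println(participates(1230, 1478630000002L, 1478630000012L,
                                        1230, 1478630000014L)); // prints false
    }
}
```

With strict comparisons the first call would return false, so Person3 would drop out of Baseball.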

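I have tried different approaches, but the only one that actually seems to work is to join the two data frames and then group/aggregate by event. However, the join is very slow and does not distribute well across multiple CPU cores.

Current code for the join: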

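import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.col;

// Equi-join on location plus an inclusive range check on start_time.
final DataFrame fullFrame = persons.as("persons")
    .join(events.as("events"),
        col("persons.location").equalTo(col("events.location"))
            .and(col("events.start_time").geq(col("persons.start")))
            .and(col("events.start_time").leq(col("persons.end"))),
        "inner");

// count() is only here to trigger an action
fullFrame.count();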

I am using Spark Standalone with Java, in case that makes a difference.

Does anyone have a better idea of how to solve this with Spark 1.6.2?

Answer 1

Elm*_*cek 1

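A range join is executed as a cross product followed by a filter step. A potentially better solution is to broadcast the presumably smaller events table and then map over the persons table: inside the map, check the join condition and emit the matching results.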
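
A minimal sketch of that idea in plain Java, without Spark, so it runs on its own. All class and method names here (`BroadcastRangeJoinSketch`, `indexEvents`, `eventsFor`, `personsByEvent`) are made up for illustration. It assumes the events fit in memory and that event names are unique, as in the sample data. In a real job the index would presumably be wrapped with `JavaSparkContext.broadcast(...)` and `eventsFor` called from a `flatMap` over the persons RDD; that wiring is not shown. Keeping each location's events sorted by start_time in a `TreeMap` means a person only touches the events inside their [start, end] window rather than every event at that location.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BroadcastRangeJoinSketch {

    static final class Person {
        final String name; final int location; final long start; final long end;
        Person(String name, int location, long start, long end) {
            this.name = name; this.location = location; this.start = start; this.end = end;
        }
    }

    static final class Event {
        final String event; final int location; final long startTime;
        Event(String event, int location, long startTime) {
            this.event = event; this.location = location; this.startTime = startTime;
        }
    }

    // The structure that would be broadcast: location -> (start_time -> event names).
    static Map<Integer, TreeMap<Long, List<String>>> indexEvents(List<Event> events) {
        Map<Integer, TreeMap<Long, List<String>>> index = new HashMap<>();
        for (Event e : events) {
            index.computeIfAbsent(e.location, k -> new TreeMap<>())
                 .computeIfAbsent(e.startTime, k -> new ArrayList<>())
                 .add(e.event);
        }
        return index;
    }

    // The "map side": events at the person's location whose start_time is in [start, end].
    static List<String> eventsFor(Person p, Map<Integer, TreeMap<Long, List<String>>> index) {
        List<String> hits = new ArrayList<>();
        TreeMap<Long, List<String>> byTime = index.get(p.location);
        if (byTime != null && p.start <= p.end) { // subMap rejects start > end
            for (List<String> names : byTime.subMap(p.start, true, p.end, true).values()) {
                hits.addAll(names);
            }
        }
        return hits;
    }

    // Collect the emitted (event, person) pairs into event -> persons.
    // Assumption: event names are unique, so the name can serve as the grouping key.
    static Map<String, List<String>> personsByEvent(List<Person> persons, List<Event> events) {
        Map<Integer, TreeMap<Long, List<String>>> index = indexEvents(events);
        Map<String, List<String>> result = new TreeMap<>();
        for (Person p : persons) {
            for (String event : eventsFor(p, index)) {
                result.computeIfAbsent(event, k -> new ArrayList<>()).add(p.name);
            }
        }
        return result;
    }

    // Sample rows copied from the question.
    static List<Person> samplePersons() {
        return Arrays.asList(
            new Person("Person1", 1230, 1478630000001L, 1478630000010L),
            new Person("Person2", 1230, 1478630000002L, 1478630000012L),
            new Person("Person2", 1230, 1478630000013L, 1478630000020L),
            new Person("Person3", 3450, 1478630000001L, 1478630000015L));
    }

    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event("Biking", 1230, 1478630000005L),
            new Event("Skating", 1230, 1478630000014L),
            new Event("Baseball", 3450, 1478630000015L));
    }

    public static void main(String[] args) {
        // prints {Baseball=[Person3], Biking=[Person1, Person2], Skating=[Person2]}
        System.out.println(personsByEvent(samplePersons(), sampleEvents()));
    }
}
```

Whether this beats the built-in join depends on how large the events table really is; if it does not fit comfortably in each executor's memory, broadcasting it is not an option.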
