Sea*_*yen 21 apache-spark apache-spark-sql pyspark
I am trying to do a left outer join in Spark (1.6.2), but it is not working. My SQL query is:
sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016""").show()
The result looks like this:
+--------------------+--------------------+--------------------+
| type| uuid| uuid|
+--------------------+--------------------+--------------------+
| tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
| swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
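I get the same result with either LEFT JOIN or LEFT OUTER JOIN (the second uuid is not null).
I would expect the second uuid column to contain only nulls. How do I do a left outer join correctly?
=== Additional information ===
If I do the left outer join using DataFrames, I get the correct result.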
s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')
(s.join(p, s.uuid == p.uuid, 'left_outer')
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month)
 .show())
I get this result:
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| type| s_uuid| p_uuid| created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
Thanks,
Arv*_*mar 34
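I don't see any problem in your code. Both "left join" and "left outer join" work fine. Please check the data again: the rows you are showing are the ones that matched.
You can also perform a Spark SQL join using the following:
// Explicit left outer join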
df1.join(df2, df1("col1") === df2("col1"), "left_outer")
You are filtering out the null values of p.created_year (and of p.uuid) with
where t.created_year = 2016
and p.created_year = 2016
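To see the effect in isolation, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Spark. That substitution is an assumption made so the example runs without a cluster; the LEFT JOIN and WHERE behavior it relies on is standard SQL. The table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INTEGER);
CREATE TABLE plugin (uuid TEXT, created_year INTEGER);
-- hypothetical rows: 'a' has a matching plugin, 'b' does not
INSERT INTO symptom_type VALUES ('tained', 'a', 2016), ('swapper', 'b', 2016);
INSERT INTO plugin VALUES ('a', 2016);
""")

# The LEFT JOIN alone keeps 'b', padding the plugin columns with NULL.
joined = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p ON t.uuid = p.uuid
    ORDER BY t.uuid
""").fetchall()

# Adding p.created_year = 2016 to WHERE discards 'b':
# NULL = 2016 is not true, so the NULL-padded row is filtered out.
filtered = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p ON t.uuid = p.uuid
    WHERE t.created_year = 2016 AND p.created_year = 2016
    ORDER BY t.uuid
""").fetchall()

print(joined)
print(filtered)
```

With the WHERE condition on p in place, the query returns the same rows an inner join would.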
The way to avoid this is to move the filter on p into the ON clause:
sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016""").show()
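This is correct, but it can be inefficient, because the filter on t.created_year should also be applied before the join happens. Using subqueries is therefore recommended: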
sqlContext.sql("""select t.type, t.uuid, p.uuid
from (
SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid""").show()
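As a quick check that both rewrites keep the unmatched rows, the following sketch again uses Python's sqlite3 module rather than Spark (an assumption, chosen so it runs anywhere), with made-up rows. It only compares the results of the two query shapes and says nothing about how Spark plans or optimizes either one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INTEGER);
CREATE TABLE plugin (uuid TEXT, created_year INTEGER);
-- hypothetical rows: 'a' matches a 2016 plugin, 'b' only a 2015 plugin,
-- and 'c' is a 2015 symptom that should not appear at all
INSERT INTO symptom_type VALUES
    ('tained', 'a', 2016), ('swapper', 'b', 2016), ('old', 'c', 2015);
INSERT INTO plugin VALUES ('a', 2016), ('b', 2015);
""")

# Filter on p inside ON: unmatched left rows survive with NULL.
on_filter = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p
    ON t.uuid = p.uuid AND p.created_year = 2016
    WHERE t.created_year = 2016
    ORDER BY t.uuid
""").fetchall()

# Filter both sides in subqueries first, then join.
subqueries = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM (SELECT type, uuid FROM symptom_type WHERE created_year = 2016) t
    LEFT JOIN (SELECT uuid FROM plugin WHERE created_year = 2016) p
    ON t.uuid = p.uuid
    ORDER BY t.uuid
""").fetchall()

print(on_filter)
print(on_filter == subqueries)
```

Row 'b' appears in both results with a NULL plugin uuid, because its only plugin row is from 2015 and is excluded before or during the join rather than after it.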