dat*_*ess 4 hadoop hive join hiveql
Hive's join documentation encourages the use of implicit joins, i.e.
SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';
Is this equivalent to
SELECT t1.*, t2.*, t3.*
FROM table1 t1
INNER JOIN table2 t2 ON
t1.id = t2.id
INNER JOIN table3 t3 ON
t2.id = t3.id
WHERE t1.zipcode = '02535'
or will the first query return additional records?
Not always. Your two queries are equivalent. But without WHERE t1.id = t2.id AND t2.id = t3.id the implicit join becomes a CROSS JOIN.
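As a quick sanity check outside Hive, the equivalence can be sketched with Python's built-in sqlite3 module. This assumes SQLite treats comma joins the same way Hive does for this case; only the table and column names come from the question, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data: SQLite stands in for Hive here.
for name in ("table1", "table2", "table3"):
    cur.execute(f"CREATE TABLE {name} (id INTEGER, zipcode TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?, ?)",
                [(1, "02535"), (2, "02535"), (3, "10001")])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "x"), (3, "x")])
cur.executemany("INSERT INTO table3 VALUES (?, ?)",
                [(1, "y"), (2, "y"), (3, "y")])

# Implicit (comma) join with the join conditions in WHERE.
implicit_rows = cur.execute("""
    SELECT *
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535'
    ORDER BY t1.id
""").fetchall()

# Explicit INNER JOIN with the join conditions in ON.
explicit_rows = cur.execute("""
    SELECT t1.*, t2.*, t3.*
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.id = t2.id
    INNER JOIN table3 t3 ON t2.id = t3.id
    WHERE t1.zipcode = '02535'
    ORDER BY t1.id
""").fetchall()

# Without any WHERE, the comma join is a full cross product: 3 * 2 * 3 rows.
cross_count = cur.execute(
    "SELECT COUNT(*) FROM table1 t1, table2 t2, table3 t3"
).fetchone()[0]

print(implicit_rows == explicit_rows)  # True
print(cross_count)                     # 18
```

Both forms return the same single row for this data, while dropping the WHERE clause yields every combination of rows.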
Update:
This is an interesting question, so I decided to add a demo. Let's create two tables:
A(c1 int, c2 string) and B(c1 int, c2 string).
Load the data:
insert into table A
select 1, 'row one' union all
select 2, 'row two';
insert into table B
select 1, 'row one' union all
select 3, 'row three';
Check the data:
hive> select * from A;
OK
1 row one
2 row two
Time taken: 1.29 seconds, Fetched: 2 row(s)
hive> select * from B;
OK
1 row one
3 row three
Time taken: 0.091 seconds, Fetched: 2 row(s)
Check the cross join (an implicit join without WHERE is converted to CROSS):
hive> select a.c1, a.c2, b.c1, b.c2 from a,b;
Warning: Map Join MAPJOIN[14][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=b] in task 'Stage-4:MAPRED' is a cross product
Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product
OK
1 row one 1 row one
2 row two 1 row one
1 row one 3 row three
2 row two 3 row three
Time taken: 54.804 seconds, Fetched: 4 row(s)
Check the inner join (an implicit join with WHERE works as INNER):
hive> select a.c1, a.c2, b.c1, b.c2 from a,b where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 38.413 seconds, Fetched: 1 row(s)
Try to perform a left join by adding OR b.c1 is null to the WHERE clause:
hive> select a.c1, a.c2, b.c1, b.c2 from a,b where (a.c1=b.c1) OR (b.c1 is null);
OK
1 row one 1 row one
Time taken: 57.317 seconds, Fetched: 1 row(s)
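As you can see, we got an inner join again. The OR b.c1 is null condition has no effect, because no row of the cross product has a NULL b.c1.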
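Why the extra condition changes nothing can be sketched with Python's built-in sqlite3 module, using the same two tables. SQLite is only a stand-in for Hive here, on the assumption that both filter a comma join the same way: the WHERE clause is applied to the cross product, and that product contains no row with a NULL b.c1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same tables and rows as in the demo; SQLite stands in for Hive.
cur.execute("CREATE TABLE a (c1 INTEGER, c2 TEXT)")
cur.execute("CREATE TABLE b (c1 INTEGER, c2 TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?)", [(1, "row one"), (2, "row two")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(1, "row one"), (3, "row three")])

# How many rows of the cross product have a NULL b.c1? None do.
null_rows = cur.execute(
    "SELECT COUNT(*) FROM a, b WHERE b.c1 IS NULL"
).fetchone()[0]

# So the OR branch never matches and only the equality branch keeps rows.
filtered = cur.execute("""
    SELECT a.c1, a.c2, b.c1, b.c2
    FROM a, b
    WHERE (a.c1 = b.c1) OR (b.c1 IS NULL)
""").fetchall()

print(null_rows)  # 0
print(filtered)   # [(1, 'row one', 1, 'row one')]
```

NULLs on the right side only appear when an outer join with a join condition fails to find a match, which a comma join never produces.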
Now a left join without WHERE and ON clauses (converted to CROSS):
select a.c1, a.c2, b.c1, b.c2 from a left join b;
OK
1 row one 1 row one
1 row one 3 row three
2 row two 1 row one
2 row two 3 row three
Time taken: 37.104 seconds, Fetched: 4 row(s)
As you can see, we got a cross join again.
Try a left join with a WHERE clause and without an ON clause (works as INNER):
select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 40.617 seconds, Fetched: 1 row(s)
We got an INNER join.
Try a left join with a WHERE clause and without ON, plus try to allow NULLs:
select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1 or b.c1 is null;
OK
1 row one 1 row one
Time taken: 53.873 seconds, Fetched: 1 row(s)
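We got INNER again. The OR b.c1 is null condition has no effect: without ON the left join acts as a cross join, so b.c1 is never NULL.
Left join with an ON clause: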
hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1;
OK
1 row one 1 row one
2 row two NULL NULL
Time taken: 48.626 seconds, Fetched: 2 row(s)
Yes, this is a true left join.
Left join with ON + WHERE (we get INNER):
hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 49.54 seconds, Fetched: 1 row(s)
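We got INNER because the WHERE clause does not allow NULLs: for the unmatched row, a.c1 = b.c1 evaluates to NULL, so the row is filtered out.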
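The NULL filtering can be made visible by selecting the comparison itself. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Hive, assuming both follow the usual SQL rule that a comparison with NULL yields NULL and that WHERE keeps only rows where the condition is true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same tables and rows as in the demo; SQLite stands in for Hive.
cur.execute("CREATE TABLE a (c1 INTEGER, c2 TEXT)")
cur.execute("CREATE TABLE b (c1 INTEGER, c2 TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?)", [(1, "row one"), (2, "row two")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(1, "row one"), (3, "row three")])

# Show what a.c1 = b.c1 evaluates to for each row of the left join.
with_flag = cur.execute("""
    SELECT a.c1, b.c1, a.c1 = b.c1
    FROM a LEFT JOIN b ON a.c1 = b.c1
    ORDER BY a.c1
""").fetchall()

# Using the same comparison in WHERE drops the row where it is NULL.
after_where = cur.execute("""
    SELECT a.c1, b.c1
    FROM a LEFT JOIN b ON a.c1 = b.c1
    WHERE a.c1 = b.c1
    ORDER BY a.c1
""").fetchall()

print(with_flag)    # [(1, 1, 1), (2, None, None)]
print(after_where)  # [(1, 1)]
```

The unmatched row (2, NULL) survives the join itself but not the WHERE filter, which is what turns the left join into an effective inner join.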
Left join with ON + WHERE that allows NULLs:
hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1 or b.c1 is null;
OK
1 row one 1 row one
2 row two NULL NULL
Time taken: 55.951 seconds, Fetched: 2 row(s)
Yes, it is a left join.
Conclusion: