I have two tables in Hive, t1 and t2:
>describe t1;
>date_id string
>describe t2;
>messageid string,
createddate string,
userid int
> select * from t1 limit 3;
> 2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
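What I want is to count the distinct user IDs over the three days ending on a given date.
For example, for the date 2011-01-03, I want to count the distinct user IDs from 2011-01-01 to 2011-01-03.
For the date 2011-01-04, I want to count the distinct user IDs from 2011-01-02 to 2011-01-04.
I wrote the following query, but it does not return the three-day result. It returns the distinct user IDs for each single day.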
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
In other words, the following part has no effect:
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution could be the following (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
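UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL, and unfortunately they cannot be used in Hive. My current solution is to use UNION ALL.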
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I do the group by and count. This way I get what I want.
It is not elegant, but it is much more efficient than the cross join.
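The UNION ALL step followed by the group by and count can be sketched end to end. This is only an illustration, not Hive: it assumes Python's sqlite3 module as a stand-in engine, uses SQLite's date() in place of to_date() and date_add(), and runs on made-up sample rows rather than the real tables.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (date_id TEXT);
CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER);
INSERT INTO t1 VALUES ('2011-01-03 00:00:00'), ('2011-01-04 00:00:00');
-- Made-up sample rows, not the real data.
INSERT INTO t2 VALUES
  ('m1', '2011-01-01 10:00:00', 1),
  ('m2', '2011-01-02 11:00:00', 2),
  ('m3', '2011-01-03 23:57:01', 1),
  ('m4', '2011-01-03 23:57:59', 3),
  ('m5', '2011-01-04 09:00:00', 4);
""")

# One equality join per day offset (0, 1, 2), glued together with UNION ALL,
# then a single GROUP BY / COUNT(DISTINCT ...) over the combined rows.
union_query = """
SELECT d, COUNT(DISTINCT userid)
FROM (
    SELECT date(t1.date_id) AS d, t2.userid AS userid
    FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate)
    UNION ALL
    SELECT date(t1.date_id), t2.userid
    FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate, '+1 day')
    UNION ALL
    SELECT date(t1.date_id), t2.userid
    FROM t1 JOIN t2 ON date(t1.date_id) = date(t2.createddate, '+2 days')
) AS unioned
GROUP BY d
ORDER BY d
"""
union_counts = conn.execute(union_query).fetchall()
print(union_counts)
```

With these sample rows, 2011-01-03 gets 3 distinct users and 2011-01-04 gets 4. Each branch is a plain equality join, which is the only kind of join condition this approach relies on.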
Mat*_*lie 11
The following should work in standard SQL...
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
AND to_date(t2.createddate) < date_add(to_date(t1.date_id), 1)
GROUP BY
to_date(t1.date_id)
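It will, however, be slow. Because you store the dates as strings, you have to use to_date() to convert them to dates. This means that indexes cannot be used, and the SQL engine cannot do anything clever to reduce the work involved.
As a result, every possible combination of rows has to be compared. If you have 100 entries in t1 and 10,000 entries in t2, the SQL engine processes a million combinations.
If you stored these values as dates, you would not need to_date(). And if you indexed the dates, the SQL engine could quickly home in on the specified date range.
NOTE: With the ON clause written as a range like this, you do not need to round t2.createddate down to a daily value.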
EDIT: Why your code didn't work...
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
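This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Because that condition is an equality, every t2.createddate in the joined rows MUST fall on the same day as t1.date_id (with a LEFT OUTER JOIN it could also be NULL where there is no match).
The WHERE clause allows a much wider range (3 days), but the ON clause of the JOIN has already restricted your data to a single day.
The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.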
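The difference between the two join conditions can be checked on a small data set. The following is a minimal sketch under stated assumptions: it uses Python's sqlite3 module instead of Hive, SQLite's date() with day modifiers instead of to_date(), date_sub() and date_add(), and invented sample rows.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (date_id TEXT);
CREATE TABLE t2 (messageid TEXT, createddate TEXT, userid INTEGER);
INSERT INTO t1 VALUES ('2011-01-03 00:00:00'), ('2011-01-04 00:00:00');
-- Made-up sample rows, not the real data.
INSERT INTO t2 VALUES
  ('m1', '2011-01-01 10:00:00', 1),
  ('m2', '2011-01-02 11:00:00', 2),
  ('m3', '2011-01-03 23:57:01', 1),
  ('m4', '2011-01-03 23:57:59', 3),
  ('m5', '2011-01-04 09:00:00', 4);
""")

# Equality in ON: the join keeps same-day rows only, so the 3-day WHERE
# filter has nothing left to widen.
same_day_counts = conn.execute("""
SELECT date(t1.date_id), COUNT(DISTINCT t2.userid)
FROM t1 JOIN t2 ON date(t2.createddate) = date(t1.date_id)
WHERE date(t2.createddate) > date(t1.date_id, '-3 days')
  AND date(t2.createddate) <= date(t1.date_id)
GROUP BY date(t1.date_id)
ORDER BY 1
""").fetchall()

# Range in ON: the join itself pairs each date with the 3-day window.
window_counts = conn.execute("""
SELECT date(t1.date_id), COUNT(DISTINCT t2.userid)
FROM t1 LEFT JOIN t2
  ON date(t2.createddate) >= date(t1.date_id, '-2 days')
 AND date(t2.createddate) <  date(t1.date_id, '+1 day')
GROUP BY date(t1.date_id)
ORDER BY 1
""").fetchall()

print(same_day_counts)
print(window_counts)
```

On this sample the equality join reports 2 and 1 distinct users for 2011-01-03 and 2011-01-04 (per-day counts), while the range join reports 3 and 4 (three-day counts).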
EDIT
Hive doesn't allow <= and >= in the ON clause? Are you really stuck with Hive???
If you really are, how about BETWEEN?
SELECT
to_date(t1.date_id),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
-- BETWEEN is inclusive at both ends, so the upper bound is the date itself
ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
to_date(t1.date_id)
Alternatively, restructure your dates table to enumerate the dates you want to include...
TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
2011-01-03, 2011-01-02
2011-01-03, 2011-01-03
2011-01-04, 2011-01-02
2011-01-04, 2011-01-03
2011-01-04, 2011-01-04
2011-01-05, 2011-01-03
2011-01-05, 2011-01-04
2011-01-05, 2011-01-05 }
SELECT
to_date(t1.calendar_date),
count(distinct t2.userid)
FROM
t1
LEFT JOIN
t2
ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
to_date(t1.calendar_date)
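
The enumerated (calendar_date, inclusive_date) rows could be generated rather than typed by hand. A minimal sketch in Python, where the helper name window_pairs and its days parameter are hypothetical and not part of the answer above; loading the rows into Hive is left out:

```python
from datetime import date, timedelta

def window_pairs(calendar_dates, days=3):
    """Return (calendar_date, inclusive_date) pairs, oldest inclusive date first.

    Hypothetical helper: each calendar date is paired with itself and the
    preceding days - 1 dates.
    """
    pairs = []
    for d in calendar_dates:
        for offset in range(days - 1, -1, -1):
            inclusive = d - timedelta(days=offset)
            pairs.append((d.isoformat(), inclusive.isoformat()))
    return pairs

pairs = window_pairs([date(2011, 1, 3), date(2011, 1, 4), date(2011, 1, 5)])
for calendar_date, inclusive_date in pairs:
    print(calendar_date, inclusive_date)
```

For the three calendar dates shown, this yields the same nine rows as the table listed above. Once the table is expanded this way, the join needs only an equality condition.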