Hive SQL aggregation

chn*_*net 4 sql hive group-by

I have two tables in Hive, t1 and t2.

>describe t1;
>date_id    string

>describe t2;
>messageid string,
 createddate string,
 userid int

> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 

> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785

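What I want is to count the distinct user IDs over the three days ending on a given date.
For example, for date 2011-01-03, I want to count the distinct user IDs from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count the distinct user IDs from 2011-01-02 to 2011-01-04.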

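I wrote the following query, but it does not return the three-day result. It returns the distinct user IDs for each single day instead.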

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

`to_date()` and `date_sub()` are date functions in Hive.

That is, the following part has no effect.

WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 

EDIT: One solution that works (but it is super slow):

SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);

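UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive. My current solution is to use UNION ALL.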

 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL 
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))

Then I do the GROUP BY and COUNT on that result. This way I get what I want.
It is not elegant, but it is far more efficient than a cross join.
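For illustration, the UNION ALL step and the GROUP BY / COUNT step described above might be combined as in the following sketch. The subquery alias `u` and the output column name `distinct_users` are made up for this example, and the UNION ALL is wrapped in a subquery on the assumption that the Hive version in use requires that.

```sql
-- Sketch only: each branch matches rows whose createddate falls
-- 0, 1, or 2 days before the t1 date, using equality joins only.
SELECT
  u.date_id,
  COUNT(DISTINCT u.userid) AS distinct_users
FROM (
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
  UNION ALL
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
  UNION ALL
  SELECT to_date(t1.date_id) AS date_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.date_id;
```

Selecting only the two columns that are needed, rather than `SELECT *`, keeps the intermediate result small before the aggregation.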

Mat*_*lie 11

The following should work in standard SQL...

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)

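It will, however, be slow. Because you store the dates as strings, you have to use to_date() to convert them to dates. This means that indexes cannot be used, and the SQL engine cannot do anything clever to reduce the effort being expended.

As a result, every possible combination of rows needs to be compared. If there are 100 entries in t1 and 10,000 entries in t2, the SQL engine processes one million combinations.

If you stored these values as dates, you would not need to_date(). And if you indexed the dates, the SQL engine could quickly home in on the specified date range.

NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.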
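As a rough sketch of the point about storing the values as dates: assuming a Hive version that supports the DATE type, a typed copy of t2 could be materialised once so that later queries do not call to_date() on every row. The table name `t2_by_day` and the column name `created_day` are hypothetical.

```sql
-- Sketch only: t2_by_day and created_day are made-up names.
-- The string timestamp is truncated to its day once, at load time.
CREATE TABLE t2_by_day AS
SELECT
  messageid,
  CAST(to_date(createddate) AS DATE) AS created_day,
  userid
FROM t2;
```

Queries can then compare `created_day` directly instead of converting `createddate` each time they run.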


EDIT: Why your code didn't work...

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

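This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). With the inner JOIN shown, every joined row therefore has a t2.createddate on the same day as t1.date_id. (With a LEFT OUTER JOIN, t2.createddate would be either NULL, for no match, or on that same day.)

The WHERE clause allows a much wider range (3 days). But the ON clause of the JOIN has already restricted your data to a single day.

The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.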

EDIT

Hive doesn't allow <= and >= in the ON clause? Are you really stuck with Hive?

If you really are, how about BETWEEN?

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)


Or, refactor your table of dates to enumerate the dates you want to include...

TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03

  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04

  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05 }

SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
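One possible way to build the enumerated date table above from the original single-column t1, sketched under the assumption that `LATERAL VIEW explode()` is available in the Hive version in use. The table name `t1_window` and the aliases `o` and `days_back` are hypothetical.

```sql
-- Sketch only: t1_window, o and days_back are made-up names.
-- Each date in t1 is expanded into three rows: the date itself
-- and the two days before it.
CREATE TABLE t1_window AS
SELECT
  to_date(t1.date_id)                        AS calendar_date,
  date_sub(to_date(t1.date_id), o.days_back) AS inclusive_date
FROM t1
LATERAL VIEW explode(array(0, 1, 2)) o AS days_back;
```

The equality join on inclusive_date shown above would then run against `t1_window` in place of the restructured t1, which keeps the ON clause to a plain equality.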