How to order distinct tuples in a PostgreSQL query

Wil*_*ill 4 sql postgresql distinct-on

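I'm trying to write a query in Postgres that returns only distinct tuples. In my sample query, I don't want duplicate entries when a cluster_id/feed_id combination occurs more than once. If I do a simple: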

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info 
   join cluster_info on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16';

I get exactly that, but I'd also like to sort the result by num_docs. So, when I do the following:

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info join cluster_info 
   on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16' 
   order by num_docs desc;

I get the following error:

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste...

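I think I understand why I'm getting the error (you can't group by a tuple unless you describe the group explicitly somehow), but how do I do that? Or, if my interpretation of the error is wrong, is there a way to accomplish my initial goal?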

Erw*_*ter 11

The leftmost ORDER BY items must match the items of the DISTINCT ON clause. I quote the manual on DISTINCT:

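The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.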

Try:

SELECT *
FROM  (
    SELECT DISTINCT ON (c.cluster_id, feed_id) 
           c.cluster_id, num_docs, feed_id, url_time 
    FROM   url_info u
    JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
    WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
    AND    num_docs > 5
    AND    url_time > '2012-04-16'
    ORDER  BY c.cluster_id, feed_id, num_docs, url_time
           -- first columns match DISTINCT
           -- the rest to pick certain values for dupes
           -- or did you want to pick random values for dupes?
    ) x
ORDER  BY num_docs DESC;
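
To see how the extra ORDER BY expressions decide which row survives in each group, here is a minimal self-contained sketch. The inline VALUES rows and the alias t are made up for illustration and only stand in for the joined url_info / cluster_info data. Sorting by num_docs DESC inside the subquery is an assumption that you want the highest count per (cluster_id, feed_id) pair.

```sql
SELECT *
FROM  (
    SELECT DISTINCT ON (cluster_id, feed_id)
           cluster_id, feed_id, num_docs
    FROM  (VALUES (1, 10, 7)        -- hypothetical sample rows
                , (1, 10, 9)        -- duplicate (cluster_id, feed_id) pair
                , (2, 20, 6)
          ) AS t(cluster_id, feed_id, num_docs)
    ORDER  BY cluster_id, feed_id   -- must match the DISTINCT ON list
            , num_docs DESC         -- tiebreaker: keep the largest num_docs per pair
    ) x
ORDER  BY num_docs DESC;            -- final presentation order
```

With these sample rows the query returns (1, 10, 9) followed by (2, 20, 6): the inner ORDER BY picks one row per pair, and the outer ORDER BY is free to sort the deduplicated result any way you like.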

Or use GROUP BY:

SELECT c.cluster_id
     , num_docs
     , feed_id
     , url_time 
FROM   url_info u
JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
AND    num_docs > 5
AND    url_time > '2012-04-16'
GROUP  BY c.cluster_id, feed_id 
ORDER  BY num_docs DESC;

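If c.cluster_id, feed_id are the primary key columns of all (both, in this case) tables whose columns appear in the SELECT list, then this just works with PostgreSQL 9.1 or later.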

Otherwise, you need to GROUP BY the remaining columns, aggregate them, or provide more information.
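
As a rough sketch of the aggregate route, the columns outside the GROUP BY list can be wrapped in aggregate functions. The sample rows and the alias t below are hypothetical, and choosing max() for both columns is an assumption about what you want to see per group. Note that max(num_docs) and max(url_time) may come from different source rows, unlike the DISTINCT ON variant, which always returns one actual row per group.

```sql
SELECT cluster_id
     , feed_id
     , max(num_docs) AS num_docs    -- assumption: report the largest count per pair
     , max(url_time) AS url_time    -- assumption: report the latest time per pair
FROM  (VALUES (1, 10, 7, date '2012-04-17')   -- hypothetical sample rows
            , (1, 10, 9, date '2012-04-18')
            , (2, 20, 6, date '2012-04-20')
      ) AS t(cluster_id, feed_id, num_docs, url_time)
GROUP  BY cluster_id, feed_id
ORDER  BY max(num_docs) DESC;
```

This form does not depend on primary keys, so it should also run on versions before 9.1.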