Wil*_*ill 4 sql postgresql distinct-on
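I'm trying to write a query in Postgres that returns only distinct tuples. In my example query, I don't want duplicate rows when the same cluster_id/feed_id combination occurs more than once. If I do something simple such as: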
select distinct on (cluster_info.cluster_id, feed_id)
cluster_info.cluster_id, num_docs, feed_id, url_time
from url_info
join cluster_info on (cluster_info.cluster_id = url_info.cluster_id)
where feed_id in (select pot_seeder from potentials)
and num_docs > 5 and url_time > '2012-04-16';
I get that, but I also want the results ordered by num_docs. So, when I do the following:
select distinct on (cluster_info.cluster_id, feed_id)
cluster_info.cluster_id, num_docs, feed_id, url_time
from url_info join cluster_info
on (cluster_info.cluster_id = url_info.cluster_id)
where feed_id in (select pot_seeder from potentials)
and num_docs > 5 and url_time > '2012-04-16'
order by num_docs desc;
I get the following error:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste...
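I think I understand why I get the error (you can't group by a tuple unless the group is described explicitly somehow), but how do I do that? Or, if my interpretation of the error is wrong, is there a way to achieve my initial goal?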
Erw*_*ter 11
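The leftmost ORDER BY items must not disagree with the items of the DISTINCT ON clause. I quote the manual on DISTINCT:
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.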
Try:
SELECT *
FROM (
SELECT DISTINCT ON (c.cluster_id, feed_id)
c.cluster_id, num_docs, feed_id, url_time
FROM url_info u
JOIN cluster_info c ON (c.cluster_id = u.cluster_id)
WHERE feed_id IN (SELECT pot_seeder FROM potentials)
AND num_docs > 5
AND url_time > '2012-04-16'
ORDER BY c.cluster_id, feed_id, num_docs, url_time
-- first columns match DISTINCT
-- the rest to pick certain values for dupes
-- or did you want to pick random values for dupes?
) x
ORDER BY num_docs DESC;
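To illustrate how the trailing ORDER BY expressions decide which row survives in each DISTINCT ON group, here is a minimal sketch. The table t and its columns grp and val are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table, for illustration only (not part of the question's schema).
CREATE TEMP TABLE t (grp int, val int);
INSERT INTO t VALUES (1, 10), (1, 30), (2, 20);

-- Leading ORDER BY item matches DISTINCT ON (grp);
-- the trailing item keeps the row with the smallest val per grp.
SELECT DISTINCT ON (grp) grp, val
FROM   t
ORDER  BY grp, val;

-- Same grouping, but the row with the largest val per grp is kept.
SELECT DISTINCT ON (grp) grp, val
FROM   t
ORDER  BY grp, val DESC;
```

If no trailing expression is given, which row is kept within each group is unpredictable.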
Or use GROUP BY:
SELECT c.cluster_id
, num_docs
, feed_id
, url_time
FROM url_info u
JOIN cluster_info c ON (c.cluster_id = u.cluster_id)
WHERE feed_id IN (SELECT pot_seeder FROM potentials)
AND num_docs > 5
AND url_time > '2012-04-16'
GROUP BY c.cluster_id, feed_id
ORDER BY num_docs DESC;
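This only works in PostgreSQL 9.1 or later, and only if c.cluster_id, feed_id are the primary key columns of all the tables (both, in this case) that contribute columns to the SELECT list.
Otherwise, you need to GROUP BY the remaining columns as well, or aggregate them, or provide more information.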
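As a rough sketch of the aggregate variant, the query below keeps one row per (cluster_id, feed_id) pair without relying on primary keys. The table definitions are assumptions (the question does not show the schema or say which table owns num_docs, feed_id and url_time), and the choice of max() for both columns is only one possible way to resolve duplicates.

```sql
-- Hypothetical minimal schema; the real tables are not shown in the question.
CREATE TEMP TABLE cluster_info (cluster_id int PRIMARY KEY, num_docs int);
CREATE TEMP TABLE url_info (cluster_id int, feed_id int, url_time timestamp);
CREATE TEMP TABLE potentials (pot_seeder int);

INSERT INTO cluster_info VALUES (1, 10), (2, 20);
INSERT INTO url_info VALUES
  (1, 100, '2012-04-17'),
  (1, 100, '2012-04-18'),  -- duplicate (cluster_id, feed_id) pair
  (2, 100, '2012-04-17');
INSERT INTO potentials VALUES (100);

SELECT c.cluster_id
     , max(c.num_docs) AS max_num_docs     -- assumption: keep the largest value per group
     , u.feed_id
     , max(u.url_time) AS latest_url_time  -- assumption: keep the most recent timestamp
FROM   url_info u
JOIN   cluster_info c ON c.cluster_id = u.cluster_id
WHERE  u.feed_id IN (SELECT pot_seeder FROM potentials)
AND    c.num_docs > 5
AND    u.url_time > '2012-04-16'
GROUP  BY c.cluster_id, u.feed_id
ORDER  BY max_num_docs DESC;
```

Unlike DISTINCT ON, the aggregated values in one output row may come from different input rows, so this fits only when that is acceptable.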