Better SQL - :group vs. :select => 'DISTINCT'

cit*_*ite 3 sql postgresql activerecord ruby-on-rails

Let's assume three models with the standard associations:

class Mailbox < ActiveRecord::Base
  has_many :addresses
  has_many :domains, :through => :addresses
end

class Address < ActiveRecord::Base
  belongs_to :mailbox
  belongs_to :domain
end

class Domain < ActiveRecord::Base
  has_many :addresses
  has_many :mailboxes, :through => :addresses
end

Now, if for any given mailbox you want to know which domains it has addresses in, there are two possible approaches:

m = Mailbox.first
# either: SELECT DISTINCT domains.id, domains.name FROM "domains" INNER JOIN 
#         "addresses" ON "domains".id = "addresses".domain_id WHERE 
#         (("addresses".mailbox_id = 1))
m.domains.all(:select => 'DISTINCT domains.id, domains.name')
# or: SELECT domains.id, domains.name FROM "domains" INNER JOIN "addresses" ON
#     "domains".id = "addresses".domain_id WHERE (("addresses".mailbox_id = 1))
#      GROUP BY domains.id, domains.name
m.domains.all(:select => 'domains.id, domains.name', 
  :group => 'domains.id, domains.name')
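As a rough illustration of why the two queries are interchangeable in terms of results, here is a minimal in-memory sketch. The `Row` struct and the sample domain names are made up for illustration and stand in for the rows produced by the INNER JOIN; nothing here touches a database.

```ruby
# Hypothetical rows standing in for the join result for one mailbox
# (one row per address; the domain names are invented for this sketch).
Row = Struct.new(:id, :name)

joined = [
  Row.new(1, 'example.com'),
  Row.new(2, 'example.org'),
  Row.new(1, 'example.com') # a second address in the same domain
]

# SELECT DISTINCT id, name: drop duplicate (id, name) pairs.
distinct_rows = joined.map { |r| [r.id, r.name] }.uniq

# SELECT id, name ... GROUP BY id, name: one output row per group key.
grouped_rows = joined.group_by { |r| [r.id, r.name] }.keys

# Both approaches yield the same set of rows.
puts distinct_rows.sort == grouped_rows.sort
```

When the GROUP BY list matches the SELECT list and no aggregates are involved, the two forms differ only in how the database chooses to execute them, which is what the plans below are about.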

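My problem is that I don't know which solution is better. When I don't specify any other conditions, the PostgreSQL query planner favors the second solution (which works as expected), but if I add conditions to the query, it comes down to "Unique" vs. "Group":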

With "DISTINCT":

 Unique  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

With "GROUP BY":

Group  (cost=16.56..16.57 rows=1 width=150)
   ->  Sort  (cost=16.56..16.56 rows=1 width=150)
         Sort Key: domains.name, domains.id
         ->  Nested Loop  (cost=0.00..16.55 rows=1 width=150)
               ->  Index Scan using index_addresses_on_mailbox_id on addresses  (cost=0.00..8.27 rows=1 width=4)
                     Index Cond: (mailbox_id = 1)
               ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=150)
                     Index Cond: (domains.id = addresses.domain_id)
                     Filter: (domains.active AND domains.selfmgmt)
(9 rows)

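I'm really not sure how to determine which is the better way to retrieve this data. My gut tells me to use "GROUP BY", but I couldn't find any documentation that adequately addresses this question.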

Should I use ":group" or ":select => 'DISTINCT'"? Does the same choice hold for other modern RDBMSs such as Oracle, DB2, or MySQL (I don't have access to those, so I can't run the tests)?

小智 10

If you are using PostgreSQL < 8.4 (which I guess you are, given the plans), it is usually better to use GROUP BY rather than DISTINCT, because it is planned more efficiently.

In 8.4 there is no difference, because DISTINCT was "taught" to use the grouping operators as well.
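To check this on a particular PostgreSQL version, the two plans can be compared directly. The sketch below only builds the two EXPLAIN statements from the question as strings; `explain_statements` and `BASE_JOIN` are hypothetical names introduced here, and actually running the statements assumes a live database connection that is not shown.

```ruby
# Hypothetical helper: builds the two queries from the question and prefixes
# them with EXPLAIN so their plans can be compared side by side.
BASE_JOIN = 'FROM "domains" INNER JOIN "addresses" ' \
            'ON "domains".id = "addresses".domain_id ' \
            'WHERE "addresses".mailbox_id = %d'

def explain_statements(mailbox_id)
  # Integer() raises on non-numeric input, so only a number is interpolated.
  from_where = format(BASE_JOIN, Integer(mailbox_id))
  {
    :distinct => "EXPLAIN SELECT DISTINCT domains.id, domains.name #{from_where}",
    :group    => "EXPLAIN SELECT domains.id, domains.name #{from_where} " \
                 'GROUP BY domains.id, domains.name'
  }
end

explain_statements(1).each do |label, sql|
  puts "#{label}: #{sql}"
  # With a live connection (an assumption, not part of this sketch), each
  # statement could be run with something like:
  #   ActiveRecord::Base.connection.select_values(sql)
end
```

If both statements print the same plan shape on your server, the choice between `:group` and `:select => 'DISTINCT'` is a matter of readability there rather than performance.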