What is a better way to write this query?

H.R*_*imy 5 postgresql performance subquery relational-division postgresql-performance

My database (PostgreSQL 10) has the following schema:

CREATE TABLE "PulledTexts" (
"Id" serial PRIMARY KEY,
"BaseText" TEXT,
"CleanText" TEXT
);

CREATE TABLE "UniqueWords" (
"Id" serial PRIMARY KEY,
"WordText" TEXT
);

CREATE TABLE "WordTexts" (
"Id" serial PRIMARY KEY,
"TextIdId" INTEGER REFERENCES "PulledTexts",
"WordIdId" INTEGER REFERENCES "UniqueWords"
);
CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId");
CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId");

Some sample data:

INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES
(1, 'automate business audit', null),
(2, 'audit trial', null),
(3, 'trial', null),
(4, 'audit', null),
(5, 'fresh report', null),
(6, 'fresh audit', null),
(7, 'automate this script', null),
(8, 'im trying here', null),
(9, 'automate this business', null),
(10, 'lateral', null);

INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES
(1, 'trial'),
(2, 'audit'),
(3, 'creation'),
(4, 'business'),
(5, 'automate');

INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES
(1, 1, 2),
(2, 1, 4),
(3, 1, 5),
(4, 2, 1),
(5, 3, 1),
(6, 4, 2),
(7, 6, 2),
(8, 7, 5),
(9, 9, 4),
(10, 9, 5),
(11, 2, 2);

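The database itself is currently created through Entity Framework migrations.

I would like to know whether there is a better, and in particular more efficient, way to write this query, since the WordTexts table will hold hundreds of thousands of rows and eventually millions. I am also open to going the NoSQL route if that is more efficient for this type of query.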

SELECT *
FROM "PulledTexts"
WHERE "Id" IN (
 SELECT "TextIdId"
 FROM "WordTexts" AS "wordTexts"
 LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
 WHERE "wordTexts.WordId"."WordText" = 'automate'

 OR "TextIdId" IN (
  SELECT "TextIdId" and1
  from "WordTexts" AS "wordTexts"
  LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
  where "wordTexts.WordId"."WordText" = 'audit' INTERSECT

  SELECT "TextIdId" and2
  from "WordTexts" AS "wordTexts"
  LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
  WHERE "wordTexts.WordId"."WordText" = 'trial'
 )
);

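At a high level, this query should return the PulledTexts whose Id matches one of the IDs returned by the subquery. The subquery just returns the list of IDs of PulledTexts that contain the words ('audit' AND 'trial') OR 'automate' from the UniqueWords table. That is exactly what the example query above does. The WordTexts table is a simple mapping of UniqueWords to PulledTexts.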

Erw*_*ter 6

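While your query works, I would do quite a few things differently.

  1. Don't use CaMeL-case names in Postgres if you can avoid it. Your unnamed entity framework may force this nonsense on you, but I did not want to deal with the mess of double quotes, so I tested against your schema after removing all double quotes, which effectively folds all identifiers to lower case.

  2. Don't use illegible or illegal column and table aliases (like "wordTexts.WordId"). This is a matter of taste and style (and sanity), but you also omitted the AS keyword where it should be used and kept it where it can be omitted.

  3. I also reformatted some more to make the query easier for me to understand. This last part is entirely optional, but do use some consistent formatting style.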
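To illustrate point 1, here is a sketch of the question's schema with the double quotes removed. The column definitions are copied from the question; only the quoting changes. Postgres folds unquoted identifiers to lower case, so any unquoted spelling resolves to the same object.

```sql
-- Same schema as in the question, with the double quotes removed.
-- Unquoted identifiers are folded to lower case: PulledTexts is stored as pulledtexts.
CREATE TABLE PulledTexts (
   Id        serial PRIMARY KEY,
   BaseText  text,
   CleanText text
);

CREATE TABLE UniqueWords (
   Id       serial PRIMARY KEY,
   WordText text
);

CREATE TABLE WordTexts (
   Id       serial PRIMARY KEY,
   TextIdId integer REFERENCES PulledTexts,
   WordIdId integer REFERENCES UniqueWords
);

-- Any unquoted spelling works:
SELECT id, basetext FROM pulledtexts;
SELECT Id, BaseText FROM PulledTexts;

-- A double-quoted mixed-case name no longer matches:
-- SELECT "Id" FROM "PulledTexts";  -- ERROR: relation "PulledTexts" does not exist
```

The queries below assume this unquoted variant of the schema.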

Arriving at:

SELECT *
FROM   PulledTexts
WHERE  Id IN (
   SELECT w.TextIdId
   FROM   WordTexts w  -- AS can be omitted for table alias
   LEFT   JOIN UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN might be necessary here
   WHERE  u.WordText = 'automate'
   OR     w.TextIdId IN (
      SELECT w.TextIdId  -- AS and1 -- column alias only documentation here, not visible
      FROM   WordTexts w
      JOIN   UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN misleading here
      WHERE  u.WordText = 'audit'

      INTERSECT
      SELECT w.TextIdId  -- AS and2  -- but don't omit AS for column alias
      FROM   WordTexts w
      JOIN   UniqueWords u ON w.WordIdId = u.Id
      WHERE  u.WordText = 'trial'
      )
   );

This can be simplified to:

SELECT *
FROM  (
   SELECT w.TextIdId AS Id
   FROM   WordTexts   w
   JOIN   UniqueWords u ON w.WordIdId = u.Id  -- now we don't need LEFT any more
   WHERE  u.WordText = 'automate'

   UNION
   SELECT w.TextIdId
   FROM   WordTexts w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'audit'

   INTERSECT
   SELECT w.TextIdId
   FROM   WordTexts w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'trial'
   ) w
JOIN   PulledTexts p USING (Id)

We don't need additional parentheses because, according to the manual:

INTERSECT binds more tightly than UNION. That is, A UNION B INTERSECT C will be read as A UNION (B INTERSECT C).
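A minimal sketch of that precedence rule, using constant one-row SELECTs instead of the real tables so the effect is easy to see:

```sql
-- Read as: SELECT 1 UNION (SELECT 2 INTERSECT SELECT 3)
-- The intersection of {2} and {3} is empty, so the result is the single row 1.
SELECT 1 UNION SELECT 2 INTERSECT SELECT 3;

-- Explicit parentheses change the grouping:
-- the intersection of {1, 2} and {3} is empty, so this returns no rows.
(SELECT 1 UNION SELECT 2) INTERSECT SELECT 3;
```

In the simplified query, the 'automate' branch plays the role of A, and the 'audit' and 'trial' branches play the roles of B and C.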

But this should be faster, while replacing the multiple intersected subqueries:

SELECT *
FROM  (
   SELECT w.TextIdId AS Id
   FROM   WordTexts   w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'automate'

   UNION
   SELECT TextIdId
   FROM   WordTexts w1
   JOIN   WordTexts w2 USING (TextIdId)
   WHERE  w1.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'audit')
   AND    w2.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'trial')
   ) w
JOIN   PulledTexts p USING (Id)

The INTERSECT part can be cast as a relational-division problem. An explanation is given in a related answer from yesterday.

db<>fiddle here
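For comparison, another widely used formulation of relational division is GROUP BY with HAVING. This is a sketch of an alternative, not the self-join form recommended above, and it assumes the unquoted schema. Which of the two is faster depends on data distribution and indexes, so it is worth testing both.

```sql
-- Find texts that contain BOTH 'audit' and 'trial':
-- keep only rows for the wanted words, then require that
-- each text matched two distinct words.
SELECT w.TextIdId
FROM   WordTexts   w
JOIN   UniqueWords u ON u.Id = w.WordIdId
WHERE  u.WordText IN ('audit', 'trial')
GROUP  BY w.TextIdId
HAVING count(DISTINCT w.WordIdId) = 2;
```

With the sample data from the question, only the text with Id 2 ('audit trial') qualifies.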

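Most important for performance is having the right indexes. You should probably have a UNIQUE constraint on (WordIdId, TextIdId) in table WordTexts, which implements the currently missing index on these two columns, in this order.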
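A sketch of how that constraint could be added, assuming the unquoted schema. The constraint name wordtexts_word_text_uni is made up for this example, and the statement fails if the table already contains duplicate (WordIdId, TextIdId) pairs.

```sql
-- Hypothetical constraint name; pick whatever fits your naming scheme.
-- Creating the constraint also creates a unique multicolumn index
-- on (WordIdId, TextIdId), in this column order.
ALTER TABLE WordTexts
   ADD CONSTRAINT wordtexts_word_text_uni UNIQUE (WordIdId, TextIdId);

-- Check what the planner does with the self-join part afterwards:
EXPLAIN (ANALYZE, BUFFERS)
SELECT TextIdId
FROM   WordTexts w1
JOIN   WordTexts w2 USING (TextIdId)
WHERE  w1.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'audit')
AND    w2.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'trial');
```

If the schema keeps the quoted Entity Framework names, the same statement needs the identifiers double-quoted, for example "WordTexts" and ("WordIdId", "TextIdId").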