Values larger than 1/3 of a buffer page cannot be indexed

Question


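I'm not very good with databases, so please bear with me.

I'm trying to put a very long JSON document into a table that was created by the Django framework.

I'm using Postgres on Heroku. When I try to insert the data, I get the following error: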

File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.OperationalError: index row size 3496 exceeds maximum 2712 for index "editor_contentmodel_content_2192f49c_uniq"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.


My database and table look like this:

gollahalli-me-django-test::DATABASE=> \dt
                      List of relations
 Schema |            Name            | Type  |     Owner
--------+----------------------------+-------+----------------
 public | auth_group                 | table | ffnyjettujyfck
 public | auth_group_permissions     | table | ffnyjettujyfck
 public | auth_permission            | table | ffnyjettujyfck
 public | auth_user                  | table | ffnyjettujyfck
 public | auth_user_groups           | table | ffnyjettujyfck
 public | auth_user_user_permissions | table | ffnyjettujyfck
 public | django_admin_log           | table | ffnyjettujyfck
 public | django_content_type        | table | ffnyjettujyfck
 public | django_migrations          | table | ffnyjettujyfck
 public | django_session             | table | ffnyjettujyfck
 public | editor_contentmodel        | table | ffnyjettujyfck
(11 rows)


gollahalli-me-django-test::DATABASE=> \d+ editor_contentmodel
                            Table "public.editor_contentmodel"
  Column   |           Type           | Modifiers | Storage  | Stats target | Description
-----------+--------------------------+-----------+----------+--------------+-------------
 ref_id    | character varying(120)   | not null  | extended |              |
 content   | text                     | not null  | extended |              |
 timestamp | timestamp with time zone | not null  | plain    |              |
Indexes:
    "editor_contentmodel_pkey" PRIMARY KEY, btree (ref_id)
    "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)
    "editor_contentmodel_ref_id_8f74b4f3_like" btree (ref_id varchar_pattern_ops)


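It looks like I have to change "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id) so that it indexes md5(content) instead.

Can anyone help me with this? I don't know how to do it.

Update:

JSON content - https://gist.github.com/akshaybabloo/0b3dc1fb4d964b10d09ccd6884fe3a40

Update 2:

I created the following UNIQUE index. Which index should I drop now?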

gollahalli_me_django=> create unique index on editor_contentmodel (ref_id, md5(content::text));
CREATE INDEX
gollahalli_me_django=> \d editor_contentmodel;
        Table "public.editor_contentmodel"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 ref_id    | character varying(120)   | not null
 content   | jsonb                    | not null
 timestamp | timestamp with time zone | not null
Indexes:
    "editor_contentmodel_pkey" PRIMARY KEY, btree (ref_id)
    "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id) <---- 1
    "editor_contentmodel_ref_id_md5_idx" UNIQUE, btree (ref_id, md5(content::text))
    "editor_contentmodel_ref_id_8f74b4f3_like" btree (ref_id varchar_pattern_ops) <----2


Should I drop 1 or 2 (see the arrows)?

Answer 1

Eva*_*oll 8

You have a UNIQUE index on (content, ref_id), called editor_contentmodel_content_2192f49c_uniq

"editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)


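I'm not sure why you have that to begin with, so let's step back and look at what it does. It makes sure that the combination of content and ref_id is unique. However, in PostgreSQL a UNIQUE constraint is implemented with a btree, which makes this a poor solution. With this method you are creating a btree over content that essentially duplicates the size of this small table and makes for a gigantic index. And that gigantic index is still limited by the size of content, as you've found. It raises a few questions

Do you care that content is unique? If you do care that content is unique for a given ref_id, then what you probably want is to store a hash of the content. Something like..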
```
CREATE TABLE foo ( ref_id int, content text );
CREATE UNIQUE INDEX ON foo (ref_id,md5(content));
```
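This will instead store the md5sum of content in the btree. As long as each content under a given ref_id has a unique md5 for that ref_id, you're fine.
If you do not care that content is unique, consider removing the index entirely.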
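
As a rough illustration of why hashing gets around the size limit, here is a minimal Python sketch (Python only because the project is Django; the helper name `md5_hex` and the sample documents are made up for this example). It assumes that PostgreSQL's `md5()` returns the digest as 32 hexadecimal characters, which is also what `hashlib` produces, so the indexed value stays the same size however long content gets.

```python
import hashlib
import json


def md5_hex(text):
    """Hex MD5 digest of a string, the same form PostgreSQL's md5(text) returns."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical JSON documents of very different sizes.
small_doc = json.dumps({"title": "short"})
large_doc = json.dumps({"body": "x" * 100_000})

small_digest = md5_hex(small_doc)
large_digest = md5_hex(large_doc)

# Both digests have the same fixed length, far below the 2712-byte limit
# quoted in the error message, no matter how large the document is.
print(len(large_doc), len(small_digest), len(large_digest))
```

The trade-off is that the index now only knows about the hash, not the content itself.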

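It's worth noting that when a UNIQUE constraint is implemented with a btree (as PostgreSQL does), you get an added index for free. Under normal circumstances this has a fringe benefit.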

CREATE TABLE foo ( ref_id int, content text );
CREATE UNIQUE INDEX ON foo (ref_id,content);


will speed up the query

SELECT *
FROM foo
WHERE ref_id = 5
  AND content = 'This content'


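However, when you use the functional md5() variant, there is no longer an index on content itself, so to use that index you will have to either

query on ref_id only, or
add to the ref_id condition a clause such as md5(content) = md5('This content')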
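
To make the second option concrete, here is a sketch that emulates the expression index in an in-memory SQLite database from Python, simply because that runs without a PostgreSQL server. The `md5` SQL function is registered by hand (SQLite has none built in), the index name `foo_ref_md5` is made up, and the `deterministic` flag needs Python 3.8 or later. On PostgreSQL you would use the built-in `md5()` and the query shape would stay the same.

```python
import hashlib
import sqlite3


def md5_hex(text):
    """Hex MD5 digest, standing in for PostgreSQL's built-in md5()."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# SQLite only allows deterministic functions inside an index expression.
conn.create_function("md5", 1, md5_hex, deterministic=True)

conn.execute("CREATE TABLE foo (ref_id INTEGER, content TEXT)")
conn.execute("CREATE UNIQUE INDEX foo_ref_md5 ON foo (ref_id, md5(content))")
conn.execute("INSERT INTO foo VALUES (5, 'This content')")
conn.execute("INSERT INTO foo VALUES (5, 'Other content')")

# The WHERE clause repeats the indexed expression, md5(content),
# instead of comparing content directly.
rows = conn.execute(
    "SELECT ref_id, content FROM foo "
    "WHERE ref_id = ? AND md5(content) = md5(?)",
    (5, "This content"),
).fetchall()

# A second row with the same ref_id and the same content hash is refused.
try:
    conn.execute("INSERT INTO foo VALUES (5, 'This content')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows, duplicate_rejected)
```

Whether the planner actually picks the index is a separate matter that you would check with EXPLAIN on the real database.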

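The whole text = text comparison is overrated. It is almost never what you want. If you want to speed up queries over text, a btree is pretty useless. You probably want to look at alternatives, such as the full text indexing that the HINT in your error message mentions.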

更新 1

Based on your JSON, I would suggest storing it as jsonb and then creating the index on md5(content); so instead of the above, perhaps run this.

ALTER TABLE public.editor_contentmodel
  ALTER COLUMN content
  SET DATA TYPE jsonb
  USING content::jsonb;

CREATE UNIQUE INDEX ON public.editor_contentmodel (ref_id, md5(content::text));


更新 2

You asked which indexes you should remove

gollahalli_me_django=> create unique index on editor_contentmodel (ref_id, md5(content::text));
CREATE INDEX
gollahalli_me_django=> \d editor_contentmodel;
        Table "public.editor_contentmodel"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 ref_id    | character varying(120)   | not null
 content   | jsonb                    | not null
 timestamp | timestamp with time zone | not null
Indexes:
    "editor_contentmodel_pkey" PRIMARY KEY, btree (ref_id)
    "editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id) <---- 1
    "editor_contentmodel_ref_id_md5_idx" UNIQUE, btree (ref_id, md5(content::text))
    "editor_contentmodel_ref_id_8f74b4f3_like" btree (ref_id varchar_pattern_ops) <----2


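Here is the surprising answer: you should drop all of them except editor_contentmodel_pkey, which says that every ref_id must be unique.

editor_contentmodel_content_2192f49c_uniq: this index makes sure that the pair of ref_id AND content is UNIQUE, but if you can't have a duplicate ref_id, you can never have duplicate content for a ref_id either. So you can never violate this index without also violating editor_contentmodel_pkey. That makes it pointless.
editor_contentmodel_ref_id_md5_idx: this index is pointless for the same reason. You can never have a duplicate md5(content::text) for a ref_id because, whatever the value of md5(content::text) is, you can never have a duplicate ref_id.
editor_contentmodel_ref_id_8f74b4f3_like: this is also a bad idea because you are duplicating the index over ref_id. It isn't useless, it's just not optimal. If you need varchar_pattern_ops, use it on the content field instead.

As a last note, we don't use varchar much in PostgreSQL because it is implemented as a varlena with a check constraint. It gains you nothing over text, so just use text. Unless there is a concrete reason why ref_id may be 120 characters long but never 121, I would simply use the text type.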

更新 3

Let's go back to your earlier problem..

psycopg2.OperationalError: index row size 3496 exceeds maximum 2712 for index "editor_contentmodel_content_2192f49c_uniq"


This tells you that the problem is specifically with the index "editor_contentmodel_content_2192f49c_uniq". You have defined that as

"editor_contentmodel_content_2192f49c_uniq" UNIQUE CONSTRAINT, btree (content, ref_id)


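So the problem here is that you are trying to create an index over content. But, again, the index itself stores the actual JSON contents of content, and that is what exceeds the limit. The limit isn't the real problem, though, because even if it weren't in place, editor_contentmodel_content_2192f49c_uniq would still be totally useless. Why? Again, you can't add more uniqueness to a row that is already guaranteed to be 100% unique. This point doesn't seem to be getting across, so let's keep it simple.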

ref_id | content
1      | 1
1      | 1
1      | 2
2      | 1


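In the table above, a sole unique index/constraint (with no other indexes) over (ref_id, content) makes sense because it would stop the duplication of (1,1). An index over (ref_id, md5(content)) would also make sense because it would stop the duplication of (1,1) by proxy, by stopping the duplication of (1, md5(1)). However, all of this only works because in the example I've given ref_id is NOT guaranteed to be UNIQUE. Your ref_id is not this ref_id. Your ref_id is a PRIMARY KEY. That means it is guaranteed to be unique.

That means neither the duplicate (1,1) nor the row (1,2) could EVER be inserted. It also means that an index over ref_id plus anything else cannot guarantee any more uniqueness. Such indexes can only be LESS strict than the index you already have. So your table could only ever look like this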

ref_id | content
1      | 1
2      | 1

