Sim*_*kin 5 postgresql swap etl
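I want to backfill a column on a large (20M rows) table that is read often but rarely written. From various articles and questions on SO, it seems the best way to do this is to create a table with an identical structure, load it with the backfilled data, and swap it in live (since renaming is very fast). Sounds good!
But when I actually write the script to do this, it is mind-blowingly long. Here's a taste: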
BEGIN;
CREATE TABLE foo_new (LIKE foo);
-- I don't use INCLUDING ALL, because that produces Indexes/Constraints with different names
-- This is the only part of the script that is specific to my case.
-- Everything else is standard for any table swap
INSERT INTO foo_new (id, first_name, last_name, email, full_name)
(SELECT id, first_name, last_name, email, first_name || last_name FROM foo);
CREATE SEQUENCE foo_new_id_seq
START 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
SELECT setval('foo_new_id_seq', COALESCE((SELECT MAX(id)+1 FROM foo_new), 1), false);
ALTER SEQUENCE foo_new_id_seq OWNED BY foo_new.id;
ALTER TABLE ONLY foo_new ALTER COLUMN id SET DEFAULT nextval('foo_new_id_seq'::regclass);
ALTER TABLE foo_new
ADD CONSTRAINT foo_new_pkey
PRIMARY KEY (id);
COMMIT;
-- Indexes are made concurrently, otherwise they would block reads for
-- a long time. Concurrent index creation cannot occur within a transaction.
CREATE INDEX CONCURRENTLY foo_new_on_first_name ON foo_new USING btree (first_name);
CREATE INDEX CONCURRENTLY foo_new_on_last_name ON foo_new USING btree (last_name);
CREATE INDEX CONCURRENTLY foo_new_on_email ON foo_new USING btree (email);
-- One more line for each index
BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
ALTER SEQUENCE foo_id_seq RENAME TO foo_old_id_seq;
ALTER SEQUENCE foo_new_id_seq RENAME TO foo_id_seq;
ALTER TABLE foo_old RENAME CONSTRAINT foo_pkey TO foo_old_pkey;
ALTER TABLE foo RENAME CONSTRAINT foo_new_pkey TO foo_pkey;
ALTER INDEX foo_on_first_name RENAME TO foo_old_on_first_name;
ALTER INDEX foo_on_last_name RENAME TO foo_old_on_last_name;
ALTER INDEX foo_on_email RENAME TO foo_old_on_email;
-- One more line for each index
ALTER INDEX foo_new_on_first_name RENAME TO foo_on_first_name;
ALTER INDEX foo_new_on_last_name RENAME TO foo_on_last_name;
ALTER INDEX foo_new_on_email RENAME TO foo_on_email;
-- One more line for each index
COMMIT;
-- TODO: drop old table (CASCADE)
And this doesn't even include foreign keys or other constraints! Since the only part of this that is specific to my case is the INSERT INTO bit, I'm surprised that there's no built-in Postgres function to do this sort of swapping. Is this operation less common than I make it out to be? Am I underestimating the variety of ways this can be accomplished? Is my desire to keep naming consistent an atypical one?
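It's probably not all that common. Most tables aren't big enough to warrant it, and most applications can tolerate some downtime here and there.
More importantly, different applications can cut corners in different ways depending on their workload. The database server can't; it needs to handle (or very deliberately not handle) every possible obscure edge case, which may be a lot harder than you'd expect. Ultimately, it probably makes more sense to write tailored solutions for different use cases.
In any case, if you're just trying to implement a computed field as simple as first_name || last_name, there are better ways of doing it: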
ALTER TABLE foo RENAME TO foo_base;
CREATE VIEW foo AS
SELECT
id,
first_name,
last_name,
email,
(first_name || last_name) AS full_name
FROM foo_base;
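As a rough usage sketch of the view above: reads keep working against the name foo, and writes to the plain columns should pass through to foo_base, assuming PostgreSQL 9.4 or later, where a simple view with an expression column is still automatically updatable. The email values and the id are made up for illustration.

```sql
-- Reads are unchanged; full_name is computed at query time.
SELECT id, full_name
FROM foo
WHERE email = 'someone@example.com';   -- hypothetical value

-- Updates to ordinary columns are forwarded to foo_base
-- (assumption: PostgreSQL 9.4+ automatically updatable views).
UPDATE foo
SET email = 'new@example.com'          -- hypothetical value
WHERE id = 1;

-- full_name is an expression column, so it cannot be written directly.
-- The following statement would be rejected:
-- UPDATE foo SET full_name = 'x' WHERE id = 1;
```

The trade-off is that full_name is recomputed on every read and cannot be indexed through the view itself.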
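Assuming your real case is more complicated, all of this effort may still be unnecessary. I believe the copy-and-rename approach rests largely on the assumption that you need to lock the table against concurrent modifications for the duration of the process, so the goal is to finish as quickly as possible. If all concurrent operations are read-only, which appears to be the case since you aren't locking the table, then you're probably better off with a simple UPDATE (which won't block SELECTs), even if it does take a bit longer (though it does have the advantage of avoiding foreign key re-checks and TOAST table rewrites).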
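For the example in the question, the in-place alternative might look like the sketch below. The first statement is the plain UPDATE described above. The batched variant is my own assumption, not part of the original suggestion: it presumes id is a reasonably dense integer key, and the range bounds are arbitrary.

```sql
-- Simplest form: one statement, does not block SELECTs,
-- but touches every row in a single long transaction.
UPDATE foo
SET full_name = first_name || last_name;

-- Hypothetical batched variant (assumption: id is a dense integer key).
-- Run it repeatedly with an advancing id range, committing between batches.
UPDATE foo
SET full_name = first_name || last_name
WHERE id >= 1 AND id < 100001
  -- Skip rows that already hold the right value; IS DISTINCT FROM
  -- treats NULLs as comparable, unlike <>.
  AND full_name IS DISTINCT FROM (first_name || last_name);
```

Batching keeps each transaction short, at the cost of the backfill being only partially applied while it runs.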
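If the copy-and-rename approach really is justified, I think there are a few opportunities for improvement:
CREATE INDEX CONCURRENTLY seems unnecessary, since nobody else should be trying to access foo_new yet. In fact, if the whole script ran in a single transaction, the table wouldn't even be visible externally at that point.
If the new table is created in a separate schema, the chain of RENAMEs can be replaced with a single ALTER TABLE foo SET SCHEMA public.
LOCK foo IN SHARE MODE wouldn't hurt either.
EDIT:
Sequence reassignment turned out to be a little more complicated than I expected, because sequences apparently need to stay in the same schema as their parent table. But here is a (seemingly) working example: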
BEGIN;
LOCK public.foo IN SHARE MODE;
CREATE SCHEMA tmp;
CREATE TABLE tmp.foo (LIKE public.foo);
INSERT INTO tmp.foo (id, first_name, last_name, email, full_name)
SELECT id, first_name, last_name, email, (first_name || last_name) FROM public.foo;
ALTER TABLE tmp.foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
CREATE INDEX foo_on_first_name ON tmp.foo (first_name);
CREATE INDEX foo_on_last_name ON tmp.foo (last_name);
CREATE INDEX foo_on_email ON tmp.foo (email);
ALTER TABLE tmp.foo ALTER COLUMN id SET DEFAULT nextval('public.foo_id_seq');
ALTER SEQUENCE public.foo_id_seq OWNED BY NONE;
DROP TABLE public.foo;
ALTER TABLE tmp.foo SET SCHEMA public;
ALTER SEQUENCE public.foo_id_seq OWNED BY public.foo.id;
DROP SCHEMA tmp;
COMMIT;
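After the swap commits, a few catalog queries can confirm that everything landed where expected. This is only a sanity-check sketch that reuses the names from the example above (public.foo, foo_id_seq, tmp); it is not part of the swap itself.

```sql
-- Should return public.foo_id_seq if sequence ownership was reassigned.
SELECT pg_get_serial_sequence('public.foo', 'id');

-- Lists the indexes now attached to public.foo, including foo_pkey.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'foo';

-- Should return 0 once the temporary schema has been dropped.
SELECT count(*)
FROM pg_namespace
WHERE nspname = 'tmp';
```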