Ser*_*rge 6 sql postgresql performance shuffle
How can I efficiently shuffle the values of a specific column in a table with more than 100k rows?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
To anonymize the data, I have to shuffle the values of the 'first_name' column in place (I am not allowed to create a new table).
My attempt:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
This takes hours to complete.
Is there an efficient way to do this?
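The problem with Postgres is that every UPDATE effectively means a DELETE + INSERT.
To see how the CTE itself performs, run the query as a SELECT instead of an UPDATE and check its EXPLAIN ANALYZE output.
The alternative when every row has to change is to rebuild the table: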
CREATE TABLE new_table AS
SELECT * ....
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;
-- then re-create the indexes and constraints
Sorry, that is not an option for you :(
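For readers who are allowed to create a table, the rebuild outlined above might look roughly like the sketch below. The name person_shuffled is made up for illustration, and the sketch assumes nothing else (foreign keys, views) depends on person. Any additional indexes or constraints would have to be re-created as well.

```sql
BEGIN;

-- Build the shuffled copy in one pass: pair each id with a first_name
-- taken from an independently randomized ordering of the same table.
CREATE TABLE person_shuffled AS   -- hypothetical table name
WITH first_names AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           first_name AS new_first_name
    FROM person
),
ids AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           id,
           last_name
    FROM person
)
SELECT ids.id,
       first_names.new_first_name AS first_name,
       ids.last_name
FROM ids
JOIN first_names USING (rn);

-- Swap the tables and restore the primary key from the original definition.
DROP TABLE person;
ALTER TABLE person_shuffled RENAME TO person;
ALTER TABLE person ADD CONSTRAINT person_pkey PRIMARY KEY (id);

COMMIT;
```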
Edit: after reading what a_horse_with_no_name wrote, it looks like you need:
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again, for performance questions it is better to include the ANALYZE / EXPLAIN results.
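A minimal sketch of how those numbers could be collected. The first statement times only the two shuffling CTEs by running them as a SELECT. The second times the full UPDATE. Because EXPLAIN ANALYZE really executes the statement, it is wrapped in a transaction that is rolled back.

```sql
-- 1) Cost of the shuffle itself, without writing anything.
EXPLAIN ANALYZE
WITH first_names AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           first_name AS new_first_name
    FROM person
),
ids AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           id AS ref_id
    FROM person
)
SELECT ids.ref_id, first_names.new_first_name
FROM first_names
JOIN ids ON first_names.rn = ids.rn;

-- 2) Cost of the full UPDATE. The statement is executed for real,
--    so the transaction is rolled back to leave the data unchanged.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
WITH first_names AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           first_name AS new_first_name
    FROM person
),
ids AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           id AS ref_id
    FROM person
)
UPDATE person
SET first_name = first_names.new_first_name
FROM first_names
JOIN ids ON first_names.rn = ids.rn
WHERE person.id = ids.ref_id;

ROLLBACK;
```

Comparing the two plans shows whether the time goes into the sorting and joining or into writing the new row versions.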
This takes 5 seconds to shuffle 500,000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
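The idea is to pick the "next" name after sorting the data randomly. That is just as good as picking a random name.
It is possible that not all names get shuffled, but if you run it two or three times, that should be good enough.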
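One detail of the lead/lag version: on a table with three or more rows, the name of the first row in the random order is handed to nobody, and the second-to-last name is used twice, because the last row falls back to lag(). If every original name has to survive exactly once, a possible variant (a sketch, not part of the answer above) wraps around to the first row instead. The column aliases next_name, wrap_name and is_last are made up for illustration.

```sql
WITH names AS (
    SELECT id,
           lead(first_name) OVER w        AS next_name,  -- name of the following row
           first_value(first_name) OVER w AS wrap_name,  -- name of the first row
           lead(id) OVER w IS NULL        AS is_last     -- id is NOT NULL, so only the last row has no successor
    FROM person
    WINDOW w AS (ORDER BY random())
)
UPDATE person
SET first_name = CASE WHEN names.is_last
                      THEN names.wrap_name   -- last row takes the first row's name
                      ELSE names.next_name   -- every other row takes its successor's name
                 END
FROM names
WHERE person.id = names.id;
```

This is a cyclic shift along a random ordering, so the set of names is preserved. Testing is_last rather than using coalesce also keeps the shift intact when some first_name values are NULL.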
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right-hand side checks whether any first name stayed the same after the "randomizing".
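The fiddle's own check query is not copied here. A similar count can be obtained without a second table, assuming the lead/lag shuffle above, by letting the UPDATE report old and new values through RETURNING. The CTE name shuffled and the aliases old_name and new_name are made up for this sketch.

```sql
WITH names AS (
    SELECT id,
           first_name,                            -- value before the shuffle
           lead(first_name) OVER w AS first_1,
           lag(first_name)  OVER w AS first_2
    FROM person
    WINDOW w AS (ORDER BY random())
),
shuffled AS (
    UPDATE person
    SET first_name = coalesce(first_1, first_2)
    FROM names
    WHERE person.id = names.id
    RETURNING names.first_name  AS old_name,      -- taken from the pre-update snapshot
              person.first_name AS new_name       -- value written by the UPDATE
)
SELECT count(*) AS unchanged
FROM shuffled
WHERE old_name IS NOT DISTINCT FROM new_name;
```

A non-zero result usually means neighbouring rows in the random order happened to share the same first name, which is the case where running the shuffle again helps.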