jpb*_*ini 4 postgresql performance join query-performance
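I have a few tables that I need to join. I have an employees table (about 400K rows), a companies table (about 10M rows), and an employee_companies table that stores where each person works.
Basically, I need to get all the employees who meet certain conditions (they work at a company that has a website, they are located in a certain country, and so on). I wrote a query to get this information, but it takes too long. I need to speed it up.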
SELECT DISTINCT "employees".*
FROM "employees"
INNER JOIN "employee_companies" ON "employee_companies"."employee_id" = "employees"."id"
INNER JOIN "companies" ON "companies"."id" = "employee_companies"."company_id"
WHERE (employee_companies.employee_id IS NOT NULL)
AND (companies.website IS NOT NULL)
AND (employees.country = 'Uruguay')
ORDER BY employees.connections DESC
This is the plan for that query:
Unique (cost=877170.24..880752.72 rows=62304 width=1064) (actual time=24023.736..26001.876 rows=73318 loops=1)
-> Sort (cost=877170.24..877326.00 rows=62304 width=1064) (actual time=24023.733..24305.989 rows=77579 loops=1)
Sort Key: employees.connections DESC, employees.id, employees.name, employees.link, employees.role, employees.area, employees.profile_picture, employees.summary, employees.current_companies, employees.previous_companies, employees.skills, employees.education, employees.languages, employees.volunteer, employees.groups, employees.interests, employees.search_vector, employees.secondary_search_vector, employees.email_status, employees.languages_count, employees.role_hierarchy
Sort Method: external merge Disk: 85816kB
-> Nested Loop (cost=2642.38..843246.15 rows=62304 width=1064) (actual time=139.870..23056.234 rows=77579 loops=1)
-> Hash Join (cost=2641.95..221744.50 rows=77860 width=1068) (actual time=139.841..22617.587 rows=77579 loops=1)
Hash Cond: (employees.id = employee_companies.employee_id)
-> Seq Scan on employees (cost=0.00..212178.88 rows=409672 width=1064) (actual time=8.145..22369.166 rows=393725 loops=1)
Filter: ((country)::text = 'Uruguay'::text)
Rows Removed by Filter: 1075
-> Hash (cost=1666.42..1666.42 rows=78042 width=8) (actual time=44.675..44.675 rows=78042 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 4073kB
-> Seq Scan on employee_companies (cost=0.00..1666.42 rows=78042 width=8) (actual time=0.007..22.901 rows=78042 loops=1)
Filter: (employee_id IS NOT NULL)
-> Index Scan using companies_pkey on companies (cost=0.43..7.97 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=77579)
Index Cond: (id = employee_companies.company_id)
Filter: (website IS NOT NULL)
Planning time: 1.957 ms
Execution time: 26025.045 ms
These are the relevant indexes on my tables:
employees:
"employees_pkey" PRIMARY KEY, btree (id)
"ix_employees_country" btree (country)
companies:
"companies_pkey" PRIMARY KEY, btree (id)
"empty_websites" btree (website) WHERE website IS NULL
"index_companies_on_website" btree (website)
"not_empty_websites" btree (website) WHERE website IS NOT NULL
employee_companies:
"employee_companies_pkey" PRIMARY KEY, btree (id)
"index_employee_companies_on_company_id" btree (company_id)
"index_employee_companies_on_employee_id" btree (employee_id)
"index_employee_companies_on_employee_id_and_company_id" btree (employee_id, company_id)
"not_empty_employee_id" btree (employee_id) WHERE employee_id IS NOT NULL
Is there a better, more efficient way to do what I am trying to do?
Thanks!
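Based on a simulation built on some guesswork, I think you can improve your query somewhat by dropping the DISTINCT clause (although there will still be an implicit DISTINCT) and by avoiding one JOIN. The query would look like this: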
SELECT
employees.*
FROM
employees
WHERE
employee_id IN
(SELECT
-- Choose all employees from companies with website
employee_id
FROM
employee_companies
JOIN companies ON companies.company_id = employee_companies.company_id
WHERE
companies.website IS NOT NULL
)
-- Now filter only employees from 'Germany'
AND employees.country = 'Germany'
ORDER BY
employees.connections DESC ;
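The query above uses the column names of the simulated schema (employee_id, company_id). As a rough sketch, the same idea written against the column names that appear in the question's plan and index list (employees.id, companies.id) and its 'Uruguay' filter might look like the following. The question's employee_id IS NOT NULL condition is left out here on the assumption that it is redundant: a NULL employee_id can never match in an IN comparison.

```sql
-- Sketch only: same semi-join as above, adapted to the question's schema.
SELECT
    employees.*
FROM
    employees
WHERE
    employees.id IN
        (SELECT
            -- All employees linked to a company that has a website
            employee_companies.employee_id
         FROM
            employee_companies
            JOIN companies ON companies.id = employee_companies.company_id
         WHERE
            companies.website IS NOT NULL
        )
    AND employees.country = 'Uruguay'
ORDER BY
    employees.connections DESC ;
```

Because each employee row is returned at most once by the IN test, no DISTINCT over all the wide employees columns is needed.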
The data used for the simulation was generated as follows:
Table and index definitions:
CREATE TABLE employees
(
employee_id integer PRIMARY KEY,
country text,
connections integer,
something_else text
) ;
CREATE INDEX idx_employee_country
ON employees (country) ;
CREATE TABLE companies
(
company_id integer PRIMARY KEY,
website text,
something_else text
) ;
CREATE INDEX not_empty_websites
ON companies(company_id, website) WHERE website IS NOT NULL ;
CREATE TABLE employee_companies
(
employee_id integer NOT NULL REFERENCES employees(employee_id),
company_id integer NOT NULL REFERENCES companies(company_id),
PRIMARY KEY (employee_id, company_id)
) ;
CREATE INDEX company_employee
ON employee_companies(company_id, employee_id) ;
1,000,000 companies (changing this to 10M does not make much difference). I assume 90% of them have a website.
INSERT INTO
companies
(company_id, website)
SELECT
generate_series(1, 1000000),
CASE WHEN random() > 0.1 THEN 'web.com' END AS website ;
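80K employees (intended to be about 10% German; because (random()*10)::integer rounds to the nearest integer, the value 0 actually comes up closer to 5% of the time):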
INSERT INTO
employees
(employee_id, country, connections)
SELECT
generate_series(1, 80000),
case (random()*10)::integer
when 0 then 'Germany'
when 1 then 'United Kingdom'
when 2 then 'United States'
else 'Angola'
end AS country,
(random()*10)::integer AS connections ;
200K employee x company rows (which means that, on average, people have worked at about 3 companies; 200K / 80K is roughly 2.5):
INSERT INTO
employee_companies
(employee_id, company_id)
SELECT DISTINCT
(random()*79999)::integer + 1,
(random()*999999)::integer + 1
FROM
generate_series (1, 200000) ;
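Since the data is random, it may be worth checking that the generated tables really have the assumed shape before timing anything. The queries below are a minimal sketch against the simulated schema defined above; the column aliases (n_employees, pct, with_website, total, avg_companies) are only illustrative. Note that (random()*10)::integer rounds to the nearest integer, so the share of 'Germany' rows may come out lower than the stated 10%.

```sql
-- Share of employees per country
SELECT
    country,
    count(*) AS n_employees,
    round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM
    employees
GROUP BY
    country
ORDER BY
    n_employees DESC ;

-- How many companies have a website (expected: roughly 90%)
SELECT
    count(*) FILTER (WHERE website IS NOT NULL) AS with_website,
    count(*) AS total
FROM
    companies ;

-- Average number of companies per employee that has at least one link row
SELECT
    round(count(*)::numeric / count(DISTINCT employee_id), 2) AS avg_companies
FROM
    employee_companies ;
```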
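You can see a scaled-down version of this simulation at dbfiddle here. If this simulated data is reasonably close to your scenario, changing the query makes server execution about 3 times faster. I suggest you give it a try.
A scenario in which the simulated data (scaled down 25 times) more closely resembles the real one does not give such a large improvement; still, the rewritten query is about 1.5 times faster.
Check it at this dbfiddle.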
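To reproduce the comparison on your own server, one option is to run both query shapes under EXPLAIN (ANALYZE, BUFFERS) and compare the reported execution times and plans. This is a sketch against the simulated schema above; the first statement is my reconstruction of the question's query using the simulated column names, not something taken from the fiddles, and the actual speed-up will depend on your data and settings.

```sql
-- Refresh planner statistics after the bulk inserts
ANALYZE employees ;
ANALYZE companies ;
ANALYZE employee_companies ;

-- Shape 1: two joins plus DISTINCT (as in the question)
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT
    employees.*
FROM
    employees
    JOIN employee_companies
        ON employee_companies.employee_id = employees.employee_id
    JOIN companies
        ON companies.company_id = employee_companies.company_id
WHERE
    companies.website IS NOT NULL
    AND employees.country = 'Germany'
ORDER BY
    employees.connections DESC ;

-- Shape 2: IN subquery, no DISTINCT (as in this answer)
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    employees.*
FROM
    employees
WHERE
    employee_id IN
        (SELECT
            employee_id
         FROM
            employee_companies
            JOIN companies
                ON companies.company_id = employee_companies.company_id
         WHERE
            companies.website IS NOT NULL
        )
    AND employees.country = 'Germany'
ORDER BY
    employees.connections DESC ;
```

Run each statement a few times so that caching effects even out, and look at whether the Unique / Sort step over every employees column disappears in the second plan.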