How to speed up a slow multi-join query in a small Vertica database (~120K total rows, 10 minutes)

Mat*_*ell 4 sql performance join vertica

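I'd love help understanding why this join-heavy query takes about 10 minutes to run on a small database of seven tables with fewer than 120K rows in total and, ideally, suggestions for making it faster on our small four-node cluster. I've put supporting information here: https://gist.github.com/anonymous/8862796 (table listings and table sizes), and the query and its EXPLAIN VERBOSE output follow below. I ran ANALYZE_WORKLOAD() on this query, then followed its suggestion to run ANALYZE_STATISTICS on all tables. That brought no improvement. I then followed its second suggestion and ran the Database Designer, which made performance even worse. I'd really appreciate your help.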

Profile info

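Thanks for the tip below re: PROFILE. I ran it and put the results here: https://gist.github.com/anonymous/8935190. The output is 8K lines long, so perhaps I didn't run it correctly (too verbose). Question: how do I get started analyzing it?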

Query backstory

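The query is messy mainly because it is generated dynamically for every run of our machine learning research software, which has to walk a path through the ER tables involved, applying various conditions along the way. In this case the path is [rates, movie, rates, ml_user, rates, movie, rates]. The query is built up incrementally as the program explores the solution space, which is why it (currently) lacks the optimizations that @wumpz and @Bohemian kindly and correctly suggested, e.g., eliminating sub-selects. This means I'm somewhat "stuck" with the current form in the short term :-/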
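To make the path walk concrete, here is a minimal Python sketch of how a generator could turn that path into the aggregated sub-select shown in the plan below. It is not the actual research software: the function name `build_path_subquery` and the `FOREIGN_KEYS` map are hypothetical, and the rule "consecutive visits to the same table must be different rows" is inferred from the `<>` conditions in the query.

```python
from collections import Counter, defaultdict

# Hypothetical foreign-key map: (referencing table, referenced table) -> FK column.
FOREIGN_KEYS = {
    ("rates", "movie"): "movie_id",
    ("rates", "ml_user"): "ml_user_id",
}


def build_path_subquery(path, value_column="rating"):
    """Build the aggregated sub-select for one walk through the ER tables."""
    # Number each visit to a table: rates1, movie1, rates2, ...
    seen = Counter()
    aliases = []
    for table in path:
        seen[table] += 1
        aliases.append(f"{table}{seen[table]}")
    steps = list(zip(path, aliases))

    # One equi-join condition per adjacent pair on the path.
    conditions = []
    for (left_t, left_a), (right_t, right_a) in zip(steps, steps[1:]):
        if (left_t, right_t) in FOREIGN_KEYS:
            fk = FOREIGN_KEYS[(left_t, right_t)]
            conditions.append(f"{right_a}.id = {left_a}.{fk}")
        elif (right_t, left_t) in FOREIGN_KEYS:
            fk = FOREIGN_KEYS[(right_t, left_t)]
            conditions.append(f"{left_a}.id = {right_a}.{fk}")
        else:
            raise ValueError(f"no relationship between {left_t} and {right_t}")

    # Consecutive visits to the same table must refer to different rows.
    by_table = defaultdict(list)
    for table, alias in steps:
        by_table[table].append(alias)
    for table_aliases in by_table.values():
        for first, second in zip(table_aliases, table_aliases[1:]):
            conditions.append(f"{first}.id <> {second}.id")

    start, end = aliases[0], aliases[-1]
    conditions.append(f"{end}.{value_column} IS NOT NULL")
    from_clause = ", ".join(f"{table} {alias}" for table, alias in steps)
    return (
        f"SELECT {start}.id AS id, AVG({end}.{value_column}) AS val "
        f"FROM {from_clause} "
        f"WHERE {' AND '.join(conditions)} "
        f"GROUP BY {start}.id"
    )


sql = build_path_subquery(
    ["rates", "movie", "rates", "ml_user", "rates", "movie", "rates"]
)
```

Because each step only appends an alias and a condition, a generator of this shape naturally ends up with one long comma join, which is why hand optimizations are hard to apply afterwards.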

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------

 Opt Vertica Options
 --------------------
 PLAN_OUTPUT_SUPER_VERBOSE


 EXPLAIN VERBOSE
 SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val 
 FROM (SELECT id FROM rates) relVarTable0
 LEFT JOIN
 (SELECT rates1.id AS id, AVG(rates4.rating) AS val 
 FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4 
 WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL 
 GROUP BY rates1.id) relVarTable1
  ON relVarTable0.id = relVarTable1.id
 LEFT JOIN
 (SELECT rates1.id AS id, rates1.rating AS val 
 FROM rates rates1 
 WHERE rates1.rating IS NOT NULL ) relVarTable2
  ON relVarTable0.id = relVarTable2.id;

 Access Path:
 Sort Key: (V(1,1))
  LDISTRIB_UNSEGMENTED
 +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
 |  Join Cond: (relVarTable0.id = relVarTable2.id)
 |  Execute on: All Nodes
 |  Sort Key: (V(1,1))
 |  LDISTRIB_UNSEGMENTED
 | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
 | |      Join Cond: (relVarTable0.id = relVarTable1.id)
 | |      Execute on: All Nodes
 | |      Sort Key: (V(1,1))
 | |      LDISTRIB_UNSEGMENTED
 | | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
 | | | |      Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | |      Projection: movielens_test.rates_b0
 | | | |      Materialize: rates.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | |      LDISTRIB_SEGMENTED
 | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
 | | | |      Aggregates: sum_float(<SVAR>), count(<SVAR>)
 | | | |      Group By: rates1.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id)
 | | | |      LDISTRIB_SEGMENTED
 | | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
 | | | | |      Join Cond: (movie2.id = rates4.movie_id)
 | | | | |      Join Filter: (rates3.id <> rates4.id)
 | | | | |      Execute on: All Nodes
 | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | |      Join Cond: (movie2.id = rates3.movie_id)
 | | | | | |      Join Filter: (movie1.id <> movie2.id)
 | | | | | |      Execute on: All Nodes
 | | | | | |      Runtime Filter: (SIP1(HashJoin): movie2.id)
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
 | | | | | | |      Join Cond: (ml_user1.id = rates3.ml_user_id)
 | | | | | | |      Join Filter: (rates2.id <> rates3.id)
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | |      Join Cond: (ml_user1.id = rates2.ml_user_id)
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Runtime Filter: (SIP2(HashJoin): ml_user1.id)
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
 | | | | | | | | |      Join Cond: (movie1.id = rates2.movie_id)
 | | | | | | | | |      Join Filter: (rates1.id <> rates2.id)
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
 | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | |      Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | | | |      Join Cond: (movie1.id = rates1.movie_id)
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | | |      Materialize: rates1.id, rates1.movie_id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | | | | | |      Materialize: movie1.id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
 | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | |      Projection: movielens_test.ml_user_b0
 | | | | | | | | |      Materialize: ml_user1.id
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
 | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | |      Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
 | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | |      Materialize: movie2.id
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
 | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | |      Projection: movielens_test.rates_b0
 | | | | | |      Materialize: rates4.rating, rates4.id, rates4.movie_id
 | | | | | |      Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | | | | |      Execute on: All Nodes
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
 | |      Execute on: All Nodes
 | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | |      LDISTRIB_UNSEGMENTED
 | | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
 | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | |      Projection: movielens_test.rates_b0
 | | |      Materialize: rates1.rating, rates1.id
 | | |      Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_SEGMENTED


 ------------------------------ 

Up_*_*One 7

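First, I see far too many NO STATISTICS markers in your explain plan. That is a bad sign, and you should fix it: without statistics the optimizer is working from guessed row counts.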
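A quick way to check whether a fix took effect is to count the flagged operators before and after. The following is a minimal Python sketch, assuming the plan has been saved as text in the same format as the EXPLAIN VERBOSE output in the question; the function name `paths_without_statistics` is made up for illustration.

```python
import re

# Operator lines in the plan carry both "(NO STATISTICS)" and "(PATH ID: n)".
PATH_ID = re.compile(r"\(PATH ID: (\d+)\)")


def paths_without_statistics(plan_text):
    """Return the PATH IDs of plan operators flagged with NO STATISTICS."""
    flagged = []
    for line in plan_text.splitlines():
        match = PATH_ID.search(line)
        if match and "(NO STATISTICS)" in line:
            flagged.append(int(match.group(1)))
    return flagged


# Two operator lines abbreviated from the plan in the question.
sample = """\
 | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
 | | | |      Projection: movielens_test.rates_b0
 | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
"""
print(paths_without_statistics(sample))
```

If the list is still non-empty after collecting statistics, the plan is likely reading projections that were not covered by the statistics run.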

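Second, look at the order of the tables in the joins. Hash joins are being created, and you are doing a full table scan on the largest table. Fix this by arranging each hash join so that the small table is the inner (build) side and the large table is the outer (probe) side, rather than the other way around.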
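Why the order matters can be seen in a minimal Python sketch of a generic hash join. This illustrates the technique only, not Vertica's implementation, and the rows are made-up toy data. The inner input is held in memory as a hash table, so it should be the small one; the outer input is streamed past it once.

```python
from collections import defaultdict


def hash_join(outer_rows, inner_rows, outer_key, inner_key):
    """Join two lists of dict rows; yields (outer_row, inner_row) pairs."""
    # Build phase: the whole inner input is loaded into an in-memory hash table.
    table = defaultdict(list)
    for row in inner_rows:
        table[row[inner_key]].append(row)
    # Probe phase: the outer input is streamed through once.
    for row in outer_rows:
        for match in table.get(row[outer_key], ()):
            yield row, match


# Toy data: a few ratings (large side in practice) and two movies (small side).
rates = [
    {"id": 1, "movie_id": 10},
    {"id": 2, "movie_id": 11},
    {"id": 3, "movie_id": 10},
]
movie = [{"id": 10}, {"id": 11}]

pairs = list(hash_join(rates, movie, "movie_id", "id"))
```

Swapping the arguments gives the same matches but builds the hash table on the larger input, which costs more memory and is more likely to spill to disk.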

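  1. Run the DBD (Database Designer).
  2. Run ANALYZE_STATISTICS.
  3. Run EXPLAIN on the query and make sure the new projections are actually used to answer it.
  4. Check whether movielens_test.rates can be partitioned.
    • If you are running on a single node, MPP will not be used, and MPP is a huge win.
    • Run PROFILE on your query and post the output.
    • Make sure your DDL applies the right data encodings and a column sort order that best serves the predicates you use (I believe the DBD does this, but I always check).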
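Steps 2 and 3 and the profiling bullet can be scripted. The Python sketch below only assembles the statements as strings, to be run in vsql or through whichever client you use. The schema name movielens_test and the table names are taken from the plan in the question; the helper name `tuning_statements` is hypothetical.

```python
def tuning_statements(query, schema, tables):
    """Return the Vertica statements for statistics, EXPLAIN and PROFILE runs."""
    statements = [
        # ANALYZE_STATISTICS takes the table name as a quoted string.
        f"SELECT ANALYZE_STATISTICS('{schema}.{table}');"
        for table in tables
    ]
    body = query.rstrip().rstrip(";")
    statements.append(f"EXPLAIN {body};")  # re-check the plan for NO STATISTICS
    statements.append(f"PROFILE {body};")  # collect execution counters
    return statements


# Stand-in query for illustration; substitute the generated query.
for statement in tuning_statements(
    "SELECT id FROM rates;", "movielens_test", ["rates", "movie", "ml_user"]
):
    print(statement)
```

Collecting statistics before the EXPLAIN matters, since the plan is only worth reading once the optimizer has real row counts.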

As a last point, here is something I always do:

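Open the database log and watch it while the query runs. If you see data spilling to disk, that may be your problem: the data being sorted is larger than the memory you have allocated.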

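Another option is to create a pre-join projection for the first subquery. Do this only if your data does not change much, though, because pre-join projections perform very poorly during data loads.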