对大型数据库的查询终止与服务器的连接,与 LIMIT 一起使用

doe*_*unt 2 postgresql mac-os-x kill connections jit

我正在尝试在大型数据库上运行查询而不终止与服务器的连接。

我在具有 16GB 内存和大约 40GB 可用磁盘空间的 Mac 上使用 Postgres 12.1。根据数据,数据库为 78GB pg_database_size,最大表为 20GB pg_total_relation_size

无论我运行哪个非工作查询,我(从日志中)得到的错误是:

server process (PID xxx) was terminated by signal 9: Killed: 9
Run Code Online (Sandbox Code Playgroud)

在 VS 代码中,错误是"lost connection to server".

两个不起作用的例子是:

UPDATE table
SET column = NULL
WHERE column = 0;
Run Code Online (Sandbox Code Playgroud)
select columnA
from table1
where columnA NOT IN (
select columnB
from table2
);
Run Code Online (Sandbox Code Playgroud)

我可以通过添加 1,000,000 个来运行某些查询(例如上面的查询)LIMIT

我怀疑由于临时文件而导致磁盘不足,但在日志(带有log_temp_files = 0)中,我看不到任何正在写入的临时文件。

我尝试增加和减少work_mem, maintenance_work_mem,shared_bufferstemp_buffers。没有一个起作用,性能大致相同。

我尝试删除所有索引,这降低了某些查询的“成本”,但它们仍然终止了与服务器的连接。

我的问题可能是什么?我该如何进一步解决这个问题?

此外,我了解到超时查询的临时文件存储在 pqsql_tmp 中。我检查了该文件夹,它没有很大的文件。临时文件可以存储在其他地方吗?


运行失败查询的 posgtres 日志如下所示:

2020-02-17 09:31:08.626 CET [94908] LOG:  server process (PID xxx) was terminated by signal 9: Killed: 9
2020-02-17 09:31:08.626 CET [94908] DETAIL:  Failed process was running: update table
        set columnname = NULL
        where columnname = 0;

2020-02-17 09:31:08.626 CET [94908] LOG:  terminating any other active server processes
2020-02-17 09:31:08.626 CET [94919] WARNING:  terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94919] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94919] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.626 CET [94914] WARNING:  terminating connection because of crash of another server process
2020-02-17 09:31:08.626 CET [94914] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exi$
2020-02-17 09:31:08.626 CET [94914] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-17 09:31:08.629 CET [94908] LOG:  all server processes terminated; reinitializing
2020-02-17 09:31:08.698 CET [94927] LOG:  database system was interrupted; last known up at 2020-02-17 09:30:57 CET
2020-02-17 09:31:08.901 CET [94927] LOG:  database system was not properly shut down; automatic recovery in progress
2020-02-17 09:31:08.906 CET [94927] LOG:  invalid record length at 17/894C438: wanted 24, got 0
2020-02-17 09:31:08.906 CET [94927] LOG:  redo is not required
Run Code Online (Sandbox Code Playgroud)

运行EXPLAIN第二个示例查询将返回:

Seq Scan on gas_prices_all p  (cost=459.93..5635583.33 rows=128975016 width=16)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on gas_station g  (cost=0.00..423.14 rows=14714 width=16)
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true
Run Code Online (Sandbox Code Playgroud)

FWIW 对于“成本”大约为零的查询,我得到相同的错误。


更新:我发现了今天早些时候的崩溃报告:

Process:               postgres [41042]
Path:                  /Users/USER/*/postgres
Identifier:            postgres
Version:               0
Code Type:             X86-64 (Native)
Parent Process:        postgres [40107]
Responsible:           postgres [41042]
User ID:               502

Date/Time:             2020-02-18 11:16:26.210 +0100
OS Version:            Mac OS X 10.14.5 (18F132)
Report Version:        12
Anonymous UUID:        F41CCD21-C558-6CB0-316D-D1FF3E279576

Sleep/Wake UUID:       5F08EAEC-373A-4D19-A243-E812E68D2697

Time Awake Since Boot: 1600000 seconds
Time Since Wake:       5700 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00000001044c4060
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x1044c4060:
    __LINKEDIT             0000000104466000-00000001044c4000 [  376K] r--/rw- SM=COW  /Users/USER/*/*.dylib
--> VM_ALLOCATE            00000001044c4000-00000001044c5000 [    4K] r-x/rwx SM=ZER  
    VM_ALLOCATE            00000001044c5000-00000001044c6000 [    4K] rw-/rwx SM=ZER  

Application Specific Information:
crashed on child side of fork pre-exec

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   ???                             0x00000001044c4060 0 + 4367073376
1   postgres                        0x0000000103798851 ExecAgg + 1185 (executor.h:308)
2   postgres                        0x0000000103785d0f standard_ExecutorRun + 287 (execMain.c:1652)
3   postgres                        0x0000000103789c4e ParallelQueryMain + 670 (execParallel.c:1405)
4   postgres                        0x00000001036520ce ParallelWorkerMain + 1054 (parallel.c:1434)
5   postgres                        0x000000010385bec5 StartBackgroundWorker + 533 (bgworker.c:834)
6   postgres                        0x000000010386acb9 maybe_start_bgworkers + 1161
7   postgres                        0x00000001038696c5 sigusr1_handler + 357 (postmaster.c:5167)
8   libsystem_platform.dylib        0x00007fff76195b5d _sigtramp + 29
9   ???                             0x0000000000003200 0 + 12800
10  postgres                        0x00000001037d54ae main + 1678
11  libdyld.dylib                   0x00007fff75faa3d5 start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x00000001044c4060  rbx: 0x00007f98b9045600  rcx: 0x00000001072c48d8  rdx: 0x00007ffeec6242f4
  rdi: 0x00007f98b9047520  rsi: 0x00007f98b9045fc0  rbp: 0x00007ffeec624320  rsp: 0x00007ffeec624288
   r8: 0x00000000014bafca   r9: 0xffffffff00000000  r10: 0x00000001072c48d0  r11: 0x0000000000000005
  r12: 0x0000000103c51220  r13: 0x00007f98b9047510  r14: 0x00007f98b9045fc0  r15: 0x00007f98b90459a0
  rip: 0x00000001044c4060  rfl: 0x0000000000010246  cr2: 0x00000001044c4060

Logical CPU:     0
Error Code:      0x00000015
Trap Number:     14


Binary Images:
      **lots of stuff**

External Modification Summary:
  Calls made by other processes targeting this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by all processes on this machine:
    task_for_pid: 134420121
    thread_create: 0
    thread_set_state: 0

VM Region Summary:
ReadOnly portion of Libraries: Total=398.8M resident=0K(0%) swapped_out_or_unallocated=398.8M(100%)
Writable regions: Total=4.2G written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=4.2G(100%)

                                VIRTUAL   REGION 
REGION TYPE                        SIZE    COUNT (non-coalesced) 
===========                     =======  ======= 
Kernel Alloc Once                    8K        1 
MALLOC                            81.0M       19 
MALLOC guard page                   16K        3 
MALLOC_LARGE (reserved)             72K        2         reserved VM address space (unallocated)
STACK GUARD                       56.0M        1 
Stack                             8192K        1 
VM_ALLOCATE                        4.1G        4 
__DATA                            18.0M      200 
__FONT_DATA                          4K        1 
__LINKEDIT                       232.2M       11 
__TEXT                           166.6M      199 
__UNICODE                          564K        1 
mapped file                         64K        1 
shared memory                      552K        6 
===========                     =======  ======= 
TOTAL                              4.7G      450 
TOTAL, minus reserved VM space     4.7G      450 
Run Code Online (Sandbox Code Playgroud)

我可以在日志中找到两件可能有趣的事情:

2020-02-18 19:01:52.044375+0100  localhost kernel[0]: CODE SIGNING: process 51528[postgres]: rejecting invalid page at address 0x1100c1000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)
2020-02-18 19:01:52.044805+0100  localhost ReportCrash[52560]: unknown nested kcdata type: 0x1004, size: 2108
Run Code Online (Sandbox Code Playgroud)
2020-02-18 19:01:55.268060+0100  localhost ReportCrash[52560]: (CrashReporterSupport) Saved crash report for postgres[51528] version 0 to postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.273159+0100  localhost ReportCrash[52560]: (CrashReporterSupport) Removing excessive log: postgres_2020-02-18-190155_MacBook-Pro.crash
2020-02-18 19:01:55.274208+0100  localhost ReportCrash[52560]: shouldDisplayUnexpectedlyQuitNotification is NO
Run Code Online (Sandbox Code Playgroud)

更新:我运行了第二个示例查询(上面),LIMIT直到它从工作(LIMIT2,200,000)到终止与服务器的连接(LIMIT2,300,000)。EXPLAIN ANALYZE2,200,000 条查询的内容LIMIT是:

Limit  (cost=459.93..96581.42 rows=2200000 width=16) (actual time=13.228..38573.440 rows=2200000 loops=1)
  ->  Seq Scan table1  (cost=459.93..5635583.33 rows=128975016 width=16) (actual time=13.227..38374.070 rows=2200000 loops=1)
        Filter: (NOT (hashed SubPlan 1))
        Rows Removed by Filter: 139729529
        SubPlan 1
          ->  Seq Scan on table2  (cost=0.00..423.14 rows=14714 width=16) (actual time=0.350..6.925 rows=14714 loops=1)
Planning Time: 0.138 ms
Execution Time: 38685.762 ms
Run Code Online (Sandbox Code Playgroud)

EXPLAIN2,300,000上LIMIT是:(EXPLAIN ANALYZE此处崩溃)

Limit  (cost=459.93..100950.58 rows=2300000 width=16)
  ->  Seq Scan on table1  (cost=459.93..5635583.33 rows=128975016 width=16)
        Filter: (NOT (hashed SubPlan 1))
        SubPlan 1
          ->  Seq Scan on table2  (cost=0.00..423.14 rows=14714 width=16)
JIT:
  Functions: 14
  Options: Inlining false, Optimization false, Expressions true, Deforming true
Run Code Online (Sandbox Code Playgroud)

我正在阅读这篇文章,因为 JIT 在这里启动,这是有道理的,jit_above_cost因为设置(默认)为 100,000。那么也许问题出在 JIT 上?


更新 2:jit=off第二个示例查询可以工作,并且也可以。速度快两倍。这是怎么回事?什么可能导致 JIT 在我的系统上出现问题?

jja*_*nes 5

“signal 9”表明这几乎肯定是内存不足杀手。您可以查看/var/log/kern.log或系统上调用的任何内容进行验证。临时文件上的磁盘空间不足不会导致此错误。

您应该对EXPLAIN查询进行操作以查看它正在使用什么计划。我通常会猜测这是一个疯狂的 HashAgg,除非你的第一个查询不应该进行任何聚合、哈希或其他操作。

您也可以尝试关闭过度使用。这样,您的进程就会遇到普通的内存不足错误,然后它们可以以有序的方式报告这些错误,而不是仅仅从轨道上被炸死而没有机会报告问题。有针对 Linux 的说明,也许你可以将它们改编为 MacOS。如果您可以在内存使用量开始大量增长时捕获该进程,但在其被终止之前,您可以获得有关内存分配位置的报告