Nginx + PHP-FPM occasionally returns 502

Mar*_*lka 5 php nginx fpm

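This question has been asked many times, but none of the answers helped. After several hours of digging, I've come here for help. I'm a developer with limited sysadmin experience, but since our ops person left, I'm the one trying to keep things running.

On one of our sites we recently started getting random 502 errors. It happens often, at least a dozen times a day (according to Nagios and, occasionally, reports from our users). I'm not aware of any configuration changes. The web stack is standard: nginx proxies requests to php-fpm, which runs a WordPress-based application.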


The nginx error log contains many messages like this:

[error] 31180#31180: *451395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: x.x.x.x, server: x.x, request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"

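Most of them have the server's own IP as the client IP (I don't know why, maybe some monitoring?), but there are also errors from random public IPs.

Roughly every hour, the PHP-FPM log emits warnings like these: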

WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 71 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 75 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 79 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 83 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 87 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 91 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 95 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 99 total children
WARNING: [pool www] server reached pm.max_children setting (100), consider raising it
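
To see how often the pool saturates, warnings like the ones above can be summarized with a short script. This is only a minimal sketch: the function name `summarize` and the regular expressions are my own, written against the exact message wording quoted here, and they may need adjusting for other PHP-FPM versions or log prefixes.

```python
import re

# Patterns for the two kinds of PHP-FPM warnings quoted above.
BUSY = re.compile(
    r"\[pool (?P<pool>\S+)\] seems busy .*?"
    r"spawning (?P<spawning>\d+) children, "
    r"there are (?P<idle>\d+) idle, and (?P<total>\d+) total children"
)
LIMIT = re.compile(
    r"\[pool (?P<pool>\S+)\] server reached pm\.max_children setting "
    r"\((?P<limit>\d+)\)"
)


def summarize(lines):
    """Count busy warnings, track the child-count range, and count limit hits."""
    totals = []
    limit_hits = 0
    for line in lines:
        busy = BUSY.search(line)
        if busy:
            totals.append(int(busy.group("total")))
        elif LIMIT.search(line):
            limit_hits += 1
    return {
        "busy_warnings": len(totals),
        "min_total": min(totals) if totals else None,
        "max_total": max(totals) if totals else None,
        "limit_hits": limit_hits,
    }


# Three lines copied from the log excerpt above.
sample = [
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, "
    "and 71 total children",
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, "
    "and 99 total children",
    "WARNING: [pool www] server reached pm.max_children setting (100), "
    "consider raising it",
]

summary = summarize(sample)
print(summary)
```

Running it over the full log, with timestamps kept, should show whether the 502s in the nginx log line up with the moments the pool hits its limit.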

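Things I have tried

Restarting

Obvious, but it didn't help at all.

Adding resources and PHP-FPM child processes

  • Adding more RAM and CPU didn't help. The disk isn't full, and inode usage is low.
  • Along with the added resources, I raised pm.max_children to 100. It was originally 40, which had been fine for years of operation. After seeing the logs, I tried raising it to 75, then to 100.
  • Another site with several times the traffic runs on less hardware and works fine. This site doesn't serve anything heavy; it's mostly just a blog.
  • For completeness, the FPM configuration looks like this: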

    pm.max_children = 100
    pm.start_servers = 24
    pm.min_spare_servers = 4
    pm.max_spare_servers = 64
    pm.max_requests = 500
    
  • The logs don't mention anything about running out of memory (OOM) either.
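
As a sanity check on the pool settings, here is a small sketch that tests the values above against the consistency rules PHP-FPM applies to a dynamic pool, as far as I understand them. It assumes `pm = dynamic` (the excerpt doesn't show the `pm` line), and the function name `check_dynamic_pool` is hypothetical, not part of any PHP-FPM tooling.

```python
def check_dynamic_pool(max_children, start_servers, min_spare, max_spare):
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    if min_spare <= 0 or max_spare <= 0:
        problems.append("spare server counts must be positive")
    if max_spare < min_spare:
        problems.append("pm.max_spare_servers is below pm.min_spare_servers")
    if min_spare > max_children or max_spare > max_children:
        problems.append("spare server counts exceed pm.max_children")
    if not (min_spare <= start_servers <= max_spare):
        problems.append("pm.start_servers is outside the min/max spare range")
    return problems


# Values copied from the FPM configuration above.
problems = check_dynamic_pool(
    max_children=100, start_servers=24, min_spare=4, max_spare=64
)
print(problems or "settings are internally consistent")
```

The posted values pass all four checks, so the settings at least don't contradict each other; that says nothing about whether 100 children is enough for the load.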

Looking into opcache

  • I read that opcache running out of memory could be the culprit. Alas, it still has free memory:

    Cache hits  89757614
    Cache misses    1174
    Used memory 58333696
    Free memory 75884032
    Wasted memory   0
    OOM restarts    0
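
For reference, the numbers above work out as follows. This is plain arithmetic on the reported values, treating the memory figures as bytes, which is an assumption on my part.

```python
# Values copied from the opcache status output above.
hits = 89757614
misses = 1174
used_bytes = 58333696
free_bytes = 75884032
wasted_bytes = 0

# Total cache size is the sum of used, free, and wasted memory.
total_bytes = used_bytes + free_bytes + wasted_bytes
used_fraction = used_bytes / total_bytes
hit_rate = hits / (hits + misses)

print(f"total: {total_bytes / 2**20:.0f} MiB")
print(f"used: {used_fraction:.1%}")
print(f"hit rate: {hit_rate:.4%}")
```

The total comes to exactly 128 MiB with well under half of it used, so opcache exhaustion doesn't look like the cause.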
    

Nginx timeouts

  • The nginx parameters shouldn't be the problem, since the buffer and timeout values seem fairly generous (I assume the 3000 is in seconds):

    client_header_timeout 3000;
    client_body_timeout 3000;
    fastcgi_read_timeout 3000;
    fastcgi_buffers 16 16k;
    fastcgi_buffer_size 32k;
    

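Other information

  • PHP-FPM isn't crashing; apart from the warnings about children, there is nothing in its log
  • xdebug is disabled
  • syslog and dmesg contain no relevant messages
  • PHP 7.0, nginx 1.12.2

Is there anything else I can try?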

