ESXi v5.5 出现随机崩溃

Dar*_*age 5 server-crashes hp hp-proliant vmware-esxi

硬件:类型:HP Proliant ML350 G5 RAM 22GB CPU 1 x Intel Xenon E5405 2.00GHz

OP:ESXi 5.5 刚刚从 5.1 更新,以尝试修复在相同硬件上的 ESXi 5.1 上发生的崩溃。

我试图找出我们的一台服务器崩溃的原因,它在 24 小时内有两次锁定。前面的内部错误灯闪烁红色,内部只有“#5 and #6 page 76 manual”“处理器 2”灯“琥珀色”和“电源”灯“绿色”闪烁。

在日志中,我在相关时间范围内可以看到的唯一错误是在日志下。这是原因吗?或者我还能做些什么来尝试记录/定位错误。

来自 zcat syslog.6.gz | 较少的

2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:57Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:01Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:04Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:15Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:23Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:27Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:31Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:46Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:48Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
Run Code Online (Sandbox Code Playgroud)

更新

设置 iLO 2 并访问日志确实显示了一些进度,我收到了很多已移除电源的消息。所以我开始怀疑电源,在移除UPS后,服务器已经稳定了5天。

Informational
iLO 2
05/29/2014 20:31
05/29/2014 20:31
1
Server power restored.
Informational
iLO 2
05/29/2014 20:31
05/29/2014 20:31
1
Server power removed.
Informational
iLO 2
05/29/2014 16:57
05/29/2014 16:57
1
Server power restored.
Informational
iLO 2
05/29/2014 16:57
05/29/2014 16:57
1
Server power removed.
Informational
iLO 2
05/29/2014 15:39
05/29/2014 15:39
1
Server power restored.
Informational
iLO 2
05/29/2014 15:39
05/29/2014 15:39
1
Server power removed. 
Run Code Online (Sandbox Code Playgroud)

更新 2

现在仍然不稳定 24 小时内再次崩溃 2 次

日志中相同

Informational
iLO 2
06/13/2014 05:21
06/13/2014 05:21
2
Server power removed.
Informational
iLO 2
06/13/2014 05:21
06/13/2014 05:21
3
Server power restored.
Run Code Online (Sandbox Code Playgroud)

发生这种情况后,iLO 界面会保持开启状态。Empty 中的 IML 日志不显示任何内容

在此处输入图片说明


更新 3

Status Summary  
    Server Name:    esx01.xx.xx; ProLiant ML350 G5
UUID:   32393534-3937-5A43-4A38-353130393248
Server Serial Number / Product ID:  CZJ851092H / 459279-425
System ROM:     D21  11/02/2008; backup system ROM: 11/02/2008
System Health:   Ok
Internal Health LED:     Ok
Server Power:   
 ON
UID Light:  
 OFF
Last Used Remote Console:       
Remote Console
Latest IML Entry:       IML Cleared (iLO 2 user:xxx)
iLO 2 Name:     ILOCZJ851092H
License Type:   iLO 2 Standard
iLO 2 Firmware Version:     1.61   08/31/2008
IP address:     192.168.2.2
Active Sessions:    iLO 2 user:xxx
Latest iLO 2 Event Log Entry:   Browser login: xxx - 172.20.1.105(DNS name not found).
iLO 2 Date/Time:    06/13/2014 23:22:52 
Run Code Online (Sandbox Code Playgroud)

eww*_*ite 7

您可能有硬件问题。这不是VMware ESXi 的问题。

  • 您使用的是哪个版本的 ESXi?
  • 服务器硬件/BIOS 的固件版本是什么?
  • 您提到的另一台 ESXi 主机是否由相同的硬件组成?

最好的办法是检查服务器的HP 集成管理日志(IML)。您可以通过ILO 2界面执行此操作。

  • 登录到 ILO,检查硬件系统状态选项卡。该主要摘要屏幕可能会告诉您出了什么问题。
  • 此外,请查看“系统状态”选项卡下的 IML 选项。这将告诉您服务器崩溃的原因。

就这样。您可能在此处遇到 RAM、CPU 或系统板问题。

在此处输入图片说明


编辑:请更新您主机的固件!!- 不要成为统计数据

适用于您系统的当前可引导固件 DVD的下载位于此处。请用它启动您的系统并让它更新所有组件。该服务器上的所有内容看起来都可以追溯到 2008 年。使用 HP 服务器硬件时,这是一个大禁忌。

  • @JamesRyan HP IML 日志不会*“撒谎”*。此服务器上的内部运行状况 LED 不是由软件触发的,并且在 ESXi 下 ProLiant 硬件特有的驱动程序集没有任何问题 - 这是直接的 Broadcom tg3、英特尔芯片组和 HP CCISS 阵列组件。这些是具有深厚安装基础的可靠驱动程序。内部的错误不会导致 OP 所看到的。 (3认同)
  • `o/~你知道我的日志不会说谎而且我开始觉得它是对的 o/~` (2认同)