Linux NFS hangs after about 15 minutes

Mac*_*lin 1 linux nfs linux-kernel

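I'm fairly new to administering Linux and NFS, so please bear with me.

We are trying to set up a small cluster at work. At the moment the system consists of just two high-end DELL workstations running CentOS 6.5. To make managing users and files easier, we decided to share the /home directory and four files from /etc (passwd, group, shadow, gshadow) over NFS. (This was done by moving the files into a subdirectory and using links to put them back in /etc.)

These are shared in /etc/exports on the server like this: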

/home/  x.x.x.0/24(rw,sync,no_root_squash,no_all_squash)  
/etc/sub_dir/   x.x.x.0/24(rw,sync,no_root_squash,no_all_squash)  

They are mounted in /etc/fstab on the client:

server_name:/home/          /home/          nfs rw,sync,hard,intr 0 0
server_name:/etc/sub_dir/           /etc/sub_dir/           nfs rw,sync,hard,intr 0 0

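With all this set up, the system ran for about a month. As long as the server was up when the client booted, all of the server's users were visible on the client, and so were all the files.

However, about 5 days ago it started acting up. After the client boots, it works fine for about 15 minutes (give or take). Users can log in locally or via SSH. After those first 15 minutes, the system locks up almost completely. New users cannot log in, and users who are already logged in cannot do anything. (Basic things like moving the mouse still work.) The only way to get the system running again is to power the client off and back on. Unfortunately, this also makes debugging on the client very difficult.

We have narrowed the problem down to the NFS sharing of the files mentioned above. (We know this because disabling the mounts in /etc/fstab lets the client fall back to its own local files, and everything works fine.)

Our best theory so far is that the system boots, mounts everything, and runs. Then the connection drops, and the next time the client needs to access a file (e.g. passwd), it cannot find it and the system hangs waiting for the connection.

The computers are on the same 1000 Mbps switch, which is fairly lightly loaded.

Any help would be greatly appreciated.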

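Update

I've been doing some more digging. I found a similar question here on Server Fault, but it was not resolved either.
I also tried switching to UDP, but that did not fix the problem.
I came across articles explaining how to find and fix stale NFS handles (but I'm not sure that is the problem here).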

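Update

I managed to get a log from /var/log/messages on the client (captured during the few minutes it was operational).
Looking through it, I found a repeating pattern of nfsidmap being "blocked", followed by a call trace with a lot of "[nfs]" entries.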

Aug  7 14:17:01 computer-name kernel: INFO: task crond:10578 blocked for more than 120 seconds.
Aug  7 14:17:01 computer-name kernel:      Tainted: P           ---------------    2.6.32-431.20.3.el6.x86_64 #1
Aug  7 14:17:01 computer-name kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  7 14:17:01 computer-name kernel: crond         D 000000000000000e     0 10578      1 0x00000080
Aug  7 14:17:01 computer-name kernel: ffff880a5cf0b148 0000000000000082 0000000000000000 ffffffff81059096
Aug  7 14:17:01 computer-name kernel: ffff880a5cf0b0d8 ffff880a5f77eaa0 ffff880a5cf0b0d8 ffffffff8105559d
Aug  7 14:17:01 computer-name kernel: ffff880a555a5098 ffff880a5cf0bfd8 000000000000fbc8 ffff880a555a5098
Aug  7 14:17:01 computer-name kernel: Call Trace:
Aug  7 14:17:01 computer-name kernel: [<ffffffff81059096>] ? enqueue_task+0x66/0x80
Aug  7 14:17:01 computer-name kernel: [<ffffffff8105559d>] ? check_preempt_curr+0x6d/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff815296d5>] schedule_timeout+0x215/0x2e0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109afb6>] ? autoremove_wake_function+0x16/0x40
Aug  7 14:17:01 computer-name kernel: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529353>] wait_for_common+0x123/0x180
Aug  7 14:17:01 computer-name kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81095211>] ? __queue_work+0x41/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffff8152946d>] wait_for_completion+0x1d/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109386c>] call_usermodehelper_exec+0x10c/0x120
Aug  7 14:17:01 computer-name kernel: [<ffffffff812246ae>] call_sbin_request_key+0x24e/0x2f0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8121eb03>] ? key_instantiate_and_link+0xa3/0xb0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1060030>] ? nfs4_callback_layoutrecall+0x30/0x90 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff812241e5>] request_key_and_link+0x315/0x3d0
Aug  7 14:17:01 computer-name kernel: [<ffffffff812243b0>] request_key+0x50/0xa0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105cb65>] nfs_idmap_request_key+0xc5/0x170 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d194>] nfs_idmap_lookup_id+0x34/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d5d5>] nfs_map_name_to_uid+0x75/0xa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057504>] decode_getfattr_attrs+0xf64/0xfa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10575c3>] decode_getfattr_generic.clone.0+0x83/0xe0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057ce0>] nfs4_xdr_dec_access+0xb0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f90fc4>] rpcauth_unwrap_resp+0x84/0xb0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85923>] call_decode+0x1b3/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85770>] ? call_decode+0x0/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f677>] __rpc_execute+0x77/0x350 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f9b1>] rpc_execute+0x61/0xa0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f863a5>] rpc_run_task+0x75/0x90 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f864c2>] rpc_call_sync+0x42/0x70 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104ba9e>] _nfs4_call_sync+0x3e/0x40 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a7cc>] _nfs4_proc_access+0x11c/0x1a0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a89b>] nfs4_proc_access+0x4b/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa102658c>] nfs_do_access+0x19c/0x240 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f92625>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f915f0>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10266d8>] nfs_permission+0xa8/0x1e0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff81198e93>] __link_path_walk+0xb3/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff81199abf>] __link_path_walk+0xcdf/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a09a>] path_walk+0x6a/0xe0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a2ab>] filename_lookup+0x6b/0xc0
Aug  7 14:17:01 computer-name kernel: [<ffffffff81226c26>] ? security_file_alloc+0x16/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119b784>] do_filp_open+0x104/0xd20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8128f70a>] ? strncpy_from_user+0x4a/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a8a62>] ? alloc_fd+0x92/0x160
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185ba9>] do_sys_open+0x69/0x140
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185cc0>] sys_open+0x20/0x30
Aug  7 14:17:01 computer-name kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Aug  7 14:17:01 computer-name kernel: INFO: task nfsidmap:13767 blocked for more than 120 seconds.
Aug  7 14:17:01 computer-name kernel:      Tainted: P           ---------------    2.6.32-431.20.3.el6.x86_64 #1
Aug  7 14:17:01 computer-name kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  7 14:17:01 computer-name kernel: nfsidmap      D 000000000000000e     0 13767  13766 0x00000080
Aug  7 14:17:01 computer-name kernel: ffff88145d1e93f8 0000000000000086 0000000000000000 ffff880a5bae6080
Aug  7 14:17:01 computer-name kernel: ffff88145d1e9378 ffffffff814b1d35 0000000053e3c1ad 0000000028930cb7
Aug  7 14:17:01 computer-name kernel: ffff88145139f058 ffff88145d1e9fd8 000000000000fbc8 ffff88145139f058
Aug  7 14:17:01 computer-name kernel: Call Trace:
Aug  7 14:17:01 computer-name kernel: [<ffffffff814b1d35>] ? tcp_event_new_data_sent+0xb5/0x110
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d90>] ? key_wait_bit+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d9e>] key_wait_bit+0xe/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529a8f>] __wait_on_bit+0x5f/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d90>] ? key_wait_bit+0x0/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff81529b38>] out_of_line_wait_on_bit+0x78/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffff81223d7e>] wait_for_key_construction+0x6e/0x80
Aug  7 14:17:01 computer-name kernel: [<ffffffff812243c5>] request_key+0x65/0xa0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105cb65>] nfs_idmap_request_key+0xc5/0x170 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d194>] nfs_idmap_lookup_id+0x34/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa105d5d5>] nfs_map_name_to_uid+0x75/0xa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057504>] decode_getfattr_attrs+0xf64/0xfa0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10575c3>] decode_getfattr_generic.clone.0+0x83/0xe0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057ce0>] nfs4_xdr_dec_access+0xb0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f90fc4>] rpcauth_unwrap_resp+0x84/0xb0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa1057c30>] ? nfs4_xdr_dec_access+0x0/0xc0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85923>] call_decode+0x1b3/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109b020>] ? wake_bit_function+0x0/0x50
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f85770>] ? call_decode+0x0/0x800 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f677>] __rpc_execute+0x77/0x350 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffff8109ae27>] ? bit_waitqueue+0x17/0xd0
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f8f9b1>] rpc_execute+0x61/0xa0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f863a5>] rpc_run_task+0x75/0x90 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f864c2>] rpc_call_sync+0x42/0x70 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104ba9e>] _nfs4_call_sync+0x3e/0x40 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a7cc>] _nfs4_proc_access+0x11c/0x1a0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa104a89b>] nfs4_proc_access+0x4b/0x80 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa102658c>] nfs_do_access+0x19c/0x240 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f92625>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa0f915f0>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
Aug  7 14:17:01 computer-name kernel: [<ffffffffa10266d8>] nfs_permission+0xa8/0x1e0 [nfs]
Aug  7 14:17:01 computer-name kernel: [<ffffffff81198e93>] __link_path_walk+0xb3/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff81199abf>] __link_path_walk+0xcdf/0x1000
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a09a>] path_walk+0x6a/0xe0
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119a2ab>] filename_lookup+0x6b/0xc0
Aug  7 14:17:01 computer-name kernel: [<ffffffff81226c26>] ? security_file_alloc+0x16/0x20
Aug  7 14:17:01 computer-name kernel: [<ffffffff8119b784>] do_filp_open+0x104/0xd20
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a27e8>] ? d_free+0x58/0x60
Aug  7 14:17:01 computer-name kernel: [<ffffffff8128f70a>] ? strncpy_from_user+0x4a/0x90
Aug  7 14:17:01 computer-name kernel: [<ffffffff811a8a62>] ? alloc_fd+0x92/0x160
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185ba9>] do_sys_open+0x69/0x140
Aug  7 14:17:01 computer-name kernel: [<ffffffff81185cc0>] sys_open+0x20/0x30
Aug  7 14:17:01 computer-name kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

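(Sorry for the length; I don't know which parts are useful.)
This pattern repeats every two minutes.

According to this and this, the message indicates some kind of resource starvation. However, the client is mostly idle.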
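
To see which tasks the hung-task detector is reporting, and how often, the "blocked for more than" lines can be tallied per task name. The following is only a minimal sketch: the function name `hung_task_summary` is made up for illustration, and it assumes the log lines have the same format as the excerpt above.

```shell
# Count hung-task reports per task name.
# Reads syslog-style text on stdin and prints one "<count> <task>" line per task.
# Assumes lines of the form "... INFO: task NAME:PID blocked for more than ...".
hung_task_summary() {
    sed -n 's/.*INFO: task \([^:]*\):[0-9]* blocked for more than.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'   # normalise the padding that uniq -c adds
}
```

It could be run as `hung_task_summary < /var/log/messages`. If nfsidmap and the processes waiting on it dominate the output, that points at the NFSv4 ID-mapping path seen in the call traces rather than at CPU or disk load.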

Gio*_*oni 5

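The kernel message indicates that a process has been blocked, unable to be scheduled to run, for more than 120 seconds. The cause is either extremely high CPU usage or contention at the I/O level.

I would advise against using NFS to share system-critical files such as /etc/passwd, even through symlinks, because NFS operations themselves depend heavily on them. You could consider setting up a script that transfers the files via SCP and overwrites the current ones, but then you have to work out the logic for deciding which server holds the up-to-date files.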
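
As a rough illustration of the SCP idea, here is a minimal sketch that pushes the four account files from one designated master to a client. The function name `push_account_files`, the `DRY_RUN` switch, and the use of root over SSH are assumptions made for the example, not something from the question. It sidesteps the "which server is newer" problem by treating one host as the only place where accounts are edited.

```shell
# Copy the four account files from this host (assumed to be the single
# master) to the host given as $1, overwriting the copies there.
# Set DRY_RUN=1 to print the scp commands instead of running them.
push_account_files() {
    for f in passwd group shadow gshadow; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp -p /etc/$f root@$1:/etc/$f"
        else
            # -p preserves modification times and file modes
            scp -p "/etc/$f" "root@$1:/etc/$f" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 push_account_files client_name` first shows what would be copied. Overwriting /etc/passwd in place is not atomic, so a real script would probably copy to a temporary name on the client and rename it afterwards.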

The long-term solution is to adopt LDAP.

Edit: Based on additional information provided in the comments, changing from NFSv4 to NFSv3 is an alternative.
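
The call traces in the question pass through nfs4_* and nfs_idmap_* functions, so the mounts appear to be using NFSv4 even though the fstab type is plain `nfs`. NFSv3 sends numeric UIDs and GIDs and does not go through the nfsidmap upcall, and the `nfsvers=3` mount option pins a mount to that version. The sketch below adds the option to the nfs entries of an fstab-style file. The function name `force_nfsv3` is made up for the example, and it assumes whitespace-separated fields in the usual fstab order.

```shell
# Append nfsvers=3 to the options (4th field) of every entry whose type
# (3rd field) is "nfs", unless a version option is already present.
# Reads fstab-style text on stdin and writes the result to stdout.
# Modified lines are re-joined with single spaces; other lines pass through unchanged.
force_nfsv3() {
    awk '$1 !~ /^#/ && $3 == "nfs" && $4 !~ /(^|,)(nfsvers|vers)=/ {
             $4 = $4 ",nfsvers=3"
         }
         { print }'
}
```

For example, `force_nfsv3 < /etc/fstab > /tmp/fstab.new` writes a candidate file that can be reviewed before it replaces the original. The shares then need to be unmounted and mounted again (or the client rebooted) for the new version to take effect.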