Tags: linux, cluster, high-availability, drbd, ocfs2
I have replaced a dead node in a dual-primary DRBD setup running OCFS2. All steps worked:
/proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Until I tried to mount the volume:
mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
/var/log/kern.log
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
And here are the kernel logs on node 0 (192.168.3.145):
kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
I have made sure that /etc/ocfs2/cluster.conf is identical on both nodes:
/etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.3.145
        number = 0
        name = SVR233NTC-3145.localdomain
        cluster = cpc

node:
        ip_port = 7777
        ip_address = 192.168.2.93
        number = 1
        name = SVR022-293.localdomain
        cluster = cpc

cluster:
        node_count = 2
        name = cpc
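Since an inconsistent cluster.conf is a classic cause of the o2net "unknown node" error above, it is worth checking mechanically that `node_count` matches the number of `node:` stanzas. A minimal sketch, parsing a here-doc copy of the file shown above (on a real node, point `CONF` at /etc/ocfs2/cluster.conf instead):

```shell
#!/bin/sh
# Sketch: verify that node_count in cluster.conf matches the number
# of "node:" stanzas. Works on a temp copy of the file shown above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
node:
        ip_port = 7777
        ip_address = 192.168.3.145
        number = 0
        name = SVR233NTC-3145.localdomain
        cluster = cpc
node:
        ip_port = 7777
        ip_address = 192.168.2.93
        number = 1
        name = SVR022-293.localdomain
        cluster = cpc
cluster:
        node_count = 2
        name = cpc
EOF
stanzas=$(grep -c '^node:' "$CONF")          # actual node stanzas
declared=$(awk '/node_count/ { print $3 }' "$CONF")  # declared count
if [ "$stanzas" = "$declared" ]; then
    echo "OK: $stanzas node stanzas, node_count = $declared"
else
    echo "MISMATCH: $stanzas node stanzas, node_count = $declared"
fi
rm -f "$CONF"
```

Running the same check on both nodes (and diffing the two files) rules this class of problem out quickly.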
They connect to each other fine:
# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
But the O2CB heartbeat is not active on the new node (192.168.2.93):
/etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Here is the result of running tcpdump on node 0 while starting ocfs2 on node 1:
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
The RST flag is sent after every 6th packet.

What else can I do to debug this case?
PS:
OCFS2 version on node 0:
OCFS2 version on node 1:
Update 1 - Sun Dec 23 18:15:07 ICT 2012

Are both nodes on the same LAN segment? No routers or anything in between?

No, they are 2 VMware servers on different subnets.

Oh, while I remember - hostnames/DNS all set up and working correctly?

Sure, I added the hostname and IP address of each node to /etc/hosts:
192.168.2.93 SVR022-293.localdomain
192.168.3.145 SVR233NTC-3145.localdomain
They can connect to each other by hostname as well:
# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!
# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
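These nc tests only prove the path toward each listener from wherever they are run; o2net needs port 7777 reachable in both directions between the nodes. A small probe helper using bash's /dev/tcp pseudo-device, so it works even where nc is not installed (the IPs are the two nodes from this post; the 2-second timeout is my own choice to keep filtered ports from hanging the check):

```shell
#!/bin/bash
# Probe a TCP port via bash's /dev/tcp. Prints "open" or
# "closed/unreachable"; a timeout keeps filtered ports from hanging.
check_port() {
    host=$1 port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed/unreachable"
    fi
}

# Run this on BOTH nodes, each probing the peer's address:
check_port 192.168.3.145 7777
check_port 192.168.2.93 7777
```

If either direction reports closed/unreachable, a firewall or routing rule between the subnets is the next suspect.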
Update 2 - Mon Dec 24 18:32:15 ICT 2012

Found a clue: my colleague manually edited the /etc/ocfs2/cluster.conf file while the cluster was running. So it still keeps the dead node's information in /sys/kernel/config/cluster/:
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
(SVR150-4107.localdomain is the dead node in this case.)
I tried to stop the cluster to remove the dead node, but got the following error:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
I'm sure the ocfs2 service is already stopped:
# mounted.ocfs2 -f
Device FS Nodes
/dev/sdb ocfs2 Not mounted
/dev/drbd1 ocfs2 Not mounted
And there are no references left:
# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs
I also unloaded the ocfs2 kernel module to make sure:
# ps -ef | grep [o]cfs2
root 12513 43 0 18:25 ? 00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
But nothing changed:
# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
So the final question is: how can I remove the dead node's information without rebooting?
Update 3 - Mon Dec 24 22:41:51 ICT 2012

Here are all the currently running heartbeat regions:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
The reference count of this heartbeat region:
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
Trying to kill it:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
Any ideas?

Oh yeah! Problem solved.

Pay attention to the UUID:
# mounted.ocfs2 -d
Device FS Stack UUID Label
/dev/sdb ocfs2 o2cb 12963EAF4E16484DB81ECB0251177C26 ocfs2_drbd1
/dev/drbd1 ocfs2 o2cb 12963EAF4E16484DB81ECB0251177C26 ocfs2_drbd1
But:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2
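The two listings disagree: mounted.ocfs2 reports one UUID for the filesystem, while the live heartbeat region carries another. That mismatch is easy to check mechanically; a sketch with the values hard-coded from the output above (on a live system they would come from `mounted.ocfs2 -d` and `ls /sys/kernel/config/cluster/cpc/heartbeat/`):

```shell
#!/bin/sh
# Compare the filesystem UUID reported by mounted.ocfs2 -d with the
# heartbeat region directory name under configfs. Values taken from
# the listings above.
fs_uuid="12963EAF4E16484DB81ECB0251177C26"
hb_region="72EF09EA3D0D4F51BDC00B47432B1EB2"
if [ "$fs_uuid" = "$hb_region" ]; then
    echo "UUIDs match"
else
    echo "MISMATCH: filesystem=$fs_uuid heartbeat=$hb_region"
fi
```

The mismatch is the smoking gun: the kernel is heartbeating for a filesystem UUID that no longer exists on any device.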
This happened because I "accidentally" force-reformatted the OCFS2 volume. The problem I was facing is similar to this one on the Ocfs2-users mailing list.

This is also the reason for the error:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find the device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
An idea came to mind: can I change the UUID of the OCFS2 volume?

Browsing the tunefs.ocfs2 man page:
Usage: tunefs.ocfs2 [options] <device> [new-size]
tunefs.ocfs2 -h|--help
tunefs.ocfs2 -V|--version
[options] can be any mix of:
-U|--uuid-reset[=new-uuid]
So I ran the following command:
# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system.
Having two OCFS2 file systems with the same UUID could, in the least,
cause erratic behavior, and if unlucky, cause file system damage.
Please choose the UUID with care.
Update the UUID ?yes
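Reusing the orphaned region's UUID, as above, is only a trick to let ocfs2_hb_ctl find the region again; per the usage text quoted above, `--uuid-reset` takes an optional value, and run without one it generates a fresh random UUID itself. If you need the value up front (for example to record it before the reset), a sketch generating one in the 32-uppercase-hex-digit form OCFS2 prints:

```shell
#!/bin/sh
# Generate a UUID in the 32-uppercase-hex-digit form shown in the
# tunefs.ocfs2/mounted.ocfs2 output above, suitable for
# `tunefs.ocfs2 --uuid-reset=<uuid> <device>`.
new_uuid=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n' | tr 'a-f' 'A-F')
echo "$new_uuid"
```

Note that, as tunefs.ocfs2 warns, two OCFS2 filesystems sharing a UUID can corrupt each other, so a deliberately duplicated UUID like the one used here should only ever be a transient repair step.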
Verify:
# tunefs.ocfs2 -Q "%U\n" /dev/drbd1
72EF09EA3D0D4F51BDC00B47432B1EB2
Try to kill the heartbeat region again and see what happens:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
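From here the fix is mechanical: call `ocfs2_hb_ctl -K` repeatedly and re-check `-I` until the reference count reaches zero. A sketch of that loop, with ocfs2_hb_ctl stubbed out as a shell function (starting from the 7 refs seen above) so the control flow can run anywhere; on a real node, delete the stub and let the loop drive the real binary:

```shell
#!/bin/sh
# Loop calling `ocfs2_hb_ctl -K -u <uuid>` until `-I` reports 0 refs.
# The function below is a STUB standing in for the real binary, so the
# logic can be exercised anywhere; remove it on a real node.
refs=7
ocfs2_hb_ctl() {
    case "$1" in
        -K) refs=$((refs - 1)) ;;               # kill one reference
        -I) echo "$3: $refs refs" ;;            # report current count
    esac
}

UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
while true; do
    n=$(ocfs2_hb_ctl -I -u "$UUID" | awk '{print $(NF-1)}')
    echo "refs: $n"
    [ "$n" -eq 0 ] && break
    ocfs2_hb_ctl -K -u "$UUID"
done
```

The awk expression pulls the count out of the `<uuid>: N refs` line, which is the same format shown in the outputs above.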
Kept killing it until I saw 0 refs, then took the cluster offline:
# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK
And stopped it:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
Restarted it to see whether the node information got updated:
# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain
OK, now on the peer node (192.168.2.93), try to start OCFS2:
# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) [ OK ]
Thanks to Sunil Mushran, because his thread helped me solve this problem.

The lesson learned: never edit /etc/ocfs2/cluster.conf manually while the cluster is running.