/proc/drbd:
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
That is, until I try to mount the volume:
mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.
/var/log/kern.log
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
Here is the kernel log on node 0 (192.168.3.145):
kernel: : (swapper,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107
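To pull the rejected peer addresses out of a longer kernel log, something like the following may help. It is only a sketch: `unknown_o2net_peers` is a made-up helper name, and the pattern assumes the message wording stays exactly as in the log excerpt above.

```shell
# unknown_o2net_peers (hypothetical helper): read kernel log lines on stdin and
# print each distinct address that o2net reported as an "unknown node".
unknown_o2net_peers() {
    sed -n 's/.*attempt to connect from unknown node at \([0-9.]*\) *:[0-9]*.*/\1/p' | sort -u
}

# Possible usage (log path as used in this post):
#   unknown_o2net_peers < /var/log/kern.log
```

Any address printed here is one that the receiving node does not have in its in-kernel node list, whatever cluster.conf on disk says.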
I am sure that /etc/ocfs2/cluster.conf is identical on both nodes:
/etc/ocfs2/cluster.conf
node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
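A quick way to compare the file between nodes is to reduce it to one line per node and diff the results. The sketch below assumes the stanza layout shown in this post (`node:` and `cluster:` headers followed by `key = value` lines); `list_o2cb_nodes` is a hypothetical helper, not part of ocfs2-tools.

```shell
# list_o2cb_nodes FILE (hypothetical helper): print "number name ip_address:ip_port"
# for every node: stanza in an O2CB cluster.conf.
list_o2cb_nodes() {
    awk '
        function emit() {
            if (in_node) print num, name, ip ":" port
            in_node = 0
        }
        # A new node: stanza closes the previous one and starts collecting again.
        /^[ \t]*node:[ \t]*$/    { emit(); in_node = 1; num = name = ip = port = ""; next }
        # Any other stanza header (for example cluster:) only closes the previous one.
        /^[ \t]*[a-z_]+:[ \t]*$/ { emit(); next }
        in_node && /=/ {
            if ($1 == "number")     num  = $3
            if ($1 == "name")       name = $3
            if ($1 == "ip_address") ip   = $3
            if ($1 == "ip_port")    port = $3
        }
        END { emit() }
    ' "$1"
}

# Possible usage:
#   list_o2cb_nodes /etc/ocfs2/cluster.conf
```

Running it on each node and comparing the output shows at a glance whether the two files describe the same members.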
They can connect to each other fine:
# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!
But the O2CB heartbeat is not active on the new node (192.168.2.93):
/etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Here is the result of running tcpdump on node 1 while starting ocfs2 on node 1:
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
The RST flag is sent on every 6th packet.
What else can I do to debug this case?
PS:
OCFS2 versions on node 0:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5
OCFS2 versions on node 1:
> ocfs2-tools-1.4.4-1.el5
> ocfs2-2.6.18-308.el5-1.4.7-1.el5
UPDATE 1 – Sun Dec 23 18:15:07 ICT 2012
Are both nodes on the same lan segment? No routers etc.?
No, they are 2 VMware servers on different subnets.
Oh, while I remember – hostnames/DNS all set up and working correctly?
Sure, I added each node's hostname and IP address to /etc/hosts:
192.168.2.93 SVR022-293.localdomain
192.168.3.145 SVR233NTC-3145.localdomain
And they can connect to each other by hostname:
# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!
# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
UPDATE 2 – Mon Dec 24 18:32:15 ICT 2012
Found a clue: my colleague manually edited /etc/ocfs2/cluster.conf while the cluster was running. As a result, the kernel still keeps the dead node's information in /sys/kernel/config/cluster/:
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain
(SVR150-4107.localdomain in this case)
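The mismatch between the file on disk and what the kernel has loaded can be checked mechanically. The following is a sketch only: `stale_o2cb_nodes` is a hypothetical helper, and it assumes the cluster.conf stanza layout and the configfs directory layout shown in this post.

```shell
# stale_o2cb_nodes CONF NODE_DIR (hypothetical helper): print every node name that
# is registered as a directory under NODE_DIR (configfs) but has no matching
# "name = ..." entry inside a node: stanza of CONF.
stale_o2cb_nodes() {
    conf=$1
    node_dir=$2
    for d in "$node_dir"/*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        n=$(basename "$d")
        if ! awk -v want="$n" '
                /^[ \t]*node:[ \t]*$/    { in_node = 1; next }
                /^[ \t]*[a-z_]+:[ \t]*$/ { in_node = 0; next }
                in_node && $1 == "name" && $3 == want { found = 1 }
                END { exit (found ? 0 : 1) }
            ' "$conf"; then
            printf '%s\n' "$n"
        fi
    done
}

# Possible usage, with the paths from this post:
#   stale_o2cb_nodes /etc/ocfs2/cluster.conf /sys/kernel/config/cluster/cpc/node
```

With the listing above and the two-node cluster.conf from the question, this would print SVR150-4107.localdomain, the node that was edited out of the file while the cluster was online.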
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
I am sure the ocfs2 service has already been stopped:
# mounted.ocfs2 -f
Device      FS     Nodes
/dev/sdb    ocfs2  Not mounted
/dev/drbd1  ocfs2  Not mounted
And there are no references left:
# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs
I also unloaded the ocfs2 kernel module to be sure:
# ps -ef | grep [o]cfs2
root 12513 43 0 18:25 ? 00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
But nothing changed:
# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active
So the final question is: how can I delete the dead node information without rebooting?
UPDATE 3 – Mon Dec 24 22:41:51 ICT 2012
Here are all the running heartbeat threads:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2
Reference count of this heartbeat region:
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs
Trying to kill it:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
Any ideas?
Solution
Notice the UUID:
# mounted.ocfs2 -d
Device      FS     Stack  UUID                              Label
/dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
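But:
# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2
This probably happened because I "accidentally" force-reformatted the OCFS2 volume. The problem I am facing is similar to this thread on the Ocfs2-users mailing list.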
This is also the reason for the following error:
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
because ocfs2_hb_ctl cannot find a device with the UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
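The orphaned region can be spotted by checking each directory name under the heartbeat path against the UUID column of `mounted.ocfs2 -d`. A minimal sketch, assuming the five-column listing shown earlier in this answer (UUID in the fourth column); `uuid_has_device` is a hypothetical helper name.

```shell
# uuid_has_device UUID (hypothetical helper): read `mounted.ocfs2 -d` output on
# stdin and succeed only if some device row carries the given UUID.
uuid_has_device() {
    awk -v want="$1" 'NR > 1 && $4 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

# Possible usage:
#   mounted.ocfs2 -d | uuid_has_device 72EF09EA3D0D4F51BDC00B47432B1EB2 \
#       || echo "no device carries this heartbeat UUID"
```

A heartbeat region whose UUID fails this check cannot be stopped by UUID, which matches the lookup error above.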
An idea came to mind: could I change the UUID of the OCFS2 volume?
Looking through the tunefs.ocfs2 man page:
Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]
So I ran the following command:
# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. Having two OCFS2 file systems with the same UUID could, in the least, cause erratic behavior, and if unlucky, cause file system damage. Please choose the UUID with care.
Update the UUID ?yes
Verify:
# tunefs.ocfs2 -Q "%U\n" /dev/drbd1
72EF09EA3D0D4F51BDC00B47432B1EB2
Trying to kill the heartbeat region again to see what happens:
# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
I kept killing until I saw 0 refs, then took the cluster offline:
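The repeated kill can be scripted. This is a sketch under assumptions, not a tested procedure: it relies on `ocfs2_hb_ctl -I` printing `UUID: N refs` and on each `-K` dropping one reference, as observed above; `hb_refs` and `drain_hb_region` are hypothetical helpers, and the limit of 64 attempts is arbitrary.

```shell
# hb_refs (hypothetical helper): extract N from a "UUID: N refs" line on stdin.
hb_refs() {
    sed -n 's/^[0-9A-Fa-f]*: \([0-9][0-9]*\) refs$/\1/p'
}

# drain_hb_region UUID (hypothetical helper): drop references one at a time until
# the region reports 0 refs. Fails on unparseable output, on a failed kill, or
# after 64 attempts, so it cannot loop forever.
drain_hb_region() {
    uuid=$1
    tries=0
    while [ "$tries" -lt 64 ]; do
        refs=$(ocfs2_hb_ctl -I -u "$uuid" | hb_refs)
        [ -n "$refs" ] || return 1
        [ "$refs" -gt 0 ] || return 0
        ocfs2_hb_ctl -K -u "$uuid" || return 1
        tries=$((tries + 1))
    done
    return 1
}

# Possible usage (only on a volume that is unmounted everywhere):
#   drain_hb_region 72EF09EA3D0D4F51BDC00B47432B1EB2 && /etc/init.d/o2cb offline cpc
```

Dropping heartbeat references by hand is only reasonable here because nothing has the volume mounted; on a live cluster it would pull the heartbeat out from under active mounts.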
# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK
And stop it:
# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
Start it again to see whether the new node has been picked up:
# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain
OK, on the peer node (192.168.2.93), try to start OCFS2:
# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) [ OK ]
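Thanks to Sunil Mushran, because this thread helped me solve the problem.
The lessons are:
> The IP address, port, … can only be changed when the cluster is offline. See the FAQ.
> Never force-reformat an OCFS2 volume.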