DRM has too many bugs, so the recommendation is to disable it outright.
Alert log:
Errors in file /oracle/app/oracle/diag/rdbms/gg/gg1/trace/gg1_lmon_60688126.trc:
ORA-29702: error occurred in Cluster Group Service operation
No connectivity to other instances in the cluster during startup. Hence, LMON is terminating the instance. Please check the LMON trace file for details. Also, please check the network logs of this instance along with clusterwide network health for problems and then re-start this instance.
LMON (ospid: 63112654): terminating the instance
Dumping diagnostic data in directory=[cdmp_20170814161033], requested by (instance=1, osid=63112654 (LMON)), summary=[abnormal instance termination].
Instance terminated by LMON, pid = 63112654
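LMON: The LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster. When a node fails, LMON is responsible for cluster reconfiguration, GRD recovery, and related operations. The service it provides is called CGS (Cluster Group Services). LMON can cooperate with the underlying clusterware or work on its own. When LMON detects an instance-level split brain, it notifies the underlying clusterware and expects the clusterware to resolve it. RAC does not assume that the clusterware will always succeed, however, so LMON does not wait indefinitely for the clusterware's result. If the wait times out, LMON automatically triggers IMR (Instance Membership Recovery). IMR can be regarded as the split-brain handling and I/O fencing mechanism that Oracle provides at the database layer.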
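LMON relies mainly on two heartbeat mechanisms for health checking:
1. The network heartbeat between nodes.
2. The disk heartbeat in the control file. The CKPT process on each node updates one data block of the control file every 3 seconds. This activity can be observed through x$kcccp: SQL> select inst_id, cphbt from x$kcccp;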
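The disk heartbeat idea above can be illustrated with a small sketch: take two samples of the heartbeat counter more than 3 seconds apart and flag any entry that did not advance. This is only an illustration of the principle, not how LMON or CKPT implement it internally. The function name, the row keys, and the sample values are hypothetical; in practice the values would come from two runs of the x$kcccp query.

```python
from typing import Dict, List

# From the text: CKPT updates its control file block every 3 seconds,
# so two samples should be taken more than this many seconds apart.
HEARTBEAT_INTERVAL_S = 3


def stalled_heartbeats(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return the keys whose heartbeat counter did not advance between samples."""
    stalled = []
    for key, old_value in before.items():
        new_value = after.get(key)
        # A missing row or a counter that has not increased counts as stalled.
        if new_value is None or new_value <= old_value:
            stalled.append(key)
    return sorted(stalled)


if __name__ == "__main__":
    # Invented sample values (hypothetical row keys and cphbt numbers).
    first = {"row_1": 1000, "row_2": 2000}
    second = {"row_1": 1002, "row_2": 2000}
    print(stalled_heartbeats(first, second))
```

With the invented samples above, only row_2 is reported, because its counter is unchanged between the two snapshots.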
gg1_lmon_60688126.trc:
2017-08-14 16:07:40.381460 : kjfspseudorcfg: requested with reason 5(DRM Quiesce step stall)
* kjfcln: DRM aborted due to CGS rcfg.
*** 2017-08-14 16:07:44.621
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a50 rcfgtm 5 sec
*** 2017-08-14 16:07:49.605
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a55 rcfgtm 10 sec
*** 2017-08-14 16:07:54.581
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a5a rcfgtm 15 sec
............................................................................
*** 2017-08-14 16:08:59.675
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a9b rcfgtm 80 sec
*** 2017-08-14 16:09:04.694
=====================================================
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915aa0 rcfgtm 85 sec
kjxgmpoll: the CGS reconfiguration has spent 85 seconds.
kjxgmpoll: terminate the CGS reconfig.
Error: Cluster Group Service reconfiguration takes too long
LMON caught an error 29702 in the main loop
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
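The CGS reconfiguration itself was triggered by the failed DRM operation, as the first trace line shows (reason 5, DRM Quiesce step stall).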
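The kjxgmpoll lines in the trace can be cross-checked with a short script. The sketch below assumes that start and cur are hexadecimal counters in seconds; this is an inference from the trace itself (their difference equals the reported rcfgtm on every line shown), not documented behavior. The function name and the regular expression are written for this example only.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a50 rcfgtm 5 sec
KJXGMPOLL_RE = re.compile(
    r"kjxgmpoll: CGS state \(\d+ \d+\) "
    r"start (0x[0-9a-fA-F]+) cur (0x[0-9a-fA-F]+) rcfgtm (\d+) sec"
)


def reconfig_durations(trace_text: str) -> List[Tuple[int, int]]:
    """Return (cur - start, reported rcfgtm) for each kjxgmpoll state line."""
    results = []
    for match in KJXGMPOLL_RE.finditer(trace_text):
        start = int(match.group(1), 16)  # assumed: seconds, hexadecimal
        cur = int(match.group(2), 16)    # assumed: seconds, hexadecimal
        results.append((cur - start, int(match.group(3))))
    return results


# Two lines copied from the trace above.
SAMPLE = """\
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915a50 rcfgtm 5 sec
kjxgmpoll: CGS state (20 1) start 0x59915a4b cur 0x59915aa0 rcfgtm 85 sec
"""

if __name__ == "__main__":
    for computed, reported in reconfig_durations(SAMPLE):
        print(computed, reported)
```

For the last line, 0x59915aa0 - 0x59915a4b = 0x55 = 85, which matches the 85 seconds after which LMON gave up on the reconfiguration and raised ORA-29702.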