APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
PURPOSE
The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2.
SCOPE
The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.
DETAILS
Missed network heartbeat (NHB) evictions happen when ocssd on the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In a "split brain" node eviction, one node aborts itself to avoid "split brain" after communication over the interconnect has been compromised.
What does "split brain" mean?
"Split brain" means that there are two or more distinct sets of nodes, or "cohorts", with no communication between the cohorts.
For example:
Suppose there are 4 nodes named A, B, C and D, in the following situation:
* Nodes A and B can talk to each other; nodes C and D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.
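The grouping in this example can be sketched in Python. This is only an illustration of the idea, not how ocssd computes cohorts: it treats a cohort as a connected group of nodes, and the function name find_cohorts and its arguments are made up for this sketch.

```python
from collections import defaultdict

def find_cohorts(nodes, links):
    """Group nodes into cohorts (illustrative simplification).

    nodes: iterable of node names.
    links: iterable of (a, b) pairs meaning a and b can talk
           to each other over the interconnect.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    cohorts = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collects every node reachable from `start`.
        cohort, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in cohort:
                continue
            cohort.add(node)
            stack.extend(neighbours[node] - cohort)
        seen |= cohort
        cohorts.append(sorted(cohort))
    return cohorts

# The four-node example: A-B can talk, C-D can talk, nothing else.
print(find_cohorts(["A", "B", "C", "D"], [("A", "B"), ("C", "D")]))
# [['A', 'B'], ['C', 'D']]
```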
Why is this a problem?
In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption, so the split-brain must be resolved.
How does the clusterware resolve a "split brain" situation?
Oracle Clusterware handles the split-brain by identifying the LARGEST cohort and aborting all the nodes which do NOT belong to that cohort, that is, all the nodes in the SMALLER cohort(s).
If the cohorts are the same size, the cohort containing the lowest-numbered node survives.
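The selection rule can be written as a short Python sketch. It only restates the rule given in this note (largest cohort wins, ties go to the cohort holding the lowest node number); the function names are hypothetical and this is not the clusterware's actual code.

```python
def surviving_cohort(cohorts):
    """Return the cohort that survives a split-brain.

    cohorts: list of lists of node numbers, e.g. [[1], [2, 3, 4]].
    """
    # Larger cohort wins; on a tie in size, the cohort that
    # contains the lowest node number wins (hence -min(c)).
    return max(cohorts, key=lambda c: (len(c), -min(c)))

def evicted_nodes(cohorts):
    """Return the node numbers that are aborted."""
    survivor = surviving_cohort(cohorts)
    return sorted(n for c in cohorts if c is not survivor for n in c)

print(surviving_cohort([[1], [2, 3, 4]]))    # [2, 3, 4]
print(evicted_nodes([[1], [2, 3, 4]]))       # [1]
print(surviving_cohort([[3, 4], [1, 2]]))    # [1, 2]  (same size, node 1 is lowest)
```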
Identifying a split-brain eviction
In a split-brain node eviction, the following message is present in the ocssd log ($GRID_HOME/log/<hostname>/ocssd/ocssd.log) of the evicted node:
clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
And earlier in the same log, within the 10 minutes before the "clssnmCheckDskInfo: Aborting local node" message:
clssnmPollingThread: node %s (%n) at <X>% heartbeat fatal, removal in...
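This check can be automated with a small Python sketch. It assumes every relevant line starts with a timestamp in the form "YYYY-MM-DD HH:MM:SS.mmm:" (as in the cohort example in this note) and matches only the message fragments quoted above; the function name and the 10-minute default window are choices made for this sketch.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "2012-12-28 20:26:25.803:"
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")
ABORT = "clssnmCheckDskInfo: Aborting local node to avoid splitbrain"
NHB = re.compile(r"clssnmPollingThread: node .* heartbeat fatal")

def _timestamp(line):
    m = TS.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None

def find_split_brain_evictions(lines, window=timedelta(minutes=10)):
    """Return (abort_line, preceding_heartbeat_lines) pairs.

    preceding_heartbeat_lines holds the "heartbeat fatal" messages
    logged no more than `window` before the abort message.
    """
    nhb_hits = []   # (timestamp, line) of heartbeat-fatal messages so far
    results = []
    for line in lines:
        line = line.rstrip("\n")
        ts = _timestamp(line)
        if ts is None:
            continue  # continuation line or unexpected format
        if NHB.search(line):
            nhb_hits.append((ts, line))
        elif ABORT in line:
            recent = [l for t, l in nhb_hits
                      if timedelta(0) <= ts - t <= window]
            results.append((line, recent))
    return results
```

To use it, open the evicted node's ocssd.log and pass the file object as `lines`. An abort message with a non-empty list of preceding heartbeat messages matches the pattern described here.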
The split-brain message in the ocssd.log will show "cohort" information. For example:
2012-12-28 20:26:25.803: [ CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1
2012-12-28 20:26:25.803: [ CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [ CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02, based on map type 2
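The cohort membership can be pulled out of these lines with a short Python sketch. The regular expressions are written against the sample lines above and assume other ocssd.log files use the same wording; the function name parse_cohorts is made up for this sketch.

```python
import re

MY = re.compile(r"clssnmCheckDskInfo: My cohort: ([\d,\s]+)")
SURVIVING = re.compile(r"clssnmCheckDskInfo: Surviving cohort: ([\d,\s]+)")

def _nodes(text):
    # "2,3,4" -> [2, 3, 4]; tolerates spaces after the commas
    return [int(n) for n in text.split(",") if n.strip()]

def parse_cohorts(lines):
    """Extract the local and surviving cohorts from ocssd.log lines."""
    info = {}
    for line in lines:
        m = MY.search(line)
        if m:
            info["my_cohort"] = _nodes(m.group(1))
        m = SURVIVING.search(line)
        if m:
            info["surviving_cohort"] = _nodes(m.group(1))
    return info

sample = [
    "2012-12-28 20:26:25.803: [ CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1",
    "2012-12-28 20:26:25.803: [ CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4",
]
print(parse_cohorts(sample))
# {'my_cohort': [1], 'surviving_cohort': [2, 3, 4]}
```

In this example the local node (node 1) is alone in its cohort while nodes 2, 3 and 4 form the larger cohort, so node 1 is the one that aborts.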