I'm trying to troubleshoot the sequence of events in an outage on a 2-node 2008 R2 MSCS-based cluster (we have an IP address and a SQL Server instance clustered). I'll refer to the nodes as NODE05 and NODE06.
Both nodes are running on VMware ESXi 5.x with their database and quorum disks attached via VMware RDM to an IBM XIV over Fibre Channel. NODE06's RDM is set up to use fixed-path addressing, while NODE05's RDM is (incorrectly) set to use round-robin multipathing (we are working to correct this). Each node is running on a different blade center. There is a private heartbeat network.
At the beginning of this event, NODE05 is primary.
At 22:08:20, both nodes report that NODE05 was removed from active failover cluster membership. 23 seconds later, both nodes report that disks were 'unexpectedly lost' by the respective node. These errors continue on for more than two minutes before things seem to come back online.
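To pin down the timeline, here is a quick sketch that turns the offsets above into wall-clock times. The date is a placeholder, and the recovery time is only a lower bound, since all I know is that the errors lasted more than two minutes.

```python
from datetime import datetime, timedelta

# Times come from the cluster logs described above; the date is a placeholder.
node_removed = datetime(2000, 1, 1, 22, 8, 20)

# Both nodes logged the 'unexpectedly lost' disk errors 23 seconds later.
disks_lost = node_removed + timedelta(seconds=23)

# Errors continued for "more than two minutes", so this is a lower bound only.
earliest_recovery = disks_lost + timedelta(minutes=2)


def fmt(t):
    """Format a timestamp as HH:MM:SS."""
    return t.strftime("%H:%M:%S")


timeline = [
    (fmt(node_removed), "NODE05 removed from cluster membership"),
    (fmt(disks_lost), "disks reported 'unexpectedly lost' on both nodes"),
    (fmt(earliest_recovery), "earliest possible recovery (lower bound)"),
]

for stamp, event in timeline:
    print(stamp, event)
```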
Our investigation shows that, at least from the ESX perspective, connectivity to the SAN LUNs was not lost. SAN monitoring also shows nothing "dropping". In addition, I'm not seeing anything in the OS/System event logs indicating the storage was lost -- the disk errors show up only in the cluster logs. So I don't think a SAN disruption was the trigger for this event, but I want to make sure that theory fits with how a 2-node MSCS cluster functions.
I'm theorizing that the node removal event that occurred first (possibly triggered by a network disruption) may have triggered SCSI-3 based "fencing", which would have made the disks appear unavailable on both nodes even though the SAN was still up. However, my understanding is that the SCSI reservation requests that follow the SCSI reset in a "split" like this happen at staggered intervals (three seconds for the "primary" node and seven seconds for "challenger" nodes), so the arbitration really should be resolved fairly quickly -- not the 2+ minutes we saw.
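To make that expectation concrete, here is a rough sketch of the arbitration timing as I understand it. This is a toy model built only from the 3-second and 7-second figures above, not the actual cluster service logic; the function name and the one-second tick are my own assumptions.

```python
def arbitrate(owner_alive, defend_interval=3, challenge_delay=7):
    """Return (winner, seconds_after_reset) for one disk in this toy model.

    Assumptions (mine, not confirmed cluster behavior): the reset at t=0
    clears the reservation, a live owner re-reserves every `defend_interval`
    seconds, and the challenger makes one attempt after `challenge_delay`.
    """
    if challenge_delay < 1:
        raise ValueError("challenge_delay must be at least 1 second")

    reserved_by = None  # the reset has cleared any existing reservation
    for t in range(1, challenge_delay + 1):
        # A live owner defends by re-reserving at each defend interval.
        if owner_alive and t % defend_interval == 0:
            reserved_by = "owner"
        # The challenger only wins if nobody holds the reservation by now.
        if t == challenge_delay:
            if reserved_by is None:
                reserved_by = "challenger"
            return reserved_by, t


for alive in (True, False):
    winner, seconds = arbitrate(owner_alive=alive)
    print(f"owner_alive={alive}: {winner} holds the disk after {seconds}s")
```

In this model, either outcome is settled within the challenger's delay, which is why a 2+ minute window of disk errors looks out of line to me.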
Can someone confirm that I'm on the right track with my thinking? Or possibly describe how a typical failure scenario would play out if the heartbeat network was disrupted for a period of time?