Newbie questions - 2008R2 2-Node Cluster and Failures

Trying to troubleshoot the sequence of events in an outage on a 2-node 2008 R2 MSCS-based cluster (we have an IP address and a SQL Server instance clustered). I will refer to the nodes as NODE05 and NODE06.

Both nodes are running on VMware ESXi 5.x with their database and quorum disks attached via VMware RDM to an IBM XIV over Fibre Channel. NODE06's RDM is set up to use fixed-path addressing, while NODE05's RDM is (incorrectly) set to use round-robin multipathing (we are working to correct this). Each node is running on a different blade center. There is a private heartbeat network.

At the beginning of this event, NODE05 is primary.

At 22:08:20, both nodes report that NODE05 was removed from active failover cluster membership. 23 seconds later, both nodes report that disks were 'unexpectedly lost' by the respective node. These errors continue on for more than two minutes before things seem to come back online.
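For reference when lining up log entries, here is a minimal sketch of the timeline arithmetic. The function name and defaults are only illustrative, and the 120-second duration is an assumed lower bound for "more than two minutes", not a measured value.

```python
from datetime import datetime, timedelta

def outage_timeline(removal="22:08:20", disk_loss_offset_s=23, error_duration_s=120):
    """Return the key wall-clock times of the outage as HH:MM:SS strings.

    error_duration_s is an assumed lower bound ("more than two minutes").
    """
    t_removed = datetime.strptime(removal, "%H:%M:%S")
    t_disks = t_removed + timedelta(seconds=disk_loss_offset_s)
    t_end = t_disks + timedelta(seconds=error_duration_s)
    return {
        "node_removed": t_removed.strftime("%H:%M:%S"),
        "disks_lost": t_disks.strftime("%H:%M:%S"),
        "errors_until_at_least": t_end.strftime("%H:%M:%S"),
    }

for event, when in outage_timeline().items():
    print(f"{when}  {event}")
```

So the disk errors start at 22:08:43 and run until at least 22:10:43.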

Our investigation shows that, at least from the ESX perspective, connectivity to the SAN LUNs was not lost. SAN monitoring also shows nothing "dropping". In addition, I'm not seeing anything in the OS/System event logs indicating the storage was lost; the disk errors show up only in the cluster logs. So I don't think a SAN disruption was the trigger for this event, but I want to make sure that theory fits with how a 2-node MSCS cluster functions.

I'm theorizing that the node removal event, which occurred first and was possibly triggered by a network disruption, may have triggered SCSI-3 based "fencing", which would have made the disks appear unavailable on both nodes even though the SAN was still up. However, my understanding is that the SCSI reservation requests that follow the SCSI reset in a "split" like this happen at staggered intervals (three seconds for the "primary" node and seven seconds for "challenger" nodes), so arbitration really should be resolved fairly quickly, not take the 2+ minutes we saw.
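To make the timing argument concrete, here is a rough sketch of the staggered-retry model as I understand it. It is a simplified illustration under my own assumptions (the 3 s and 7 s delays come from my understanding above, and the function and constant names are made up), not a description of how the cluster service is actually implemented.

```python
def resolve_arbitration(owner_alive=True, owner_delay_s=3, challenger_delay_s=7):
    """Return (winner, seconds_after_reset) for one arbitration round.

    Simplified model: after the reset, each node that can still reach the
    disk makes one reservation attempt at its own delay, and the earliest
    attempt wins.
    """
    attempts = [(challenger_delay_s, "challenger")]
    if owner_alive:
        attempts.append((owner_delay_s, "owner"))
    delay, winner = min(attempts)
    return winner, delay

OBSERVED_ERRORS_S = 120  # assumed lower bound for "more than two minutes"

for alive in (True, False):
    winner, delay = resolve_arbitration(owner_alive=alive)
    print(f"owner_alive={alive}: {winner} holds the disk after {delay}s "
          f"(observed errors lasted at least {OBSERVED_ERRORS_S}s)")
```

Under this model a single round settles within seven seconds either way, which is why the 2+ minute window looks wrong to me.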

Can someone confirm that I'm on the right track with my thinking? Or possibly describe how a typical failure scenario would play out if the heartbeat network was disrupted for a period of time?
