Hello,
I have set up a SQL Server 2012 SE cluster environment on Windows Server 2008 R2 EE for our production. This is an Active/Passive setup with just one instance. The environment is based on shared iSCSI storage, with the same network used for both public and private traffic. I know that using the same network can let ping requests saturate it, which might cause a failover. The nodes are built on virtual machines.
Last Friday we had an outage for a few minutes. We don't know what happened, but the Windows cluster event viewer shows nothing except these error messages:
1. Cluster node 'Node1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
2. Cluster resource 'Cluster Disk 2' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
3. Cluster resource 'Cluster Disk 1' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
4. Cluster resource 'SQL IP Address 1 ()' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
5. Cluster resource 'Cluster Disk 4' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
6. Cluster resource 'Cluster Disk 3' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
7. The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
8. Cluster node 'Node2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
9. The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
10. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
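To make the sequence easier to read, here is a rough Python sketch that groups the messages above by kind. The message strings are shortened copies from the list; the category names (node_removed, resource_failed, and so on) and the classify helper are my own labels for illustration, not anything the cluster itself reports.

```python
import re

# Shortened copies of the messages listed above, in the order they appeared.
EVENTS = [
    "Cluster node 'Node1' was removed from the active failover cluster membership.",
    "Cluster resource 'Cluster Disk 2' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.",
    "Cluster resource 'Cluster Disk 1' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.",
    "Cluster resource 'SQL IP Address 1 ()' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.",
    "Cluster resource 'Cluster Disk 4' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.",
    "Cluster resource 'Cluster Disk 3' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.",
    "The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline.",
    "Cluster node 'Node2' was removed from the active failover cluster membership.",
    "The Cluster service is shutting down because quorum was lost.",
]

# Hypothetical category labels paired with a pattern for each message kind.
PATTERNS = [
    ("node_removed", re.compile(r"Cluster node '([^']+)' was removed")),
    ("resource_failed", re.compile(r"Cluster resource '([^']+)' .* failed")),
    ("group_failed", re.compile(r"failed to bring clustered service or application '([^']+)'")),
    ("quorum_lost", re.compile(r"quorum was lost")),
]


def classify(message):
    """Return (category, subject) for one message; subject is None if the pattern captures no name."""
    for label, pattern in PATTERNS:
        match = pattern.search(message)
        if match:
            subject = match.group(1) if match.groups() else None
            return label, subject
    return "other", None


if __name__ == "__main__":
    for position, message in enumerate(EVENTS, start=1):
        label, subject = classify(message)
        print(position, label, subject)
```

Read this way, the order is: Node1 drops out of membership, the disk and IP resources fail, the SQL Server group fails to come online, Node2 drops out, and only then is quorum reported lost.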
My thoughts:
My guess is that ping request saturation triggered the failover. However, even if the cluster service tried to fail over from Node1 to Node2, it did not succeed. Instead it took the entire instance down, causing an outage. It complained that quorum was not available for a failover.
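For reference, this is how I understand the quorum rule, written as a small Python sketch. It assumes a Node and Disk Majority configuration (one vote per node plus one for the witness disk, three votes in total). I have not confirmed that this is the quorum mode actually configured here, and the has_quorum function is only an illustration of the majority rule, not a cluster API.

```python
def has_quorum(votes_present, total_votes):
    """Majority rule: more than half of all configured votes must be reachable."""
    return votes_present > total_votes // 2


# Assumed vote layout: Node1, Node2, and the witness disk (not verified on this cluster).
TOTAL_VOTES = 3

# Both nodes and the witness disk visible to each other: quorum holds.
print(has_quorum(3, TOTAL_VOTES))

# One node is lost but the survivor still owns the witness disk: quorum holds.
print(has_quorum(2, TOTAL_VOTES))

# A node that can see neither its partner nor the witness disk: quorum is lost.
print(has_quorum(1, TOTAL_VOTES))
```

If that assumption is right, a single node losing the network should not be enough to lose quorum on its own; the surviving node would also have to lose the witness disk, which would fit with the disk resources failing in the messages above.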
Next steps:
I immediately logged in to the second node, opened Failover Cluster Manager, and brought the SQL Server service online. It came up fine and we were back online. I checked the SQL Server logs and found nothing there. I asked the storage admin whether a drive failure had caused this issue, but everything looked normal on the storage side. I then told the network admin that we need a separate network for private communication between the two nodes. He checked the network logs and told me that our network bandwidth is 2GB, that there were no spikes in the network at the time the failover occurred, and that only 100MB of data was being sent/received at that time.
Experts, is there a location I can check to confirm what triggered the failover, and why, once it was triggered, the other node was not able to bring the instance online, which caused the outage? Or is there a way I can prove that ping saturation triggered the failover?
Thanks for the valuable inputs.