Channel: High Availability (Clustering) forum

2012 R2 Cluster - Active Node ejects all other nodes - random times


ISSUE

We have a 4-node 2012 R2 cluster: Active, Passive, File Share, and a Passive DR server.

Our issue is that the active node appears to lose all cluster communication and ejects all other nodes. We cannot find any system event log entries that indicate a loss of the local area connection or network drops. During these events, our third-party monitoring tool has never lost a ping to this system or shown it as down.

Our current Band-Aid fix is to set the Cluster service to restart automatically after failure. This gets the cluster back online after 60 seconds, but we are still down for those 60 seconds. We have not enabled automatic failover because not all applications have been tested on node 2 of production yet.

Here are the variables for our environment.

The cluster is physical, on Dell hardware. The network team currently sees no errors in OpenManage SA.

Network team shows no indication of flapping on the switch.

Systems:

Active - SQL-CL02 - 1 Vote (Active Cluster Owner)

Passive - SQL-CL03 - 1 Vote

File Share - WIN2012-FS01 - 1 Vote

Passive DR - SQL-CL01 - 0 Votes
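
As a rough sanity check of the vote layout above, here is a minimal Python sketch of simple majority-vote arithmetic. It is an assumption-laden simplification: the names `VOTES` and `has_quorum` are made up for illustration, and it ignores any dynamic vote adjustment the cluster may perform. It only shows that, with these static votes, a node that can reach neither SQL-CL03 nor the file share holds 1 of 3 votes and cannot keep quorum.

```python
# Static votes as listed in this post (hypothetical structure, for illustration only).
VOTES = {
    "SQL-CL02": 1,       # active node, cluster owner
    "SQL-CL03": 1,       # passive node
    "WIN2012-FS01": 1,   # file share
    "SQL-CL01": 0,       # passive DR node, no vote
}


def has_quorum(reachable, votes=VOTES):
    """Return True if the reachable members hold a strict majority of all votes.

    Simplified model: plain majority of static votes, no dynamic adjustment.
    """
    total = sum(votes.values())
    held = sum(votes[name] for name in reachable)
    return held > total / 2


if __name__ == "__main__":
    # Active node isolated from both SQL-CL03 and the file share: 1 of 3 votes.
    print(has_quorum(["SQL-CL02"]))
    # Active node plus either other voter: 2 of 3 votes.
    print(has_quorum(["SQL-CL02", "SQL-CL03"]))
    print(has_quorum(["SQL-CL02", "WIN2012-FS01"]))
```

Under this model, losing sight of both SQL-CL03 and the file share at the same moment would be enough to drop the active node below a majority, which matches the order of the 1135, 1564, and 1177 events listed further down.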

Cluster Networking Info:

Production - Network in use for cluster communications.

10.100.1.7/26

Backup Network - Disabled for cluster communications.

DR - Network in use for cluster communications.

10.200.1.7/26
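
For reference, the two cluster networks above can be expanded into their subnet ranges. This is only a small Python sketch using the standard `ipaddress` module. The dictionary `CLUSTER_INTERFACES` and the helper `describe` are made-up names for illustration; the addresses are the ones listed above.

```python
import ipaddress

# Addresses as listed in this post; the names are illustrative only.
CLUSTER_INTERFACES = {
    "Production": "10.100.1.7/26",
    "DR": "10.200.1.7/26",
}


def describe(cidr):
    """Return (network, first usable host, last usable host) for an address/prefix."""
    network = ipaddress.ip_interface(cidr).network
    hosts = list(network.hosts())
    return str(network), str(hosts[0]), str(hosts[-1])


if __name__ == "__main__":
    for name, cidr in CLUSTER_INTERFACES.items():
        network, first, last = describe(cidr)
        print(f"{name}: network {network}, usable hosts {first} to {last}")
```

The two /26 ranges do not overlap, so cluster traffic between the Production and DR networks has to be routed rather than staying on one segment.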

Failure Events in order of time from cluster event logs.

1135 - Cluster node 'SQL-CL03' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

*** (No network connection problems identified; our third-party monitoring tool showed active pings throughout this event.)

1135 - Cluster node 'SQL-CL01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

1564 - File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\WIN2012-FS01\Witness'. Please ensure that file share '\\WIN2012-FS01\Witness' exists and is accessible by the cluster.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

1069 - Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

1177 - The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

1561 - The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node.

Try starting the cluster service on all nodes in the cluster. If the cluster service can be started on other nodes with the latest copy of the cluster configuration data, this node will be able to subsequently join the started cluster successfully.

If there are no nodes available with the latest copy of the cluster configuration data, please consult the documentation for 'Force Cluster Start' in the failover cluster manager snapin, or the 'forcequorum' startup option. Note that this action of forcing quorum should be considered a last resort, since some cluster configuration changes may well be lost.

1069 - Cluster resource 'WIN2012-SQLAG-01_10.100.1.7' of type 'IP Address' in clustered role 'WIN2012-SQLAG-01' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Thanks for your consideration. Where else might we look for more information on this issue?

-D

