Channel: High Availability (Clustering) forum

Server 2012 R2 Cluster Failover issues during catastrophic failure of iSCSI shared storage on Node 1 of 2

In another posting, I detailed a simple two-node Server 2012 R2 cluster configured with a single file server role.  Both nodes access shared iSCSI storage.  A client system performs a simple looped 64 KB read on a file share from the clustered server.  The test applet fails any time the ReadFile() API fails.
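For readers who want to reproduce the client side, here is a minimal sketch of a looped 64 KB read. It is a Python approximation, not the original applet (which calls the Win32 ReadFile() API directly); the function name `looped_read` and its return convention are assumptions made for illustration.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB per read, matching the test described above


def looped_read(path, iterations):
    """Read the first 64 KB of `path` repeatedly.

    Returns (reads_completed, error). `error` is None if every read
    succeeded, otherwise the OSError that stopped the loop.
    """
    completed = 0
    try:
        # buffering=0 opens the file unbuffered, so each read() is a new OS call
        with open(path, "rb", buffering=0) as f:
            for _ in range(iterations):
                f.seek(0)
                f.read(BLOCK_SIZE)
                completed += 1
    except OSError as exc:
        # Stop at the first failure, as the test applet does
        return completed, exc
    return completed, None
```

Pointing `path` at a file on the clustered share (the share name is not given in this post, so any UNC path used here would be hypothetical) gives a loop that stops at the first failed read.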

When one node experiences a complete power loss, the other node will take over and no failure will occur on the client system (other than an ~30 second delay where the synchronous ReadFile() API does not return).  This was detailed in the other post. The key point is that failover completed successfully and NO failure was seen on the client PC.

I wanted to test a different failover scenario.  With Node 1 as the current host server and the owner node for the file server role, quorum disk, and shared storage, I simulate a catastrophic failure of Node 1’s iSCSI storage.  The OS itself boots from a virtual ATA drive, so the OS drive remains alive.  To do this, I go into Device Manager, find the NIC(s) configured for the iSCSI network, and disable them.  This causes all iSCSI I/O activity to fail.  When I perform this test, the following is seen from the client PC:

  • 11:20:01 - Node 1 loss of iSCSI shared storage
  • 11:22:01 - Synchronous ReadFile() returns after 120 seconds with ERROR_INVALID_HANDLE
  • 11:22:20 - CloseHandle() completes successfully after 19 seconds
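The two delays quoted in the timeline follow directly from the timestamps. A small sketch of the arithmetic (the helper name `seconds_between` is made up for this example):

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Return the whole seconds elapsed between two same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


storage_lost = "11:20:01"   # Node 1 loses iSCSI shared storage
read_returned = "11:22:01"  # ReadFile() returns ERROR_INVALID_HANDLE
close_done = "11:22:20"     # CloseHandle() completes

read_stall = seconds_between(storage_lost, read_returned)
close_stall = seconds_between(read_returned, close_done)
print(read_stall, close_stall)  # prints: 120 19
```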

I understand the 120 second delay.  It’s a combination of the iSCSI link down timer as well as the MPIO PDO remove period timer.  What I don’t understand is why the request failed.  I expected a failover and successful completion of the ReadFile like I saw when I did a power loss of Node 1.

What’s interesting is that Windows did move the host server, file server, and disk storage to Node 2, so a failover did occur.  At least it did this time; other times, I have seen the file server role stopped.  It appears that the failover occurred immediately after the client PC received the read failure.  Why not complete the failover before failing the client ReadFile() request?  Are there timeout adjustments I could experiment with?

Here is Node 1's event log which shows the failover sequence from Node 1's perspective starting just after the catastrophic iSCSI loss on Node 1:

   ProviderName: Microsoft-Windows-FailoverClustering

TimeCreated                     Id LevelDisplayName Message
-----------                     -- ---------------- -------
7/29/2015 11:20:06 AM         1132 Information      Cluster network interface 'TESTCLUSTER1 - Ethernet 2' for node 'TESTCLUSTER1' on network 'Cluster Network 1' was removed.
7/29/2015 11:21:38 AM         1649 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has taken more than one minute to respond to a control code. The control code was 'STORAGE_GET_DISK_INFO_EX'.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'd47df305-c3a6-4bbe-8475-48e1398bbee6' has been disconnected from this node.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'b543b5ab-67c9-4836-89a1-ae5636916de5' has been disconnected from this node.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToTerminate. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Terminating to state WaitingToComeOnline. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1633 Information      The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed or an offline state. This may impact the availability of the clustered role.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToComeOnline to state OfflineDueToProvider. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster IP Address' is waiting on the following resources: Cluster Name.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster Name' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'DATAFS' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'DATAFS' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'IP Address 10.18.236.0' is waiting on the following resources: DATAFS.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'Cluster Group' is moving to cluster node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'DATAFS' is moving to cluster node 'TESTCLUSTER2'.
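A log this long is easier to follow when reduced to the state transitions per resource. The sketch below parses lines in the format shown above and keeps only the ID 1637 entries, which in this log are the state-transition messages. The regular expressions and the function name `parse_transitions` are assumptions fitted to this particular output, not a general event log parser.

```python
import re
from datetime import datetime

# Timestamp, event ID, level, message, as laid out in the log above
EVENT_RE = re.compile(r"^(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M)\s+(\d+)\s+(\S+)\s+(.*)$")
TRANSITION_RE = re.compile(
    r"Cluster resource '(.+?)' in clustered role '(.+?)' "
    r"has transitioned from state (\w+) to state (\w+)"
)


def parse_transitions(lines):
    """Return (time, resource, old_state, new_state) for each ID 1637 line."""
    result = []
    for line in lines:
        event = EVENT_RE.match(line.strip())
        if not event or event.group(2) != "1637":
            continue  # skip headers, blank lines, and other event IDs
        transition = TRANSITION_RE.search(event.group(4))
        if transition:
            when = datetime.strptime(event.group(1), "%m/%d/%Y %I:%M:%S %p")
            result.append(
                (when, transition.group(1), transition.group(3), transition.group(4))
            )
    return result
```

Feeding the log lines through `parse_transitions` and grouping by resource name shows, for example, 'Cluster Disk 1' going from Online to ProcessingFailure at 11:22:01 and reaching CannotComeOnlineOnThisNode at 11:22:02.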

