Channel: High Availability (Clustering) forum

Server 2012 R2 Cluster Failover Timing

I am running an experiment with clustered servers to understand failover timing and failover recovery.  I have a couple of questions I was hoping this forum could help me with.  My setup is:

  • Two clustered 2012 R2 servers (Node 1 and Node 2) - all updates installed - same subnet - shared iSCSI storage
  • One file server role running
  • Node 1 is current host server
  • Node 1 is owner node for file server role
  • Node 1 is owner node for quorum disk and for file server disk
  • Single file share \\MIKEFS

On the file share, I placed a 64 KB file.  On a client PC (2012 R2) with access to the share, I ran a simple executable I wrote that does the following:

  1. Get a handle to the file (CreateFile with FILE_FLAG_NO_BUFFERING)
  2. Set the file position to 0, perform a synchronous ReadFile of 64 KB, and record the execution time
  3. Repeat step 2 until any failure occurs
  4. Close the file handle (CloseHandle)
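
For readers who want to reproduce the test, the loop above can be approximated with a minimal Python sketch. This is not the original executable: the function name `timed_read_loop` and its parameters are hypothetical, and Python's `open` does not set FILE_FLAG_NO_BUFFERING, so the operating system cache may still serve some reads.

```python
import time

BLOCK_SIZE = 64 * 1024  # 64 KB, matching the size of the test file


def timed_read_loop(path, max_iterations=None, slow_threshold_s=1.0):
    """Read the first 64 KB of `path` repeatedly and time each read.

    Returns (completed_reads, slow_reads), where slow_reads is a list of
    (iteration, seconds) for reads that took longer than slow_threshold_s.
    The loop stops on the first failure, as in the steps above.
    """
    completed = 0
    slow_reads = []
    # buffering=0 disables Python's own buffer only. It is NOT equivalent
    # to FILE_FLAG_NO_BUFFERING; the OS cache is still in play.
    with open(path, "rb", buffering=0) as handle:
        while max_iterations is None or completed < max_iterations:
            start = time.perf_counter()
            try:
                handle.seek(0)
                data = handle.read(BLOCK_SIZE)
            except OSError as error:
                print(f"read {completed} failed: {error}")
                break
            elapsed = time.perf_counter() - start
            if data is None or len(data) != BLOCK_SIZE:
                # Treat a short read as a failure in this sketch.
                print(f"read {completed} returned an unexpected length")
                break
            if elapsed > slow_threshold_s:
                slow_reads.append((completed, elapsed))
            completed += 1
    return completed, slow_reads
```

Pointing `path` at the test file on the share and leaving `max_iterations` as `None` gives the "loop until failure" behavior; a long stall like the one reported below would show up as a single entry in `slow_reads`.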

During the middle of the test, I removed power from Node 1.  The following sequence of events occurred:

  • 8:20:01 - Node 1 Power Loss
  • 8:20:07 - Node 2 records Event ID 1135 (i.e. reports node 1 was removed from the active failover cluster membership)
  • 8:20:39 - A single 64 KB ReadFile call in the test application takes 37.6 seconds to return
  • Client test application continued to run without any failure (just that one long delay noted above)
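
The intervals in this timeline can be worked out with a short Python sketch. It assumes that 8:20:39 is the moment the slow ReadFile call returned, that the client and node clocks agree, and that the timestamps are only accurate to about one second; none of these is stated explicitly above.

```python
from datetime import datetime, timedelta


def at(clock):
    """Parse an H:MM:SS wall-clock time from the timeline."""
    return datetime.strptime(clock, "%H:%M:%S")


power_loss = at("8:20:01")      # Node 1 power loss
event_1135 = at("8:20:07")      # Node 2 logs Event ID 1135
read_returned = at("8:20:39")   # assumed: time the slow ReadFile returned
read_duration = timedelta(seconds=37.6)

# Time from power loss until Node 2 reported the node as removed.
detection_delay_s = (event_1135 - power_loss).total_seconds()

# Time from Event 1135 until the client read completed.
stall_after_detection_s = (read_returned - event_1135).total_seconds()

# Working backwards, when the slow read must have started.
read_started = read_returned - read_duration

print(detection_delay_s, stall_after_detection_s, read_started.time())
```

Under those assumptions the slow read started at roughly the same second as the power loss, which is consistent with the stall covering the whole failover.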

Node 2 determined relatively quickly that Node 1 was lost.  The default heartbeat settings help explain the time from the power loss to the time Node 2 reports the loss (SameSubnetDelay=1000 ms and SameSubnetThreshold=5, i.e. five missed heartbeats at one-second intervals, or about 5 seconds).

So ~5 seconds of the 37.6 seconds was due to the heartbeat settings.  The remaining ~32.6 seconds was the time to fail over the role, mount the volumes, etc.  How can I determine what was going on during this time?  What is taking the longest amount of time?  Are there any settings that could adjust these timings?  I looked at cluster.log, but there is a lot of detail in there and it's difficult to determine why it takes so long to recover once the surviving node has detected the failure.
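
The split described above is simple arithmetic, sketched here in Python as a sanity check. The function names are hypothetical; the only inputs are the two heartbeat settings and the 37.6 second stall quoted in this post.

```python
SAME_SUBNET_DELAY_MS = 1000   # interval between heartbeats, in milliseconds
SAME_SUBNET_THRESHOLD = 5     # missed heartbeats before a node is declared down
OBSERVED_STALL_S = 37.6       # duration of the one slow ReadFile call


def heartbeat_window_s(delay_ms, threshold):
    """Approximate time to declare a node down: delay x threshold."""
    return delay_ms * threshold / 1000.0


def remaining_after_detection_s(total_stall_s, delay_ms, threshold):
    """Portion of the stall not explained by heartbeat detection."""
    return total_stall_s - heartbeat_window_s(delay_ms, threshold)


window = heartbeat_window_s(SAME_SUBNET_DELAY_MS, SAME_SUBNET_THRESHOLD)
remaining = remaining_after_detection_s(
    OBSERVED_STALL_S, SAME_SUBNET_DELAY_MS, SAME_SUBNET_THRESHOLD
)
print(f"detection window: {window:.1f} s, unexplained: {remaining:.1f} s")
```

Even if detection were instantaneous, most of the stall would remain, so the heartbeat settings alone cannot account for the delay.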

