I am running an experiment with clustered servers to understand failover timing and failover recovery. I have a couple of questions I was hoping this forum could help me with. My setup is:
- Two clustered 2012 R2 servers (Node 1 and Node 2), all updates installed, same subnet, shared iSCSI storage
- One file server role running
- Node 1 is current host server
- Node 1 is owner node for file server role
- Node 1 is owner node for quorum disk and for file server disk
- Single file share \\MIKEFS
On the file share, I placed a 64 KB file. On a client PC (2012 R2) with access to the share, I ran a simple executable I wrote that does the following (a rough sketch of the loop follows this list):
- Get a handle to the file (CreateFile with FILE_FLAG_NO_BUFFERING)
- Set the file position to 0 and issue a synchronous 64 KB ReadFile, timing each call
- Loop on the read step forever until any failure occurs
- Close the file handle (CloseHandle)
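In case it helps to see exactly what the client is doing, here is a rough sketch of that loop. The UNC path, file name, and the one-second "slow read" threshold are placeholders, and error handling is trimmed to the minimum:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD READ_SIZE = 64 * 1024;      /* 64 KB per read */

    /* Open the file unbuffered so every read goes out to the file server. */
    HANDLE hFile = CreateFileA(
        "\\\\MIKEFS\\Share\\test64k.dat",   /* placeholder UNC path */
        GENERIC_READ, FILE_SHARE_READ, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* FILE_FLAG_NO_BUFFERING needs a sector-aligned buffer; VirtualAlloc
       returns page-aligned memory, which satisfies that requirement. */
    BYTE *buf = (BYTE *)VirtualAlloc(NULL, READ_SIZE,
                                     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (buf == NULL) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        CloseHandle(hFile);
        return 1;
    }

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    /* Seek to offset 0, issue a synchronous 64 KB read, time the call,
       and repeat until anything fails. */
    for (;;) {
        LARGE_INTEGER zero = { 0 };
        if (!SetFilePointerEx(hFile, zero, NULL, FILE_BEGIN)) {
            printf("SetFilePointerEx failed: %lu\n", GetLastError());
            break;
        }

        DWORD bytesRead = 0;
        QueryPerformanceCounter(&t0);
        BOOL ok = ReadFile(hFile, buf, READ_SIZE, &bytesRead, NULL);
        DWORD err = ok ? 0 : GetLastError();
        QueryPerformanceCounter(&t1);

        double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

        if (!ok || bytesRead != READ_SIZE) {
            printf("ReadFile failed after %.3f s: error %lu, bytes %lu\n",
                   seconds, err, (unsigned long)bytesRead);
            break;
        }
        if (seconds > 1.0)                  /* e.g. the 37.6 s stall during failover */
            printf("Slow read: %.3f s\n", seconds);
    }

    /* Close the handle and release the buffer. */
    CloseHandle(hFile);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```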
Partway through the test, I removed power from Node 1. The following sequence of events occurred:
- 8:20:01 - Node 1 Power Loss
- 8:20:07 - Node 2 records Event ID 1135 (reporting that Node 1 was removed from the active failover cluster membership)
- 8:20:39 - The test application's in-flight ReadFile call finally returns, having taken 37.6 seconds
- The client test application continued to run without any failure; the only symptom was the one long delay noted above
Node 2 determined relatively quickly that Node 1 was lost. The default heartbeat settings explain the gap between the power loss and Node 2 reporting it: with SameSubnetDelay=1000 ms and SameSubnetThreshold=5, a node is declared down after 5 consecutive missed heartbeats, i.e. about 5 seconds, which lines up with the ~6 seconds between the power loss and Event ID 1135.
So ~5 seconds of the 37.6-second stall is accounted for by heartbeat detection. The remaining ~32.6 seconds is the time to fail the file server role over to Node 2, bring the disks online, mount the volumes, and so on. How can I determine what is going on during this window? Which step takes the longest? Are there any settings that could adjust these timings? I looked at cluster.log, but there is so much detail in there that it is difficult to work out why it takes this long from the point Node 2 detects the failure.
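In case it is useful, here is a minimal sketch of the kind of filter I have in mind for narrowing cluster.log down to just the failover window. It assumes each line starts with the usual process.thread::YYYY/MM/DD-HH:MM:SS.mmm prefix; the file path, date, and timestamp bounds below are placeholders (and cluster.log timestamps are normally UTC, so they will not match the local 8:20 wall-clock times directly):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Placeholder bounds around the failover window (fixed-width format,
       so a plain string comparison orders them chronologically). */
    const char *lo = "2015/02/12-08:20:00";
    const char *hi = "2015/02/12-08:20:45";

    FILE *fp = fopen("cluster.log", "r");   /* placeholder path */
    if (fp == NULL) {
        perror("cluster.log");
        return 1;
    }

    char line[4096];
    while (fgets(line, sizeof line, fp) != NULL) {
        /* The timestamp follows the "processid.threadid::" prefix. */
        const char *ts = strstr(line, "::");
        if (ts == NULL)
            continue;
        ts += 2;

        /* Keep only lines whose first 19 timestamp characters fall
           inside [lo, hi]. */
        if (strncmp(ts, lo, 19) >= 0 && strncmp(ts, hi, 19) <= 0)
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}
```

Even with the log cut down to that window, I am still not sure which entries mark the expensive steps, which is really what I am asking about above.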