I have a Windows Server 2012 R2 multi-site cluster with 5 nodes. Node 1 (at the main site) fails multiple times a day, each time with the same error in its cluster log, shown below. All the other nodes log missed heartbeats. If the cause were latency, I would expect every node to fail at one point or another, but node 1 is the only one that drops out.
The servers are all the same model with the same drivers. The switch ports show no errors, and I see no dropped UDP packets in perfmon. I have checked everything in the following blog post: http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx and in this one: http://blogs.technet.com/b/askcore/archive/2012/07/09/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx.
I do not want to change the cluster delay and threshold values except as a last resort. Ping times between the sites stay in the 3-5 ms range, even during the issue, which is well within the recommended range.
Does anyone know what causes the following failure, which seems to kick off the issue?
"Failed to retrieve the results of overlapped I/O: (10054)"
Node1:
00000284.00002498::2015/06/15-05:58:35.919 DBG [CHANNEL 169.254.x.x:~3343~]/recv: Failed to retrieve the results of overlapped I/O: (10054)
00000284.00002498::2015/06/15-05:58:35.919 DBG [CHANNEL 169.254.x.x:~3343~] Closing due to error: (0).
00000284.00002498::2015/06/15-05:58:35.919 DBG [CHANNEL 169.254.x.x:~3343~] Close().
00000284.00002498::2015/06/15-05:58:35.919 WARN [CHANNEL 169.254.x.x:~3343~] failure, status (0)
All the other nodes:
00003700.000025a4::2015/06/15-05:58:33.825 DBG [NETFTEVM] FTI NetFT event handler got event: LocalEndpoint 10.x.x.x:~3343~ has missed two consecutive heartbeats from 10.x.x.x:~3343~
00003700.000025a4::2015/06/15-05:58:33.825 DBG [NETFTEVM] TM NetFT event handler got event: LocalEndpoint 10.x.x.x:~3343~ has missed two consecutive heartbeats from 10.x.x.x:~3343~
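To line up the two excerpts in time, here is a minimal Python sketch that parses the timestamps from the lines above and measures the gap between the first missed-heartbeat event on the other nodes and the 10054 error on node 1. Winsock error 10054 is WSAECONNRESET (connection reset by peer). The helper names (`parse_line`, `first_match`) are made up for illustration, and the comparison assumes both logs use the same time base (for example, both generated in UTC) and that the node clocks are in sync.

```python
import re
from datetime import datetime

# Prefix of a cluster log line as it appears above:
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.fff> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
    return ts, m.group("level"), m.group("msg")


def first_match(lines, needle):
    """Timestamp of the first parsed line whose message contains needle."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and needle in parsed[2]:
            return parsed[0]
    return None


# Lines copied from the excerpts above (addresses already masked).
node1_log = [
    "00000284.00002498::2015/06/15-05:58:35.919 DBG [CHANNEL 169.254.x.x:~3343~]/recv: "
    "Failed to retrieve the results of overlapped I/O: (10054)",
    "00000284.00002498::2015/06/15-05:58:35.919 WARN [CHANNEL 169.254.x.x:~3343~] failure, status (0)",
]
other_log = [
    "00003700.000025a4::2015/06/15-05:58:33.825 DBG [NETFTEVM] FTI NetFT event handler got event: "
    "LocalEndpoint 10.x.x.x:~3343~ has missed two consecutive heartbeats from 10.x.x.x:~3343~",
]

reset_at = first_match(node1_log, "(10054)")
missed_at = first_match(other_log, "missed two consecutive heartbeats")
gap_seconds = (reset_at - missed_at).total_seconds()
print(f"missed heartbeats logged {gap_seconds:.3f} s before the 10054 error")
```

If the clocks can be trusted, this puts the missed heartbeats on the other nodes roughly two seconds ahead of the connection reset on node 1, so the 10054 error may be a consequence of the heartbeat loss rather than its trigger.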