Hello all,
Early yesterday, our cluster (an Exchange 2010 DAG) removed all active nodes. The root cause was that we lost contact with the other cluster node (over a WAN link) and, at the same time, lost connectivity to the file share witness (event ID 1564 was recorded).
The odd thing is that the cluster should tolerate the second node being unreachable, as long as the remaining node can still reach the file share witness. Since the file share witness is located at the same physical site (possibly even on the same hypervisor host), it is surprising that the share was ever unreachable.
The systems engineering team has adjusted the CrossSubnetDelay and CrossSubnetThreshold values, but I do not think this will solve the problem of the node being unable to reach the file share witness.
[PS] C:\>cluster /prop | Select-String "Subnet"

D  CONTOSO-DAG  CrossSubnetDelay           4000 (0xfa0)   # not default
D  CONTOSO-DAG  CrossSubnetThreshold       10 (0xa)       # not default
D  CONTOSO-DAG  PlumbAllCrossSubnetRoutes  0 (0x0)
D  CONTOSO-DAG  SameSubnetDelay            1000 (0x3e8)   # default
D  CONTOSO-DAG  SameSubnetThreshold        5 (0x5)        # default
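For reference, here is a rough sketch of what those values mean in practice. It assumes the usual reading that a node is declared unreachable after "threshold" consecutive missed heartbeats sent every "delay" milliseconds; the variable names are mine and the numbers are copied from the output above. As far as I understand, these settings only govern node-to-node heartbeats and not access to the witness share.

```powershell
# Values copied from the cluster /prop output above.
# Variable names are illustrative only, not cluster property objects.
$crossSubnetDelayMs   = 4000   # ms between heartbeats across subnets
$crossSubnetThreshold = 10     # missed heartbeats tolerated across subnets
$sameSubnetDelayMs    = 1000   # ms between heartbeats within a subnet
$sameSubnetThreshold  = 5      # missed heartbeats tolerated within a subnet

# Assumption: tolerance is approximately delay * threshold.
$crossSubnetToleranceSec = ($crossSubnetDelayMs * $crossSubnetThreshold) / 1000
$sameSubnetToleranceSec  = ($sameSubnetDelayMs * $sameSubnetThreshold) / 1000

"Cross-subnet tolerance: $crossSubnetToleranceSec s"
"Same-subnet tolerance:  $sameSubnetToleranceSec s"
```

So the adjusted values give roughly 40 seconds of tolerance for the WAN link, versus about 5 seconds within a subnet.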
What else can I investigate regarding the file share witness being inaccessible? Are there any settings we can adjust to make the node more tolerant of brief losses of file share witness availability? A variety of things may be happening on the file share witness server, such as VSS snapshots, but none of them should interfere with (1) the operation of the OS or (2) the availability and accessibility of any share.
Thanks,
Matt