Hi Team,
We have a 3-node Windows Server Failover Cluster (WSFC) running SQL Server, with a cluster disk as the quorum witness. Since it is going to host an availability group (AG), the third node, in the DR site, is part of the WSFC but does not have a vote. Node 1 and node 2 in the primary site have shared storage between them; node 3 in DR has independent storage.
We are seeing an issue where node 3 in the DR site goes down and never comes back online, no matter what we try (reboot, disabling and re-enabling the network adapter, disabling IPv6, etc.). When we run cluster validation, it fails every time on network communication between node 3 and nodes 1 and 2 in the primary site.
The odd part is that if we evict node 3 and rejoin it to the cluster, running cluster validation while joining the node, the validation report shows communication established successfully. We have tried evicting and rejoining a couple of times, and it succeeds every time. We are having a hard time understanding why node 3 starts communicating once it is evicted and rejoined, when as a down node it showed a communication failure on UDP port 3343 every time.
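For anyone who wants to cross-check the validation result outside the cluster tooling, something like the Python sketch below could test a UDP round trip between two nodes. It is only an illustration, not something taken from the validation report: the function names start_udp_echo and udp_probe are made up here, and because the cluster service itself listens on 3343 on a running node, the echo side is assumed to use a spare test port (or to run while the cluster service is stopped).

```python
import socket
import threading


def start_udp_echo(bind_host, port):
    """Bind a UDP socket and echo back one datagram on a background thread.

    Hypothetical helper for illustration. Pass port=0 to let the OS pick
    a free port. Returns the bound socket and the worker thread.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, port))

    def serve():
        try:
            data, addr = sock.recvfrom(1024)
            sock.sendto(data, addr)  # send the payload straight back
        except OSError:
            pass  # socket was closed before a datagram arrived

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    return sock, worker


def udp_probe(host, port, timeout=2.0):
    """Send one datagram and report whether the same payload comes back."""
    payload = b"wsfc-probe"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:  # covers timeouts and ICMP "port unreachable" errors
            return False
        return data == payload
```

The idea would be to run start_udp_echo on node 1 or node 2 and call udp_probe from node 3 (host names and port are placeholders). A False result only means no reply arrived in time; it does not say where the datagram was dropped.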
Has anyone seen this, and do you have any pointers on how to resolve it or where we should look? Everything is now open between the nodes (firewall set to any-to-any, AV removed from all nodes). What else should we check?
Any pointers will be appreciated. Thanks.
Regards,