We are setting up a new system at work, and I have run into a problem that I hope others have seen and can offer some recommendations on.
I have set up a new two-node Windows Server 2012 R2 WSFC that hosts a SQL Server AlwaysOn Availability Group (AG). I have set up AGs before and never really had problems with them, but now I am starting to see an issue. Our WSFC nodes are on one VLAN and subnet, and the clients that connect to SQL (SharePoint servers in this case) are on another VLAN and subnet. Whenever the AG fails over, the SharePoint servers lose connectivity to the AG for approximately 13-15 minutes before it is restored. It is not a routing issue: continuous pings to the WSFC nodes and the WSFC IP never fail; only pings and other connection attempts to the AG listener do. It is also not a problem with SQL itself, as I can still verify connectivity to a specific AG node.
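For anyone who wants to reproduce the measurement, here is a minimal sketch of a TCP probe that records when the listener stops and starts answering. The listener name in the usage comment is hypothetical, and the helper names (`probe`, `monitor`, `outage_windows`) are made up for illustration; 1433 is only the SQL Server default port and may differ in your environment.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False


def monitor(host, port, duration_s, interval_s=1.0):
    """Probe host:port roughly every interval_s seconds for duration_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval_s)
    return samples


def outage_windows(samples):
    """Turn (timestamp, ok) samples into a list of (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open when the samples run out is
    reported with end=None.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Example usage (hypothetical listener name; run it, then trigger a failover):
#   samples = monitor("ag-listener.example.local", 1433, duration_s=1200)
#   for start, end in outage_windows(samples):
#       print(start, end)
```

Running the same probe from a host on the SQL VLAN and from one of the SharePoint servers at the same time should show whether the outage is limited to clients behind the router.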
After looking at this problem with our networking guys, it appears to be related to gratuitous ARP (GARP). When the AG fails over, a GARP is supposed to notify devices on the local network of the new IP/MAC mapping so that requests to that IP are sent to the right network interface. However, I work in a federal government environment, and DISA STIGs mandate that infrastructure routers disable GARP (V-5618, for those who are interested: https://www.stigviewer.com/stig/infrastructure_router__cisco/2013-10-08/finding/V-5618). The effect seems to be that traffic from outside the local network times out after a failover until the entry in the VLAN's ARP table expires and gets updated. Temporarily lifting that restriction for testing made AG failovers nearly instantaneous.
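The behaviour described above can be illustrated with a toy model of a router's ARP cache. This is only a sketch under stated assumptions, not how any particular router is implemented: the class and function names, the listener IP, the MAC labels and the 900-second timeout are all hypothetical, with the timeout picked only to land in the same range as the observed outage. The model also assumes an entry is refreshed only when it is learned or re-resolved.

```python
class ArpCache:
    """Toy ARP cache for one VLAN interface (illustration only)."""

    def __init__(self, timeout_s, accept_garp):
        self.timeout_s = timeout_s
        self.accept_garp = accept_garp
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def on_gratuitous_arp(self, ip, mac, now):
        # With GARP disabled, the announcement is simply ignored.
        if self.accept_garp:
            self.learn(ip, mac, now)

    def lookup(self, ip, now, resolve):
        """Return the cached MAC, re-resolving if the entry is missing or expired."""
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] >= self.timeout_s:
            mac = resolve(ip)
            self.learn(ip, mac, now)
            return mac
        return entry[0]


def stale_seconds(accept_garp, timeout_s=900, failover_at=60):
    """Seconds after failover during which the cache still points at the old node."""
    ip = "10.0.0.50"  # hypothetical listener IP
    owner = {"mac": "MAC_NODE_A"}
    cache = ArpCache(timeout_s, accept_garp)
    cache.learn(ip, owner["mac"], now=0)

    # Failover: the listener IP moves to node B, which announces itself via GARP.
    owner["mac"] = "MAC_NODE_B"
    cache.on_gratuitous_arp(ip, owner["mac"], now=failover_at)

    t = failover_at
    while cache.lookup(ip, t, lambda _ip: owner["mac"]) != owner["mac"]:
        t += 1
    return t - failover_at


# stale_seconds(accept_garp=True)   -> 0
# stale_seconds(accept_garp=False)  -> 840 (stale until the entry ages out)
```

In this model the outage with GARP ignored is simply whatever is left of the ARP timeout at the moment of failover, which is consistent with connectivity returning on its own after a fixed-looking delay.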
Has anyone encountered this, or does anyone have recommendations? I am trying to make the case for multiple NICs on the SharePoint servers so that they can communicate with the SQL servers on the same VLAN, but I am getting pushback and would like to know whether there is an alternative solution worth researching.
Joie Andrew "Since 1982"