Channel: High Availability (Clustering) forum

Random Cluster Failures


Hey guys, 

I really need a hand here. I have a production cluster on Windows Server 2012 R2 Datacenter with two R630s (256 GB RAM) and three R610s (192 GB RAM), one of which is a hot spare. Recently I updated the NICs with Microsoft drivers (Intel Ethernet Server Adapter X520-2 driver for 2012 R2 Datacenter), and shortly afterwards a lot of VMs started randomly failing on random hosts, a few at a time.

There are 160 VMs, averaging 30-40 VMs per host.

After updates, re-installs of the actual Intel drivers, and pushing out VM hardware reconfigurations, I finally found a huge issue: the driver update had cut the VMQ ports back to the default of 32.

I reconfigured all of them back to 64, and for a few days I had no issues and was sure I had found the cause.

I came in this morning to find that there had been another 15 reboots over the weekend.

So far the only commonality I've found is that this has only happened to our Gen 1 systems (we have 90, and so far 51 of them have had reboots).

Here's a snippet from the cluster log around a VM failure:

0000117c.00002cfc::2016/12/19-07:34:57.705 INFO  [RHS] Resource Virtual Machine Configuration <VM NAME> called SetResourceLockedMode. LockedModeEnabled0, LockedModeReason0.
00000d8c.000029d0::2016/12/19-07:34:57.705 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration <VM NAME>', gen(0) result 0/0.
00000d8c.000029d0::2016/12/19-07:34:57.705 INFO  [RCM] Virtual Machine Configuration epcr-harvardil: Flags 1 removed from StatusInformation. New StatusInformation 0
0000117c.00002cfc::2016/12/19-07:34:57.705 INFO  [RHS] Resource Virtual Machine <VM NAME> called SetResourceLockedMode. LockedModeEnabled0, LockedModeReason0.
00000d8c.000029d0::2016/12/19-07:34:57.705 INFO  [RCM] <VM NAME>: Removed Flags 1 from StatusInformation. New StatusInformation 0
0000117c.00002cfc::2016/12/19-07:34:57.705 INFO  [RES] Virtual Machine <Virtual Machine <VM NAME>>: Current state 'Terminated', event 'VmStopped'
00000d8c.000029d0::2016/12/19-07:34:57.705 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine <VM NAME>', gen(3) result 0/0.
00000d8c.00000944::2016/12/19-07:34:57.705 INFO  [GUM] Node 3: executing request locally, gumId:71035, my action: /dm/update, # of updates: 1
00000d8c.00000f90::2016/12/19-07:34:57.705 INFO  [DM] Starting replica transaction, paxos: 460:460:576650, smartPtr: HDL( 2c83f5f2b0 ), internalPtr: HDL( 2c85294340 )
00000d8c.00000f90::2016/12/19-07:34:57.720 INFO  [DM] Finished replica transaction, paxos: 460:460:576650, smartPtr: HDL( 2c83f5f2b0 ), internalPtr: HDL( 2c85294340 ), status: 0
00000d8c.00000944::2016/12/19-07:34:57.720 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine <VM NAME>', gen(3) result 0/0.
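
To line up VM stops across a whole weekend of logging, it may help to pull out just the lines where a VM resource reports the 'VmStopped' event. The following is only a rough sketch: the line layout (pid.tid::timestamp LEVEL [COMPONENT] message) is inferred from the excerpt above, and the names parse_line, vm_stop_timestamps and SAMPLE_LINES are made up for illustration.

```python
import re

# Layout assumed from the excerpt: pid.tid::timestamp LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_line(line):
    """Split one cluster log line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def vm_stop_timestamps(lines):
    """Collect timestamps of [RES] lines whose message reports event 'VmStopped'."""
    stamps = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["component"] == "RES" and "event 'VmStopped'" in entry["message"]:
            stamps.append(entry["timestamp"])
    return stamps


# Two lines copied from the excerpt above, used only as sample input.
SAMPLE_LINES = [
    "0000117c.00002cfc::2016/12/19-07:34:57.705 INFO  [RES] Virtual Machine "
    "<Virtual Machine <VM NAME>>: Current state 'Terminated', event 'VmStopped'",
    "00000d8c.000029d0::2016/12/19-07:34:57.705 INFO  [RCM] HandleMonitorReply: "
    "LOCKEDMODE for 'Virtual Machine <VM NAME>', gen(3) result 0/0.",
]

if __name__ == "__main__":
    for stamp in vm_stop_timestamps(SAMPLE_LINES):
        print(stamp)
```

Feeding the full log file through vm_stop_timestamps (for example via open(path) as the lines argument) would give a list of stop times that can be compared against the reboot times seen on the VMs.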

The logs are also littered with these SQL errors, which is what eventually led me to update the hardware configurations of the VMs:

00000af0.0000239c::2016/12/19-03:22:41.113 ERR   [RHS] s_RhsRpcCreateResType: (126)' because of 'Error loading resource DLL fssres.dll.'
00000cec.000006f8::2016/12/19-03:22:41.113 INFO  [RCM] result of first load attempt for type SQL Server FILESTREAM Share: 126
000014e0.000026c0::2016/12/19-03:22:41.129 INFO  [RES] Physical Disk: HarddiskpIsPartitionHidden: device \Device\Harddisk2\ClusterPartition2 0
00000af0.0000239c::2016/12/19-03:22:41.238 ERR   [RHS] s_RhsRpcCreateResType: (126)' because of 'Error loading resource DLL hadrres.dll.'
00000cec.00001e88::2016/12/19-03:22:41.238 INFO  [RCM] result of first load attempt for type SQL Server Availability Group: 126
00000af0.0000239c::2016/12/19-03:22:41.254 ERR   [RHS] s_RhsRpcCreateResType: (126)' because of 'Error loading resource DLL fssres.dll.'
00000cec.00001e88::2016/12/19-03:22:41.254 INFO  [RCM] result of first load attempt for type SQL Server FILESTREAM Share: 126
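
To see how widespread these DLL load failures are, the ERR lines can be tallied per DLL and error code. Code 126 is the Win32 error ERROR_MOD_NOT_FOUND ("The specified module could not be found"). This is a minimal sketch that assumes every such message has the same wording as the lines above; count_dll_load_failures and SAMPLE_ERRORS are hypothetical names used only for this example.

```python
import re
from collections import Counter

# Message wording assumed from the excerpt: (CODE)' because of 'Error loading resource DLL NAME.'
DLL_ERR_RE = re.compile(
    r"\((?P<code>\d+)\)' because of 'Error loading resource DLL (?P<dll>[\w.]+)\.'"
)


def count_dll_load_failures(lines):
    """Count resource DLL load failures, keyed by (dll name, error code)."""
    counts = Counter()
    for line in lines:
        match = DLL_ERR_RE.search(line)
        if match:
            counts[(match.group("dll"), int(match.group("code")))] += 1
    return counts


# The three ERR lines from the excerpt above, used only as sample input.
SAMPLE_ERRORS = [
    "00000af0.0000239c::2016/12/19-03:22:41.113 ERR   [RHS] s_RhsRpcCreateResType: "
    "(126)' because of 'Error loading resource DLL fssres.dll.'",
    "00000af0.0000239c::2016/12/19-03:22:41.238 ERR   [RHS] s_RhsRpcCreateResType: "
    "(126)' because of 'Error loading resource DLL hadrres.dll.'",
    "00000af0.0000239c::2016/12/19-03:22:41.254 ERR   [RHS] s_RhsRpcCreateResType: "
    "(126)' because of 'Error loading resource DLL fssres.dll.'",
]

if __name__ == "__main__":
    for (dll, code), count in count_dll_load_failures(SAMPLE_ERRORS).items():
        print(f"{dll} (error {code}): {count}")
```

Comparing the timestamps of these errors with the VM stop times would show whether the two actually occur together or are unrelated noise in the log.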

Any ideas?

