For those of you who’ve managed and implemented Hyper-V infrastructures for your clients, have you experienced cases where clusters become unstable once RAM utilization reaches 70-85% per node?
We had two unplanned outages on our Hyper-V cluster over the last seven days, and in both cases a blue screen on one node resulted in a cascade of blue screens on the other nodes. We’re running a Windows Server 2012 Hyper-V cluster that is currently up to date with Windows Updates, 22 hotfixes, drivers, and firmware. I rebuilt one of the nodes over the weekend, and even the rebuilt node blue screened when it reached 95-98% RAM utilization.
We have two other clusters with similar hardware in our environment that are up to date and running fine, but the load on these clusters is only between 15% and 20% per node.
More info:
Bugchecks 0x00000027 and 0x0000001E from one node lead to bugchecks 0x0000009E on all other nodes.
We've applied as many publicly available hotfixes, Windows Updates, firmware, and drivers as we've been able to find.