Hi All,
I've recently built a 4-node Windows Server 2016 S2D cluster and have run into major issues with disk performance:
barely getting kb/s throughput (yep, kilo and a small b: dial-up modem speeds for disk access)
VMs are unresponsive
multiple other issues associated with disk access to the CSVs
The hardware is all certified and as per Lenovo's most recent guidelines. The servers are ThinkSystem SR650, the networking is 100Gb/s with 2x Mellanox ConnectX-4 adapters per node and 2x Lenovo NE10032 switches, and each node has 12x Intel SSDs and 2x Intel NVMe drives for the storage pool. RoCE/RDMA, DCB etc. are all configured as per the guidelines and verified (as far as I can diagnose). It should be absolutely flying along.
I should point out that it was working OK (though with no thorough testing done) for approx. 1 week. The VMs (about 10 or so) were running fine, and any file transfers that were performed were limited by the connectivity to the file share source (on older equipment served by a 10Gb/s switch uplink and 1Gb/s NIC connections at the source).
About 3pm yesterday I decided to configure Cluster-Aware Updating, and this may or may not have been a factor. The servers were already fully patched with the exception of 2 updates: KB4284833 and a definition update for Defender. These were installed and a manual reboot was performed one node at a time. Ever since, I've had blue screens, nodes/pools/CSVs failing over and almost non-existent disk throughput. There are no other significant errors in the event logs. There have been cluster alerts as things go down, but nothing that has led to a Google/Bing search for a solution. The immediate thought is going to be "it was KB4284833 what done it", but I'm not certain that is the cause.
Interestingly, when doing a file copy to/from the CSV volumes there is an initial spurt of disk throughput (but nowhere near as fast as it should be: say up to 100MB/s, but it could equally be as low as 7MB/s) and then it dies off to kB/s and effectively 0. So it looks like there is some sort of cache that is working to some extent, and then nothing.
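A rough sketch of how that drop-off could be put into numbers, assuming Python is available on one of the nodes. The function name, chunk sizes and target path are made up for illustration and would need to point at a folder on the CSV volume being tested. It writes fixed-size chunks, forces each one to disk, and reports MB/s per chunk, so a burst followed by a collapse should show up as a falling series.

```python
import os
import tempfile
import time


def measure_write_throughput(path, total_mb=64, chunk_mb=4):
    """Write total_mb of data to path in chunk_mb pieces; return MB/s per chunk."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    samples = []
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data out of the OS file cache
            elapsed = time.perf_counter() - start
            samples.append(chunk_mb / max(elapsed, 1e-9))
    return samples


if __name__ == "__main__":
    # Hypothetical target: replace with a folder on the CSV volume under test.
    target = os.path.join(tempfile.gettempdir(), "throughput_probe.bin")
    try:
        for i, mbps in enumerate(measure_write_throughput(target)):
            print(f"chunk {i:3d}: {mbps:8.1f} MB/s")
    finally:
        if os.path.exists(target):
            os.remove(target)
```

Increasing total_mb well beyond the size of any cache in the path should make the point where throughput falls away easier to see.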
I've been doing a lot of research for the past 24 hours or so, with no smoking guns. I did find someone with similar issues that were traced back to the power mode settings. I've since set these to High Performance (rather than the default Balanced) but have seen no change (it might be worth another reboot to double-check this though; I will do that shortly).
Any suggestions or similar experience?
Thanks for any help.