I am building a 6-node cluster: 12x 6TB drives, 2x 4TB Intel P4600 PCIe NVMe drives, Xeon Platinum 8168, 768GB RAM, and an LSI 9008 HBA.
The cluster passes all tests, the switches are properly configured, and the cluster works well, exceeding 1.1 million IOPS with VMFleet. However, at the current patch level as of today (April 18, 2018), I am experiencing the following scenario:
With no storage job running and all vdisks listed as healthy, I pause a node and drain it, and all is well until the server is actually rebooted or taken offline. At that point a repair job is initiated and IO suffers badly; it can even stop altogether, causing vdisks to go into a paused state due to IO timeout (listed as the reason in the cluster events). Exacerbating the issue, when the paused node reboots and rejoins, it causes the repair job to suspend, stop, then restart (it seems; tracking this is hard, as all storage commands become unresponsive while the node is joining). At this point IO is guaranteed to stop on all vdisks at some point, for long enough to cause problems, including VM reboots. (The sequence I am running is sketched at the end of this post.)

The cluster was initially formed using VMM 2016. I have also tried manually creating the vdisks, using both single resiliency (3-way mirror) and multi-tier resiliency, with the same effect (roughly the New-Volume commands also shown below). This behavior was not observed when I did my POC testing last year.

Frankly, this is a deal breaker and makes the cluster unusable: if I cannot reboot a single node without stopping my entire workload, I cannot deploy. I'm hoping someone has some info. I'm going to re-install with Server 2016 RTM media, keep it unpatched, and see if the problem remains. However, it would be desirable to at least start the cluster at full patch. Any help appreciated. Thanks.
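For reference, this is roughly the sequence I run, sketched in PowerShell. The node name "Node01" is just a placeholder, and the checks are the standard Get-StorageJob / Get-VirtualDisk calls, nothing exotic:

```powershell
$node = "Node01"   # placeholder node name

# 1. Confirm no storage jobs are running and every vdisk is healthy
Get-StorageJob | Where-Object { $_.JobState -ne 'Completed' }
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus

# 2. Pause and drain the node -- everything still looks fine at this point
Suspend-ClusterNode -Name $node -Drain -Wait

# 3. Reboot the node -- this is when the repair job kicks off and IO suffers
Restart-Computer -ComputerName $node -Force

# 4. In a second session, watch the repair job and vdisk state (Ctrl+C to stop);
#    in my case these calls hang for long stretches while the node is rejoining
while ($true) {
    Get-StorageJob  | Format-Table Name, JobState, PercentComplete, BytesProcessed
    Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
    Start-Sleep -Seconds 10
}

# 5. Once the node is back up, resume it
Resume-ClusterNode -Name $node -Failback Immediate
```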
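And this is roughly what I mean by creating the vdisks manually; the pool name, volume names, sizes, and tier names here are placeholders rather than my exact values:

```powershell
# Single resiliency: 3-way mirror (the default mirror setting with 6 nodes)
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "Mirror01" `
    -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror -Size 2TB

# Multi-tier resiliency (mirror + parity tiers)
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "MultiTier01" `
    -FileSystem CSVFS_ReFS `
    -StorageTierFriendlyNames Performance, Capacity `
    -StorageTierSizes 1TB, 4TB
```

Both variants show the same behavior during the node reboot.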