I had a lab environment setup that works great, passes validation, can do live migrations without issue but as soon as I restarted one of the nodes, the then still live node became the only node able to access the storage backend. What's weird is that the restarted node can still access the CSV storage and run VMs off of it, but the validation report is unable to list the actual disks.
My Cluster consists of 2 nodes. I have an iSCSI backed shared storage server and I can see that both of my nodes are connected to the iSCSI targets successfully, but the node I first restarted no longer lists any disks/volumes in disk management and the once available MPIO menus are disabled in the iSCSI control panel. I also tried to restart the second node after the first node came back but although the first node was up and running and had VMs on it, restarting the second node brought the entire cluster down. I see event IDs 1177, 1573, and 1069 appear in the Cluster Events log. When the second node came back up, the cluster came back with it, but not the storage. Both nodes seem to display similar behavior in that they cannot access the storage backend. Now the storage is inaccessible by both nodes. I was able to get both nodes connected to the storage backend by going to the iscsicpl and disconnecting all current connections to the iSCSI backend and adding them back. Doing the test again after bringing the storage back up resulted in the same behavior and this time redoing the iSCSI connections is not helping.
I think the issue here is that the first node I restarted is unable to see any disks/volumes from the storage backend only after joining the cluster and doing a restart. Before joining the cluster I did reboots on both nodes and both were able to connect to the iSCSI backend without issue. It wasn't until after joining the cluster that node 1 became unable to access the storage backend after reboots. The validation report fails with "No disks were found on which to perform cluster validation tests. To correct this, review the following possible causes: ..." although none of the suggestions seem applicable and the validation report was successful right before the restart of the node.
Does anyone have suggestions on how to further troubleshoot or resolve this issue?
I am using Hyper-V Server 2012 R2 on both nodes and they are joined to the same domain.