Hi all, I am having an issue hopefully someone can help me with. I have recently inherited a 2 node cluster, both nodes are one half of an ASUS RS702D-E6/PS8 so both nodes should be near identical. They are both running Hyper-V Server 2008 R2 hosting some 14 VM's.
Each node is hooked up via cat5e to a PromiseVessRAID 1830i via iSCSI using one of the servers onboard NICs each, whose cluster network is setup as Disabled for cluster use (the way I think it is supposed to be not the way I had originally inherited it) on it's own Class A Subnet and on it's own private physical switch...
The SAN hosts a 30GB CSV Witness Disk and 2 2TB CSV Volumes, one for each node labeled Volume1 and Volume2. Some VHD's on each.
The Cluster Clients connect to the rest of the company via the Virtual ExternalNIC adapters created in Hyper-V manager but physically are off of Intel ET Dual Gigabit adapters wired into our main core switch which is set up with class c subnets.
I also have a crossover cable wired up running to the other ports on the Intel ET Dual Port NICs using yet a third Class B Subnet and is configured in the Failover Cluster Manger as internal so there are 3 ipv4 Cluster networks total.
Even though the cluster passes the validation tests with flying colors I am not convinced all is well. With Hyperv1 or node 1, I can move the CSV's and machines over to hyperv2 or node 2, stop the cluster service on 1 and perform maintenance such as a reboot or install patches if needed. When it reboots or I restart the cluster service to bring it back online, it is well behaved leaving hyperv2 the owner of all 3 CSV's Witness, Volume 1 and 2. I can then pass them back or split them up any which way and at no point is cluster service interrupted or noticed by users, duh I know this is how it is SUPPOSED to work but...
if I try the same thing with Node 2, that is move the witness and volumes to node 1 as owner and migrate all VM's over, stop cluster service on node 2, do whatever I have to do and reboot, as soon as node 2 tries to go back online, it tries to snatch volume 2 back, but it never succeeds and then the following error is logged in cluster event log:
Hyperv1
Event ID: 5120
Source: Microsoft-Windows-FailoverClustering
Task Category: Cluster Shared Volume
The listed message is:Cluster Shared Volume 'Volume2' ('HyperV1 Disk') is no longer available on this node because of 'STATUS_MEDIA_WRITE_PROTECTED(c00000a2)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Followed 4 seconds later by:
Hyperv1
event ID: 1069
Source: Microsoft-Windows-FailoverClustering
Task Catagory: Resource Control Manager
Message: Cluster Resource 'Hyperv1 Disk in clustered service or application '75d88aa3-8ecf-47c7-98e7-6099e56a097d' failed.
- AND -
2 of the following:
Hyperv1
event ID: 1038
Source: Microsoft-Windows-FailoverClustering
Task Catagory: Physical Disk Resource
Message: Ownership of cluster disk 'HyperV1 Disk' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
Followed 1 second later by another 1069 and then various machines are failing messages.
If you browse to\\hyperv-1\c$\clusterstorage\ or\\hyperv-2\c$\Clusterstorage\, Volume 2 is indeed missing!!
This has caused me to panic a few times as the first time I saw this I thought everything was lost but I can get it back by stopping the service on node 1 or shutting it down, restarting node 2 or the service on node 2 and waiting forever for the disk to list as failed and then shortly thereafter it comes back online. I can then boot node 1 back up and let it start servicing the cluster again. It doesn’t pull the same craziness node 2 does when it comes online; it leaves all ownership with 2 unless I tell I to move.
I am very new to clusters and all I know at this point is this is pretty cool stuff but basically if it is running don’t mess with it is the attitude I have taken with it but there is a significant amount of money tied up in this hardware and we should be able to leverage this as needed, not wonder if it is going to act up again.
To me it seems for a ‘failover’ cluster it should be way more robust than this...
I can go into way more detail if needed but I didn’t see any other posts on this specific issue no matter what forum I scoured. I’m obviously looking for advice on how to get this resolved as well as advice on whether or not I wired the cluster networks correctly. I am also not sure about what protocols are bound to what nics anymore and what the binding order should be, could this be what is causing my issue?
I have NVSPBIND and NVSPSCRUB on both boxes if needed.
Thanks!
-LW