Hello there,
I've run into a strange situation with our failover cluster after a recent planned outage for Windows updates.
We have a three-node cluster with one CSV. It had been set up for a while with no issues until the recent planned outage, when pausing Node2 to drain its roles brought the problem to light.
Currently Node2 is the CSV owner, but if I pause this node or attempt to manually move the CSV to Node1 or Node3, the CSV fails with the error "0x1 Incorrect Function" and fails back to Node2 again (or just goes offline if Node2 is paused).
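For reference, the pause and move are roughly equivalent to the following PowerShell (assuming the CSV name matches the physical disk resource name shown in the cluster log below):

    Suspend-ClusterNode -Name Node2 -Drain
    Move-ClusterSharedVolume -Name "SAN Data" -Node Node1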
I generated cluster logs while attempting the move and noticed it fails to read the reservation on the disk. Below is a snippet of the area where I believe the problem is occurring (the command used to pull the log is noted after the snippet):
00000c84.00001ca8::2016/08/21-10:04:38.026 INFO [RCM] TransitionToState(SAN Data) OnlineCallIssued-->OnlinePending.
00001734.00001554::2016/08/21-10:04:38.026 INFO [RES] Physical Disk <SAN Data>: ResHardDiskArbitrateInternal request Not a Space: Uses FastPath
00001734.00001554::2016/08/21-10:04:38.026 INFO [RES] Physical Disk <SAN Data>: ResHardDiskArbitrateInternal: Clusdisk driver handle or event handle is NULL.
00001734.00001554::2016/08/21-10:04:38.027 INFO [RES] Physical Disk <SAN Data>: HardDiskpQueryDiskFromStm: ClusterStmFindDisk returned device='\\?\mpio#disk&ven_lefthand&prod_iscsidisk&rev_a500#1&7f6ac24&0&363030454233373344433534343531303030303030303233#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}'
00001734.00001554::2016/08/21-10:04:38.047 INFO [RES] Physical Disk <SAN Data>: SetDiskInfo(1)
00001734.00001554::2016/08/21-10:04:38.047 INFO [RES] Physical Disk <SAN Data>: Arbitrate - Node using PR key 25f8bf3b0003734d
00001734.00001554::2016/08/21-10:04:38.047 ERR [RES] Physical Disk <SAN Data>: Failed to read reservation on the disk, status 1
00001734.00001554::2016/08/21-10:04:38.047 ERR [RES] Physical Disk <SAN Data>: ResHardDiskArbitrateInternal: PR Arbitration for disk Error: 1.
00001734.00001554::2016/08/21-10:04:38.048 ERR [RES] Physical Disk <SAN Data>: OnlineThread: Unable to arbitrate for the disk. Error: 1.
00001734.00001554::2016/08/21-10:04:38.048 ERR [RES] Physical Disk <SAN Data>: OnlineThread: Error 1 bringing resource online.
00001734.00001554::2016/08/21-10:04:38.048 INFO [RES] Physical Disk <SAN Data>: HardDiskpSetUnsetDiskFlags(mask=0x00000002, SetCluster=0, SetCsv=0,
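For reference, the log above was pulled with something along these lines (the destination path and time window are just examples):

    Get-ClusterLog -Destination C:\Temp -TimeSpan 15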
The only recent change was an increase in the CSV's capacity: the extra space was assigned on the SAN and the volume was then extended using DISKPART on the CSV owner. However, I did notice that DISKPART reports the drive as 8TB when it is actually 9TB, whereas Windows Disk Management and Failover Cluster Manager both report it as 9TB.
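The extension itself was done roughly as below on the CSV owner (the volume number here is only an example; the real one came from "list volume"):

    DISKPART> list volume
    DISKPART> select volume 3
    DISKPART> extend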
The iSCSI initiators on all of the nodes are connected to the SAN and the volume is visible in Disk Management on each node (once the CSV is up on Node2, they will run the VMs from it without issue).
I have tried taking the cluster offline and rebooting all of the servers, which had no impact.
To retrace the recent steps and test functionality, I created another small volume on the SAN and added it as an additional CSV; it will happily move between all nodes.
I then also increased the size of this volume on the SAN and extended it with DISKPART using the same method, again with no issues.
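The test volume was added roughly along these lines (the disk resource name is an example and may differ from the actual one):

    Get-ClusterAvailableDisk | Add-ClusterDisk
    Add-ClusterSharedVolume -Name "Cluster Disk 2"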
I saw another thread on TechNet where someone had a similar issue after a power failure (though none of their nodes would connect to the CSV), and running the PowerShell command Clear-ClusterDiskReservation helped in their case. Sadly, it did nothing to help in this scenario.
Running this command from Node2 (the current CSV owner) completed without any error; running it from Node1 or Node3 returned an error similar to the one seen when moving the disk, stating "Incorrect function".
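The command was run locally on each node in turn, roughly as follows (the disk number is an example taken from Disk Management):

    Clear-ClusterDiskReservation -Disk 1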
I have also tried validating the failover cluster. When validating with the problem CSV included, it fails on Node1 and cancels the rest of the storage tests; however, if I only test the small test CSV I created, all the tests pass without issue.
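The validation runs were roughly equivalent to the following, first with everything included and then against just the small test CSV (the disk resource name is an example):

    Test-Cluster -Node Node1,Node2,Node3
    Test-Cluster -Node Node1,Node2,Node3 -Disk "Cluster Disk 2"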
I haven't tried removing the CSV and re-adding it yet, as we're currently at a crucial time of year and I can't risk it not coming back online if this fails.
If anyone has any suggestions on what to try or where to look for answers, it would be appreciated.