Apologies for the long description, but I’m looking for help or advice with a storage issue I’m investigating on a 4-node Windows Server 2008 R2 SP1 failover cluster (HP servers, Emulex HBAs, EMC PowerPath, EMC Clariion CX4-960 disk array) that's proving interesting and frustrating in equal measure.
With all four cluster nodes up and the failover cluster fully operational, I can fail over services and applications between cluster nodes without the cluster or operating system reporting any issues.
When the cluster is in this state:
- If I generate a Failover Cluster Validation Report and run all tests (including the storage tests), the report only contains a small number of non-critical warnings.
- If I examine the details of any of the shared cluster disks on any of the nodes using DISKPART, this is what I see (this may not seem important now, but please bear with me):
PowerDevice by PowerPath
Disk ID : <ID> or {<GUID>}
Type : FIBRE
Status : Reserved
Path : 0
Target : 0
LUN ID : <LUN>
Location Path : UNAVAILABLE
Current Read-only State: Yes
Read-only : Yes
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No
Clustered Disk : Yes
FYI - The cluster uses a mix of MBR and GPT-based cluster disks presented to all nodes via the shared EMC Clariion CX4-960 disk array, SAN fabric, and Emulex HBAs.
However, if I restart one of the cluster nodes, then the affected node begins endlessly cycling through startup -> BSOD -> restart.
The time between startup and BSOD is approximately 25 minutes.
If I generate the cluster log (Cluster.log) and check the 25 minutes or so prior to each restart, I see the same entries every time:
2015/08/10-11:47:59.000 ERR [RHS] RhsCall::DeadlockMonitor: Call OPENRESOURCE timed out for resource 'NODE_DATA'.
2015/08/10-11:47:59.000 ERR [RHS] RhsCall::DeadlockMonitor: Call OPENRESOURCE timed out for resource 'NODE_FLASH'.
2015/08/10-11:47:59.000 INFO [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
2015/08/10-11:47:59.000 ERR [RHS] Resource NODE_FLASH handling deadlock. Cleaning current operation and terminating RHS process.
2015/08/10-11:47:59.000 INFO [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
2015/08/10-11:47:59.000 ERR [RHS] Resource NODE_DATA handling deadlock. Cleaning current operation and terminating RHS process.
2015/08/10-11:47:59.000 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'NODE_FLASH', gen(0) result 4.
2015/08/10-11:47:59.000 ERR [RHS] About to send WER report.
2015/08/10-11:47:59.000 INFO [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'NODE_FLASH' consecutive failure count 1.
2015/08/10-11:47:59.000 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'NODE_DATA', gen(0) result 4.
2015/08/10-11:47:59.000 INFO [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'NODE_DATA' consecutive failure count 1.
2015/08/10-11:47:59.000 ERR [RHS] About to send WER report.
2015/08/10-11:47:59.075 ERR [RHS] WER report is submitted. Result : WerReportQueued.
2015/08/10-11:47:59.078 ERR [RHS] WER report is submitted. Result : WerReportQueued.
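For anyone wanting to pull the same information out of their own Cluster.log, here is a rough Python sketch that lists which resources timed out on which call, plus any RHS watchdog timeouts. The regular expressions are based only on the lines quoted above, and the function name `summarise` is my own, so treat it as a starting point rather than a general cluster log parser.

```python
import re
from collections import defaultdict

# Three lines copied from the Cluster.log excerpt above.
SAMPLE = """\
2015/08/10-11:47:59.000 ERR [RHS] RhsCall::DeadlockMonitor: Call OPENRESOURCE timed out for resource 'NODE_DATA'.
2015/08/10-11:47:59.000 ERR [RHS] RhsCall::DeadlockMonitor: Call OPENRESOURCE timed out for resource 'NODE_FLASH'.
2015/08/10-11:47:59.000 INFO [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
"""

# Patterns derived from the quoted lines only (an assumption about the format).
TIMED_OUT = re.compile(r"Call (\w+) timed out for resource '([^']+)'")
WATCHDOG = re.compile(r"termination watchdog with timeout (\d+)")


def summarise(log_text):
    """Return ({resource: {timed-out calls}}, {watchdog timeout values})."""
    calls = defaultdict(set)
    watchdogs = set()
    for line in log_text.splitlines():
        timed_out = TIMED_OUT.search(line)
        if timed_out:
            calls[timed_out.group(2)].add(timed_out.group(1))
        watchdog = WATCHDOG.search(line)
        if watchdog:
            watchdogs.add(int(watchdog.group(1)))
    return dict(calls), watchdogs


if __name__ == "__main__":
    calls, watchdogs = summarise(SAMPLE)
    for resource, names in sorted(calls.items()):
        print(resource, "timed out on", ", ".join(sorted(names)))
    print("watchdog timeouts:", sorted(watchdogs))
```

To run it against a real log, read the file into a string and pass it to `summarise` in place of `SAMPLE`.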
And if I check the System Event Log for the same period, I see:
- Several Event 118, elxstor, “The driver for the device \Device\RaidPort1 performed a bus reset upon request.” Warning messages
- Immediately followed by an Event 1230, FailoverClustering, “Cluster resource ‘NODE_DATA’ (resource type ‘’, DLL ‘clusres.dll’) either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.” Error message
- Immediately followed by an Event 1230, FailoverClustering, “Cluster resource ‘NODE_FLASH’ (resource type ‘’, DLL ‘clusres.dll’) either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.” Error message
NOTE - NODE_DATA and NODE_FLASH are GPT-based cluster disks.
What is interesting is not what is happening, but why!
In terms of the “what” I believe that, at the failover clustering level, the startup -> BSOD -> restart behaviour is a result of the following:
- RHS calls an entry point to resources NODE_DATA and NODE_FLASH;
- RHS waits DeadlockTimeout (5 minutes) for the resources to respond;
- The resources do not respond, so the Cluster Service (ClusSvc) terminates the RHS process to recover from the unresponsive resources;
- The Cluster Service (ClusSvc) waits DeadlockTimeout x 4 (20 minutes) for the RHS process to terminate;
- Since the RHS process does not terminate, the Cluster Service (ClusSvc) calls NetFT to bugcheck the node to recover from the RHS termination failure;
- NetFT bugchecks the node with a STOP.
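As a sanity check on that sequence, the timeouts add up to the roughly 25 minutes I see between startup and BSOD. A small Python sketch of the arithmetic follows. It assumes the watchdog value of 1200000 in the cluster log is in milliseconds, which is my reading of the log rather than something I have confirmed.

```python
from datetime import timedelta

# DeadlockTimeout as described in the steps above.
DEADLOCK_TIMEOUT = timedelta(minutes=5)

# "Enabling RHS termination watchdog with timeout 1200000" from the cluster log.
# Assumption: the unit is milliseconds.
WATCHDOG_TIMEOUT = timedelta(milliseconds=1_200_000)

rhs_wait = DEADLOCK_TIMEOUT              # RHS waits for the resources to respond
termination_wait = DEADLOCK_TIMEOUT * 4  # ClusSvc waits for RHS to terminate
total = rhs_wait + termination_wait

print("wait for RHS termination:", termination_wait)
print("matches logged watchdog: ", termination_wait == WATCHDOG_TIMEOUT)
print("startup to bugcheck:     ", total)
```

If the millisecond assumption holds, the logged watchdog (20 minutes) equals DeadlockTimeout x 4, and adding the initial 5 minute wait gives the 25 minute cycle.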
But why aren’t the NODE_DATA and NODE_FLASH cluster resources responding to the OPENRESOURCE calls?
After a lot of digging around Windows Event Logs and cluster logs, I decided to check the status of the various cluster disks from the perspective of both the working cluster nodes and the failing cluster node.
What I saw when I examined the details of the shared cluster disks on each node using the DISKPART utility (also backed up by what I was seeing in the Disk Management MMC) was as follows:
Working Cluster Node (MBR-based disks):
PowerDevice by PowerPath
Disk ID : <ID>
Type : FIBRE
Status : Reserved
Path : 0
Target : 0
LUN ID : <LUN>
Location Path : UNAVAILABLE
Current Read-only State: Yes
Read-only : Yes
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No
Clustered Disk : Yes
Working Cluster Node (GPT-based disks):
PowerDevice by PowerPath
Disk ID : {<GUID>}
Type : FIBRE
Status : Reserved
Path : 0
Target : 0
LUN ID : <LUN>
Location Path : UNAVAILABLE
Current Read-only State: Yes
Read-only : Yes
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No
Clustered Disk : Yes
Failing Cluster Node (MBR-based disks):
PowerDevice by PowerPath
Disk ID : <ID>
Type : FIBRE
Status : Reserved
Path : 0
Target : 0
LUN ID : <LUN>
Location Path : UNAVAILABLE
Current Read-only State: Yes
Read-only : Yes
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No
Clustered Disk : Yes
Failing Cluster Node (GPT-based disks):
PowerDevice by PowerPath
Disk ID : 00000000
Type : FIBRE
Status : Offline
Path : 0
Target : 0
LUN ID : <LUN>
Location Path : UNAVAILABLE
Current Read-only State: Yes
Read-only : Yes
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No
Clustered Disk : No
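In other words, the failing node does not appear to recognise the GPT-based cluster disks as clustered disks (or even as initialised disks): the Disk ID is all zeros, the Status is Offline rather than Reserved, and Clustered Disk is No.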
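To make the comparison less error-prone across many disks, the DISKPART output can be diffed mechanically. The Python sketch below parses "Field : Value" lines into a dictionary and reports only the fields that differ between two nodes. The helper names `parse_detail_disk` and `diff_fields` are my own, and the samples are abridged from the GPT disk output above.

```python
# Abridged DISKPART output for a GPT-based disk on a working node.
WORKING_GPT = """\
PowerDevice by PowerPath
Disk ID : {<GUID>}
Type : FIBRE
Status : Reserved
LUN ID : <LUN>
Clustered Disk : Yes
"""

# Abridged DISKPART output for the same kind of disk on the failing node.
FAILING_GPT = """\
PowerDevice by PowerPath
Disk ID : 00000000
Type : FIBRE
Status : Offline
LUN ID : <LUN>
Clustered Disk : No
"""


def parse_detail_disk(text):
    """Turn 'Field : Value' lines into a dict; lines without ':' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_fields(working, failing):
    """Return {field: (working value, failing value)} for differing fields."""
    keys = sorted(set(working) | set(failing))
    return {
        key: (working.get(key), failing.get(key))
        for key in keys
        if working.get(key) != failing.get(key)
    }


if __name__ == "__main__":
    delta = diff_fields(parse_detail_disk(WORKING_GPT),
                        parse_detail_disk(FAILING_GPT))
    for field, (good, bad) in delta.items():
        print(f"{field}: working={good!r} failing={bad!r}")
```

For the samples above, the only differing fields are Clustered Disk, Disk ID and Status.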
Once in this failed state, the following process seems to allow the failing cluster node to join the cluster:
- Using cluster.exe or Failover Cluster Manager, take ALL GPT-based cluster disks offline (if ANY of the GPT-based cluster disks are online when the Cluster Service on the failing node is restarted in the last step below, the failing node returns to its cycle of startup -> BSOD -> restart);
- Set the startup type of the Cluster Service (ClusSvc) on the failing node to Manual, then wait for the failing node to restart;
- Rescan storage on the failing cluster node;
- Restart the Cluster Service (ClusSvc) on the failing node.
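The steps above can be written down as a dry-run plan. The Python sketch below only builds and prints the command lines and does not execute anything. The node name `NODE4` and the script file `rescan.txt` (a one-line DISKPART script containing `rescan`) are hypothetical, and the exact commands are my assumption of how each step could be done from the command line, so check them against your own environment before running any of them.

```python
def build_plan(gpt_disks, node):
    """Return the recovery steps as a list of argument lists (dry run only)."""
    # Step 1: take every GPT-based cluster disk offline.
    plan = [["cluster.exe", "resource", disk, "/offline"] for disk in gpt_disks]
    # Step 2: set ClusSvc on the failing node to Manual start, then wait
    # for the node to finish its restart (the wait itself is manual).
    plan.append(["sc.exe", "\\\\" + node, "config", "ClusSvc", "start=", "demand"])
    # Step 3: rescan storage, run locally on the failing node.
    # rescan.txt is a hypothetical one-line DISKPART script: "rescan".
    plan.append(["diskpart.exe", "/s", "rescan.txt"])
    # Step 4: start the Cluster Service, run locally on the failing node.
    plan.append(["net.exe", "start", "ClusSvc"])
    return plan


if __name__ == "__main__":
    # NODE_DATA and NODE_FLASH are the GPT disks named in the post;
    # NODE4 is a placeholder node name.
    for command in build_plan(["NODE_DATA", "NODE_FLASH"], "NODE4"):
        print(" ".join(command))
```

Keeping the plan as data makes it easy to review the order of operations, which matters here because the GPT disks must be offline before the Cluster Service is started.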
I’ve tried to reproduce this issue on another Windows Server 2008 R2 SP1 failover cluster (same patch level as the issue cluster, with the same HBAs, MPIO software etc., and using a similar mix of MBR and GPT-based cluster disks presented to all nodes via the same shared EMC Clariion CX4-960 disk array and SAN fabric), but I just can’t get the new cluster to exhibit the same behaviour.
NOTES -
- This issue only seems to occur after a node restarts (i.e. if I stop then re-start the Cluster Service on any particular node, then the issue does not appear).
- This issue doesn’t appear to be node-specific (i.e. the same issue occurs irrespective of which node is restarted).
What I’d like to know is:
What is / could be preventing the failing node from recognising the GPT-based cluster disks as cluster disks?
Currently, my working assumption is that if I can answer this question and solve this issue, the blocker to the OPENRESOURCE call succeeding will be removed, and the Cluster Service on the failing node will be able to restart following a server crash / restart.
Any help, advice etc. appreciated.