I’m in charge of root cause analysis for our company's W2k8 clusters. A couple of months ago I was made aware of a problem where SQL clusters running W2k8 Enterprise occasionally stop responding. When it happens, I can still run cluster commands from the passive nodes, and what I see are groups stuck in an online pending, offline pending, or failed state. I cannot manage the cluster until I terminate clussvc.exe on the active node that owns the resources. Once I do that, the cluster fails over. I’m unable to determine what causes the resources to fail in the first place, because nothing in the event logs indicates a problem before the failure. The only entries are the resource failures themselves, logged as the following event:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/30/2013 5:16:19 PM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: Servername.Domain.name.Net
Description:
Cluster resource 'SQL Server (xxxxxx)' in clustered service or application 'SQLGroupxxxxx' failed.
After this happens, the cluster becomes a zombie until I kill clussvc.exe and force it to fail over.
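For anyone comparing these events across several nodes, here is a minimal Python sketch of how a pasted event record like the one above could be split into fields and checked for a 1069 resource failure. It assumes the record is plain "Key: value" text in the layout shown in this post. The function names `parse_event` and `failed_resource` are made up for illustration and are not part of any Microsoft tool.

```python
import re

# Sample record copied from the event quoted in this post (names anonymized).
SAMPLE_EVENT = """\
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 7/30/2013 5:16:19 PM
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Description:
Cluster resource 'SQL Server (xxxxxx)' in clustered service or application 'SQLGroupxxxxx' failed.
"""


def parse_event(text):
    """Split a pasted event record into a dict of field name -> value.

    Assumption: every field is a 'Key: value' line, and everything after
    the 'Description:' line is the description text.
    """
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        key = key.strip()
        if key == "Description":
            fields[key] = "\n".join(lines[i + 1:]).strip()
            break
        fields[key] = value.strip()
    return fields


def failed_resource(fields):
    """Return (resource, group) for an Event ID 1069 record, else None."""
    if fields.get("Event ID") != "1069":
        return None
    match = re.search(
        r"Cluster resource '(.+?)' in clustered service or application '(.+?)' failed",
        fields.get("Description", ""),
    )
    return match.groups() if match else None


if __name__ == "__main__":
    print(failed_resource(parse_event(SAMPLE_EVENT)))
```

This only tells you which resource and group failed and when. It does not explain why, which is still the open question here.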
I’ve opened a case with Microsoft. Complete crash dumps have been enabled and collected with NotMyFault, but dump analysis has not yet produced a root cause. The case is now in the hands of Microsoft level 3 support.
-SluggoMagoo-