So, we've been having issues with one of our clusters. Yesterday evening, when no one was working, a bunch of VMs went down. I found errors in a couple of event logs suggesting that the CSV failed, but I can't find any indication as to why. My storage appliance has no record of any problems at that time, and I can't find any other possible cause apart from a problem within the cluster itself.
All six nodes are running up-to-date Server 2012 R2 and are managed by SCVMM 2012 R2, which runs on a virtual machine hosted by another cluster. My storage is a Tegile ZEBI unit, and I've thin provisioned 20TB of disk space. The disk is accessed over iSCSI on NICs and switches that are separate from the normal cluster/VM traffic.
Below are the errors, along with a screenshot of an "unknown" volume listed under my CSV, which seems odd. In Failover Cluster Manager, under Storage\Disks, after selecting my CSV I see two volumes listed in the bottom pane:
In Failover Cluster Manager, I found this error:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 2014-12-16 6:41:33 PM
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Keywords:
User: SYSTEM
Computer: CLUSTERHOST4.DOMAIN.INTERNAL
Description:
Cluster Shared Volume 'Volume 1' ('CSV') has entered a paused state because of '(c000000b5)'. All I/O will temporarily be queued until a path to the volume is reestablished.
I went to the node that owned the CSV, and in its event log I found this error:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 2014-12-16 6:48:22 PM
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: CLUSTERHOST1.DOMAIN.INTERNAL
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'CSV' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
Then this error:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 2014-12-16 6:59:39 PM
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: CLUSTERHOST1.DOMAIN.INTERNAL
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Then this error:
Log Name: System
Source: Microsoft-Windows-Ntfs
Date: 2014-12-16 7:01:03 PM
Event ID: 140
Task Category: None
Level: Warning
Keywords: (8)
User: SYSTEM
Computer: CLUSTERHOST1.DOMAIN.INTERNAL
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: VirtualMachines, DeviceName: \Device\HarddiskVolume7.
(A device which does not exist was specified.)
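To make the sequence easier to follow, here is a minimal Python sketch that puts the four events above on one timeline and shows the gap between each. The timestamps, event IDs, and host names are copied from the log entries; the short labels and the function name `build_timeline` are just my own shorthand, not anything from the cluster tooling.

```python
from datetime import datetime

# Timestamps, event IDs and hosts copied from the log entries above.
# The label text is shorthand for illustration, not part of the logs.
EVENTS = [
    ("2014-12-16 6:41:33 PM", 5120, "CLUSTERHOST4", "CSV entered paused state"),
    ("2014-12-16 6:48:22 PM", 1230, "CLUSTERHOST1", "Physical Disk resource timed out"),
    ("2014-12-16 6:59:39 PM", 1146, "CLUSTERHOST1", "RHS process terminated"),
    ("2014-12-16 7:01:03 PM", 140, "CLUSTERHOST1", "NTFS failed to flush transaction log"),
]

# Matches the "Date:" field format used in the pasted events.
FORMAT = "%Y-%m-%d %I:%M:%S %p"


def build_timeline(events):
    """Return (time, event_id, host, label, seconds_since_previous) rows, sorted by time."""
    parsed = sorted(
        (datetime.strptime(ts, FORMAT), eid, host, label)
        for ts, eid, host, label in events
    )
    rows = []
    previous = None
    for when, eid, host, label in parsed:
        gap = 0 if previous is None else int((when - previous).total_seconds())
        rows.append((when, eid, host, label, gap))
        previous = when
    return rows


if __name__ == "__main__":
    for when, eid, host, label, gap in build_timeline(EVENTS):
        print(f"{when:%H:%M:%S}  +{gap:>4}s  {host}  Event {eid}: {label}")
```

Going by those timestamps, 19 minutes and 30 seconds passed between the CSV pausing on CLUSTERHOST4 and the NTFS warning on CLUSTERHOST1.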