Channel: High Availability (Clustering) forum
Viewing all 5654 articles

Collecting Cluster Performance Data

I’ve been using Windows Admin Center to view performance data while I run different types of workloads against VMs in a cluster, recording the results. It works well visually, but when I run a workload for a fixed amount of time, the recorded numbers can be skewed depending on when the snapshot was taken. Results that show a high, low, and average for each counter would be better for comparing runs. I'm looking to collect the obvious counters: CPU, memory, IOPS, latency, throughput, etc.

I’m looking for an efficient way to collect this from the nodes of a cluster running Hyper-V and S2D. Should I run PerfMon on all the nodes, or is there a more efficient approach using something like Get-ClusterPerformanceHistory? Is there anything else I'm missing?


T.J.
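For what it's worth, on Windows Server 2019 and later the built-in cluster performance history can be reduced to exactly the high/low/average numbers described above. A sketch, assuming the S2D performance history feature is enabled and the series/parameter names match your build (check the cmdlet's documentation for your version):

```powershell
# Sketch: summarise the last hour of per-node CPU history as min/max/average
# instead of a point-in-time snapshot (requires Windows Server 2019+ with
# S2D performance history; series name is an example from the feature docs).
Get-ClusterNode | ForEach-Object {
    $samples = $_ | Get-ClusterPerformanceHistory -ClusterNodeSeriesName "ClusterNode.Cpu.Usage" -TimeFrame LastHour
    $stats   = $samples | Measure-Object -Property Value -Minimum -Maximum -Average
    [pscustomobject]@{
        Node    = $_.Name
        Min     = $stats.Minimum
        Max     = $stats.Maximum
        Average = [math]::Round($stats.Average, 2)
    }
}
```

The same pattern should apply to volumes (latency, IOPS, throughput) by piping Get-Volume into Get-ClusterPerformanceHistory with the corresponding volume series names.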



Admin Center hyper-converged cluster error ('There are no more endpoints available from the endpoint mapper.').


Hello! I have a hyper-converged S2D cluster on Windows Server 2016 nodes. I'm trying to manage it with Windows Admin Center. Everything was done following this article: https://docs.microsoft.com/en-us/windows-server/manage/windows-admin-center/use/manage-hyper-converged

But when I try to connect to S2D with Admin Center, I get the error "Unable to create the "SDDC Management" cluster resource (required)", and in the cluster events I receive this error:

Cluster resource 'SDDC Management' of type 'SDDC Management' in clustered role 'Cluster Group' failed. The error code was '0x6d9' ('There are no more endpoints available from the endpoint mapper.').

 

then

The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

and then

Clustered role 'Cluster Group' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

What have I done wrong?

Disk Sharing - Server 2016 Cluster


I'm sure this is probably a newbie question, but it is so hard to find information.

A person would think the ability to do this would be the most fundamental idea in failover clustering.

I have Failover Clustering set up, and everything seems to be working fine. I also have SQL Server (2017) clustering set up, and everything seems to be working fine there too. I am able to run SQL queries, etc. from both nodes and from a different computer.

The problem is support files. We have spreadsheets and document templates that users need to be able to access. I tried putting them on the cluster nodes, but I cannot "share" any of the folders in Active Directory.

The File Server role is installed on both nodes (I call them SQL2 and SQL3). The drive I wish to share is listed in the Failover Cluster Manager as "Available Storage".

I put the drive in maintenance mode and was able to share a folder on SQL3, but it seems to be shared only on that node (the path is '\\SQL3\HDrive').

In Computer Management on SQL2, I cannot even see the drive under "Sharing" or in File Manager. But it does show up in Cluster Manager...

Will SQL2 see it if SQL3 goes offline?

Have I got it right? Or what am I missing?
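For reference, the usual way to get a share that fails over with the cluster (instead of living on one node) is a clustered File Server role that owns the disk and its own network name. A sketch, where the role name, IP address, disk name, path, and group are all placeholders for your environment:

```powershell
# Sketch: create a clustered File Server role so the share follows the
# cluster, not an individual node. All names below are placeholders.
Add-ClusterFileServerRole -Name "FS1" -Storage "Cluster Disk 2" -StaticAddress 192.168.1.50

# Create the share against the clustered disk, scoped to the role's name,
# so clients reach it as \\FS1\Support no matter which node owns the disk.
New-SmbShare -Name "Support" -Path "H:\Support" -ScopeName "FS1" -FullAccess "DOMAIN\Domain Users"
```

A share created directly on SQL3 is served only by SQL3; if that node goes offline, \\SQL3\HDrive goes with it, which matches the behavior described above.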

2019 Hyper-V Cluster - Quorum


Hi All,

I just finished setting up a Hyper-V cluster in our environment: a 3-node cluster.

Since the cluster node count is odd, is it still recommended, or even necessary, to add a (disk) witness in this case?

What I normally practice is to only add a witness when the node count is even (for tie-breaking).

Hopefully you can share your thoughts.

Thanks
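For context, since dynamic quorum arrived (2012 R2 and later), the general guidance has shifted toward "always configure a witness", odd or even node count, because the cluster only counts the witness vote when it actually needs a tie-breaker. A sketch, with placeholder names and paths:

```powershell
# Sketch: any of the witness types works; names/paths/keys are placeholders.
Set-ClusterQuorum -DiskWitness "Cluster Disk 1"
# or:
# Set-ClusterQuorum -FileShareWitness "\\fileserver\witness"
# or (2016 and later):
# Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<key>"
```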

S2D on 2019 - Perhaps a bug


Hi,

I have a homelab with two Dell R710s, each with 6 HDDs and 2 NVMe drives.

I have configured two virtual disks, created as nested mirror-accelerated parity, but if I suspend a node and reboot it, all vdisks go offline with the error:

The pack does not have a quorum of healthy disks.

This was working fine on the same servers on 2016.

But if I run this before the reboot, all vdisks stay online as they should:

Get-StorageScaleUnit -FriendlyName $Env:COMPUTERNAME | Enable-StorageMaintenanceMode

Has anyone else had the same experience?

BR

Martin
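For comparison, the documented pattern for servicing an S2D node is to drain it and put its storage into maintenance mode before the reboot, and undo both afterwards. A sketch (run on the node being serviced, assuming the storage scale unit name matches the computer name as in the command above):

```powershell
# Sketch: generally recommended sequence around a reboot of an S2D node.
Suspend-ClusterNode -Drain                         # move roles off this node
Get-StorageScaleUnit -FriendlyName $Env:COMPUTERNAME |
    Enable-StorageMaintenanceMode                  # quiesce its disks
Restart-Computer -Force

# After the node is back up:
Get-StorageScaleUnit -FriendlyName $Env:COMPUTERNAME |
    Disable-StorageMaintenanceMode
Resume-ClusterNode
```

Whether the 2019 behavior without maintenance mode is a bug is a separate question, but the sequence above is the generally recommended path either way.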

Cluster Shared Volumes: How to mask/unmask disk???

Upgrading the Network Load Balancing(NLB) cluster from 2008r2 to 2012r2

Server 2008 R2 cluster share volume showing RAW


I have an MSA SAN with 4 hosts running Server 2008 R2 Core and Hyper-V in a cluster.

I recently expanded the cluster shared volume. Everything seemed OK, as all nodes were able to see the now-larger CSV. I was then able to increase the VHD of one of my virtual machines. After a few days, random files on all my VMs started showing as corrupted and inaccessible. Some VMs got stuck in a stopping state and some stopped completely, while two other VMs stayed running with a couple of corrupted folders. I was also getting numerous errors in SCVMM when trying to do anything, such as create a new VM:

Warning (2912)
An internal error has occurred trying to contact an agent on the FHNJHV8E03.pdc1.com server.
 (The file or directory is corrupted and unreadable (0x80070570))

Recommended Action
Ensure the agent is installed and running. Ensure the WS-Management service is installed and running, then restart the agent.

I contacted Microsoft for support. They kept escalating my issue until I got a cluster expert, who decided to remove the cluster during troubleshooting. The CSV volume immediately went into a RAW state. I was then escalated to a disk specialist, who attempted to recover the CSV volume using diskprobe2. Microsoft was unable to recover the CSV volume and said it would have to be formatted.

Any help or ideas before I'm stuck formatting the CSV volume and rebuilding each VM would be great, as I only have backups of the files on the VMs.

Ideas for third-party recovery tools are also welcome.


Failover Cluster Manager bug on Server 2019 after .NET 4.8 installed - unable to type more than two characters in to the IP fields


We ran into a nasty bug on Windows Server 2019 and I can't find any KB articles on it. It's really easy to replicate. 

1. Install Windows Server 2019 Standard with Desktop Experience from an ISO. 

2. Install Failover Cluster Services.

3. Create a new cluster; on the 4th screen, add the current server name. This is what it shows:

cluster services working correctly before .NET 4.8 is installed

4. Install .NET 4.8 from an offline installer (KB4486153) and reboot.

5. After the reboot, go back to the same screen of the same Create Cluster Wizard and now it looks different:

cluster services broken after .NET 4.8 is installed - unable to put in a 3-digit IP

Now we are unable to type a 3-digit IP into any of the octet fields. Each field accepts a maximum of two characters.

Has anyone else encountered this? It should be really easy to reproduce. 

Windows cluster - Hardware migration with Windows 2016 upgrade


Dear Cluster Gurus.

This is my scenario:

Existing setup: HP Gen7 physical hosts with Windows 2008 R2 --> a two-node Windows cluster (node1 & node2) running an Oracle database and third-party applications. It has 4 virtual IPs and 10 shared disks.

New setup: we want to migrate the above cluster to new HP Gen10 hardware with Windows 2016 LTSB, without changing the hostnames or cluster names.

Proposed method 1: take an image of node1 and node2 (Windows 2008 R2) --> clone that image to the new hardware node1 & node2 --> shut down the old node1 & node2 --> start up the new node1 & node2 --> start the Windows 2016 upgrade on node1 & node2.

Does the above method work? Has anyone tried this?

What are the alternate solutions?

Thanks

Windows Server 2016 Failover Cluster Get-Volume lists all volumes


I created a 2-node failover cluster in my Hyper-V environment. 

My concern here is that when I ran:

Format-Volume -DriveLetter D

The D drives on both nodes were formatted.

When I ran Get-Volume on one of the nodes, I noticed that the D & E drives on each node were listed twice.

I noticed that 'Storage Replica' was added as a Cluster Resource Type and that the following device is installed:

Microsoft ClusPort HBA

Which some cursory research says:

"The Software Storage Bus (SSB) is a virtual storage bus spanning all the servers that make up the cluster. SSB essentially makes it possible for each server to see all disks across all servers in the cluster providing full mesh connectivity. SSB consists of two components on each server in the cluster; ClusPort and ClusBlft. ClusPort implements a virtual HBA that allows the node to connect to disk devices in all the other servers in the cluster. ClusBlft implements virtualization of the disk devices and enclosures in each server for ClusPort in other servers to connect to."

Is this by design? Is there a way to disable this? How do we fix this?

Windows Server 2016 Standard, running on Hyper-V



Can't move VMs in cluster to a particular host


I have a 3-node 2016 Datacenter cluster with multiple VMs. Right now, all VMs are on host 1 and host 2. If I try to live migrate to host 3, I get event IDs 1069 and 21502. I can migrate between hosts 1 and 2 at will with no problem. Even when I try a quick migration, the VM appears to move to host 3, but when I start it, it fails immediately.

The thing I've noticed is that I can access the Cluster Shared Volume from Windows Explorer on host 1 and host 2. If I try to access it on host 3, I get:

C:\clusterstorage\volume1 is not accessible. The referenced account is currently locked out and may not be logged on to.

The 1069 error reads:

Cluster resource 'Virtual Machine X' of type 'Virtual Machine' in clustered role 'X' failed. The error code was '0x775' ('The referenced account is currently locked out and may not be logged on to.').


Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

The 21502 error mentions:

'Virtual Machine X' failed to start.

'X' failed to start. (Virtual machine ID blah_blah)

'X' failed to start worker process: The referenced account is currently locked out and may not be logged on to. (0x80070775). (Virtual machine ID blah_blah)

'Virtual Machine Name Unavailable' could not initialize. (Virtual machine ID blah_blah)

'Virtual Machine Name Unavailable' could not read or update virtual machine configuration: The referenced account is currently locked out and may not be logged on to. (0x80070775). (Virtual machine ID blah_blah)

What account is it referring to? And shouldn't this be happening across all hosts instead of just host 3?

Any help is much appreciated.
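Error 0x775 is ERROR_ACCOUNT_LOCKED_OUT, and in this context it usually refers to an Active Directory computer account the cluster authenticates with — typically host 3's own machine account or the cluster name object (CNO) — rather than a user account. A sketch of quick checks (account names are placeholders; the AD module check assumes RSAT is installed):

```powershell
# Sketch: run on host 3 - verify the node's machine-account secure channel.
Test-ComputerSecureChannel -Verbose

# Check lockout state of the node's and the cluster's computer accounts
# ('HOST3' and 'CLUSTERCNO' are placeholders for your names).
Get-ADComputer -Identity HOST3      -Properties LockedOut | Select-Object Name, LockedOut
Get-ADComputer -Identity CLUSTERCNO -Properties LockedOut | Select-Object Name, LockedOut
```

That would also explain why only host 3 is affected: the other hosts authenticate with their own, unaffected accounts.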

2-node (Hyper-V) Failover Cluster dependency on Domain Controllers, DNS Servers, File Share Witness server


We have recently configured a two-node failover cluster for a client (a large multi-campus university). It is a Storage Spaces Direct cluster and it runs Hyper-V virtual machines. It is Windows Server 2016.

The two servers sit in the same rack and are connected through two switches, which also sit in that rack. We were hoping that with this configuration the virtual machines would have fairly decent availability.

Yesterday the client had a mishap during an attempt to update firmware on a router. This router connects the cluster with the rest of their infrastructure, including:

- DNS Servers

- Domain Controllers

- The file server, which we use as a file share witness in the cluster.

The result of losing that connection for 5-10 minutes was that all the virtual machines in the cluster stopped abruptly (no proper shutdown). They were automatically started again once the network connection was re-established, but obviously it was not a nice experience.

A few questions:

Is this expected behavior?

To what degree are failover clusters dependent on access to domain controllers, DNS Servers and witnesses (in the two-node configuration) for their continued operations?

Could the stoppage of all virtual machines have been avoided if the file server that acts as the cluster witness were sitting inside the same rack and directly connected to the same two switches as the cluster servers? I am thinking that it would not help, because it was added as a witness by its share name, so accessing it is likely dependent on a DNS lookup.

Would it even help to add more nodes to the cluster? I realise that many frown upon the 2-node setup, but I suspect that having more nodes would not help in this case.

Are there recommendations for how to handle something like this? Should we add a domain controller as a virtual machine in the cluster? Would that have avoided the VM stoppage? The folks that manage AD at the client are apparently very restrictive about the locations of domain controllers, so this is not something we can easily do.


Power down AAG cluster each night


We have a non-production SQL AAG: one primary replica and one secondary replica with continuous sync. It's mainly used for testing. To save some money, we would like to shut down all nodes in the cluster each night and bring them back up again in the morning.

I have the scripts to do this, but I wondered if there would be any issues with simply powering down all nodes at once and then bringing them back up again at the same time in the morning?

There are 3 servers (all Server 2016): one is the SQL primary replica, one is the SQL secondary replica, and there is a quorum server. Will I cause issues with them all shutting down and starting up within minutes of one another, or will they be able to sort themselves out automatically without any intervention each morning? I'd rather not have a battle each morning fixing the AAG!

Thanks!
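One hedged suggestion on ordering, so the node holding the latest cluster configuration and the primary replica is the last one down and the first one up (which avoids the "node does not have the latest copy of cluster configuration data" complaint on startup); server names are placeholders:

```powershell
# Sketch: evening shutdown - secondary first, primary last. Names are
# placeholders for your three servers.
$secondary = 'SQL-SEC'; $witness = 'QUORUM-SRV'; $primary = 'SQL-PRI'
Stop-Computer -ComputerName $secondary -Force
Stop-Computer -ComputerName $witness   -Force
Stop-Computer -ComputerName $primary   -Force

# Morning startup (via your VM platform or out-of-band management):
# power on $primary first, wait for the cluster service to come up,
# then start the others; the AG replicas should resume syncing on their own.
```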

Loopback adapter for DR load balancing breaks failover cluster


I am not completely sure whether failover clustering is the right forum for this or whether I should rather post in Exchange. However, I think the root issue is failover clustering as I am experiencing something similar to https://social.technet.microsoft.com/Forums/windowsserver/en-US/7616b0e5-6fb6-4be7-a859-14baa2e9b925/cluster-network-is-partitioned-due-to-loopback-adapter?forum=winserverClustering.

The setup is an Exchange 2019 CU3 DAG on three Windows Server 2019 nodes, which uses a failover cluster without a dedicated management port under the hood. What I want to achieve is layer 4 load balancing using direct server return, i.e. the load balancer rewrites the MAC addresses of incoming requests to direct them to the three Exchange nodes.

In order to achieve this, I need to add the shared IP of the load balancer to all three Exchange nodes so that they will accept the redirected packets. The only way I know to do this is to add a loopback adapter via Device Manager, add the IP, set its subnet mask to 255.255.255.255 to prevent it from being advertised via ARP, and enable weak host send and receive.

The setup itself is working, so I can access the Exchange services via the load balancer, but as in the link above, the failover cluster breaks after a short period of time making everything inaccessible. If I disable the loopback adapter, the network partitioning disappears and the cluster is up again.

I found some discussions on this issue which emphasised the importance of ensuring the right order of adapters. In my understanding, the interface metric is the only way to do this on Server 2019, so I set the loopback adapter's metric to 1000 on all machines.

Initially I also had IPv6 enabled, but was unable to tell the failover cluster not to use the loopback adapter in that case. Although I set the prefix length to 128, it still showed up with its link-local address. I also tried to tell the cluster explicitly not to use it with (Get-ClusterNetwork "Loopback network").Role = 0, as described at https://blogs.technet.microsoft.com/askcore/2014/02/19/configuring-windows-failover-cluster-networks/, but this command has no effect at all; the role does not change. Only removing IPv6 removes the loopback network from the list of cluster networks, but the network is still partitioned (I think because the shared IP on the loopback is in the same subnet as the physical one, if you apply the netmask of the physical adapter).

What am I missing here? There must be a way to configure this, because people seem to make DR load balancing work with Exchange.
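For comparison, the configuration usually described for direct server return on Windows looks like the sketch below; it matches the steps above, with the caveat from this post that the Role = 0 assignment may only take effect once the cluster re-evaluates its networks (for example after a cluster service restart). The alias, IP, and network name are placeholders:

```powershell
# Sketch: loopback adapter carrying the load balancer's shared VIP.
New-NetIPAddress -InterfaceAlias "Loopback" -IPAddress 10.0.0.100 -PrefixLength 32

# Weak host model so VIP traffic is accepted/sent via the real NICs.
Set-NetIPInterface -InterfaceAlias "Loopback" -WeakHostReceive Enabled -WeakHostSend Enabled

# High metric so the loopback never wins adapter ordering.
Set-NetIPInterface -InterfaceAlias "Loopback" -InterfaceMetric 1000

# Exclude the loopback network from cluster use (Role 0 = none).
(Get-ClusterNetwork -Name "Loopback network").Role = 0
```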



2016 Server Hyper-V Reverts to .XML files after joining a cluster


Has anyone else noticed that Windows Server 2016 reverts to using the old-format .XML files for VM configurations after joining a cluster? In this case the cluster was at the 2012 R2 functional level, which may affect it.

The problem we're having is that we had some local VMs on the machines, and as soon as the servers joined the cluster, the running machines disappeared from all management. Strangely enough, they are still running, but they no longer show up in Get-VM or in Hyper-V Manager.

So I RDPed into one of them, shut it down, and tried to re-import it, but 2016 would not even let me re-import it with the VMCX configuration files; it said 'No Virtual Machines Found' in that folder. I had to re-create it and attach the VHDX, and it created the old-style XML files.

I'm wondering if this has to do with the functional level, but all the VMs on the cluster have XML files, even ones created on 2016, so I'm thinking it might just be intentional?

Anyone seen this behavior?

Thanks!
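This matches how mixed-mode clusters are supposed to behave: while the cluster functional level is 2012 R2, new VMs are pinned to the down-level configuration version (5.0, .xml files) so they can still fail over to the older nodes. Once every node runs 2016, a sketch of the upgrade path (the per-VM step is one-way):

```powershell
# Sketch: after ALL nodes run 2016 - raise the cluster functional level,
# then upgrade each VM's configuration version to the newer .vmcx format.
Update-ClusterFunctionalLevel
Get-VM | Update-VMVersion        # irreversible; take backups first
```

That would not explain the VMs vanishing from Get-VM, but it does explain the persistent XML files.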

Errors in Cluster Validation Test for two node File Server Cluster


Hello,

I have a two-node Windows Server 2012 R2 File Server failover cluster.

Today I ran the Cluster Validation Test, with the result that there are some errors:

1. List Network Binding Order
Description: List the order in which networks are bound to the adapters on each node.
FQDN
An error occurred while executing the test.
The network path was not found.

2. Validate Cluster Service and Driver Settings
Description: Validate startup settings used by services and drivers, including the Cluster service, cluster storage, Cluster Shared Volumes, and the cluster virtual network adapter.
Validating the Cluster service (clussvc) on FQDN.
An error occurred while executing the test.
An error occurred getting the cluster node state for 'FQDN'.

The network path was not found

3.Validate Memory Dump Settings
Description: Validate that none of the nodes currently requires a reboot (as part of a software update) and that each node is configured to capture a memory dump if it stops running.
Validating software configuration.
An error occurred while executing the test.
The network path was not found.

So those are the errors. This is the only cluster where I have these errors.

Any help would be great.

Thanks
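"The network path was not found" during validation usually means SMB or RPC to the peer node is blocked, or a service the tests depend on is down. A sketch of quick checks from each node ('NODE2' is a placeholder for the other node's name):

```powershell
# Sketch: verify the paths cluster validation depends on.
Test-NetConnection -ComputerName NODE2 -Port 445      # SMB / admin shares
Test-NetConnection -ComputerName NODE2 -Port 135      # RPC endpoint mapper
Get-Service -ComputerName NODE2 -Name RemoteRegistry  # validation queries it remotely
```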

Duplicate disk signatures or disk GUIDs


Two disks have been found on node ***** with duplicate disk signatures or disk GUIDs. The disks involved are physical disk 1 and physical disk 2. There may be a Multipath I/O (MPIO) problem such as the software is not installed or is not working properly. If MPIO is not involved, or MPIO has been verified as working, you must either mask one of these disks off at this node, or run validation and specify a disk list that includes only one of these disks, for example by using the Test-Cluster cmdlet in Windows PowerShell.

I am getting this error while running cluster validation on Windows Server 2016.

All disks are formatted and renamed.

MPIO is installed and added.

The Dell storage is connected through fibre.

There are 9 disks in total; validation fails for only 2 of them.

Can someone help me with this? Why is this issue occurring? Is it a Windows error, or is the storage configuration wrong?

Thanks
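One way to narrow this down, following the advice embedded in the error text itself: dump the signatures and GUIDs on the node named in the error to find the colliding pair, check MPIO's view of its paths, and re-run validation with an explicit disk list. A sketch (the disk number is a placeholder):

```powershell
# Sketch: look for two entries sharing a Signature or Guid - that usually
# means one LUN seen twice (an MPIO path problem) rather than two disks.
Get-Disk | Select-Object Number, FriendlyName, Signature, Guid, Path

# MPIO's view of claimed disks and their path counts:
mpclaim -s -d

# Re-run storage validation against a specific disk list, as the error
# message suggests (parameter usage per the Test-Cluster documentation):
Test-Cluster -Disk (Get-Disk -Number 3)
```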

Failover Cluster is functioning but errors are generated


Hi All, 

I have a failover cluster that consists of two nodes with storage based on S2D.

Every day I'm getting these errors:

1205 - The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

1069 - Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. 

1564 - File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\SITE2VMHOST01\WITNESS'. Please ensure that file share '\\SITE2VMHOST01\WITNESS' exists and is accessible by the cluster.

1562 - File share witness resource 'File Share Witness' failed a periodic health check on file share '\\SITE2VMHOST01\WITNESS'. Please ensure that file share '\\SITE2VMHOST01\WITNESS' exists and is accessible by the cluster.

1688 - Cluster network name resource detected that the associated computer object in Active Directory was disabled and failed in its attempt to enable it. This may impact functionality that is dependent on Cluster network name authentication.

Network Name: Cluster Name
Organizational Unit: OU=Servers,OU=HO,DC=company,DC=com
Guidance: Enable the computer object for the network name in Active Directory.

1258 - Cluster network name resource failed registration of one or more associated DNS name(s) because the a DNS server could not be reached.

Cluster Network name: 'Cluster Name'
DNS Zone: 'company.com'
DNS Server: '192.168.5.12,192.168.7.20'

Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

Network Name: Cluster Name
Organizational Unit: OU=Servers,OU=HO,DC=company,DC=com
Guidance: Enable the computer object for the network name in Active Directory.

1254 - Clustered role 'Cluster Group' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

A few comments from me regarding some of the errors:

1. The FSW exists and is accessible by the cluster (meaning the failover cluster object has full control of the folder configured as the FSW).

2. At least one of the DNS server addresses is accessible.

Interestingly, the failover cluster itself is functioning as intended (apart from those error messages) and all roles are up and running.

I'm just wondering: how are these error messages generated if the failover cluster works fine?
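Two hedged checks that often explain this exact pattern, where roles run fine but the witness and CNO resources flap: confirm the witness path the cluster actually has configured, and confirm the CNO's computer account is enabled (event 1688 says the cluster found it disabled and could not re-enable it, which would also break secure DNS registration, i.e. event 1258). Names are placeholders; the second check assumes the RSAT AD module:

```powershell
# Sketch: what witness path is the cluster really using?
Get-ClusterResource "File Share Witness" | Get-ClusterParameter

# Is the cluster name object enabled / locked out?
# ('CLUSTERCNO' is a placeholder for your cluster's computer account.)
Get-ADComputer -Identity CLUSTERCNO -Properties Enabled, LockedOut |
    Select-Object Name, Enabled, LockedOut
```

Note that the witness share must grant the cluster's computer account (CLUSTERCNO$), not just administrators, both share and NTFS access.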

Cluster goes offline whenever I upgrade vmware tools on offline node and reboot


I have an older 2008 R2 cluster that is still extremely critical. We will be upgrading it soon, and I'm not looking for responses about that, but I have noticed twice that when I take a node offline for maintenance, I can run updates on it and reboot it multiple times with no impact on the primary node or the cluster. But when I upgrade VMware Tools on the offline node and reboot, the entire cluster goes down: the cluster service stops on the online node. This has happened twice, and it occurs with both nodes when they are the one offline being updated. My sample size is fairly large, because I will reboot a few times for Windows updates with no issues, and then I run the VMware Tools upgrade, which updates networking as well, reboot, and the entire cluster crashes. Any ideas?

Here are the two errors on the primary node when the offline node reboots after a VMware Tools upgrade:

Event 1561 - 

The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node. 
Try starting the cluster service on all nodes in the cluster. If the cluster service can be started on other nodes with the latest copy of the cluster configuration data, this node will be able to subsequently join the started cluster successfully. 

If there are no nodes available with the latest copy of the cluster configuration data, please consult the documentation for 'Force Cluster Start' in the failover cluster manager snapin, or the 'forcequorum' startup option. Note that this action of forcing quorum should be considered a last resort, since some cluster configuration changes may well be lost.

Event 1177 - 

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. 
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Thanks,


Dave





