Channel: High Availability (Clustering) forum

SOFS and connections


Hey

I would like to create an active/active file server for User Profile Disks (vhdx)

I have created a SOFS cluster (virtual machines + shared disk + CSV).

To my understanding, the SOFS setup still uses one server (the disk owner) to do the IO to the disk (shared disk on SAN).

If that's true - how do I create a "real" active/active cluster?

(Using Windows Server 2019)

I have approx. 1000 concurrent connections...
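
For reference, one way to see which node owns the CSV and which node each SMB client actually lands on (a sketch; assumes the FailoverClusters and SmbWitness cmdlets, run on a cluster node):

# Which node currently owns each Cluster Shared Volume (the coordinator node)
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

# Which file server node each SMB client session is attached to (requires continuous availability / the Witness service)
Get-SmbWitnessClient | Format-Table ClientName, FileServerNodeName, WitnessNodeName, State

With SOFS the data path is active/active at the SMB layer - each client does its IO through the node it connected to - while each CSV still has a single coordinator node for metadata operations.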

Mike


Cluster startup


Suppose I have a 4-node cluster + 1 file share witness (FSW), and at the beginning all 4 nodes are shut down.

I would like to start up the first 2 nodes; however, the cluster still seems to be down (maybe not enough quorum).

When I start up the 3rd node, the cluster comes up.

May I know if this is normal? I would have thought the vote of the FSW should also give the cluster quorum.
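
For reference, one way to check how the votes land once the cluster is up (a sketch; run from any node with the FailoverClusters module):

# Per-node votes: NodeWeight is the configured vote, DynamicWeight is what dynamic quorum currently assigns
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Witness type and resource
Get-ClusterQuorum | Format-Table Cluster, QuorumType, QuorumResource

With 4 nodes + FSW there are 5 votes, so 2 nodes + the witness (3 of 5) would normally be a majority - unless the witness cannot be arbitrated at startup, which is worth ruling out first.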

 

Clustering between a physical server DASD and a VM


Scenario :

Physical Server (PS1) with DASD: Windows 2012 R2 (multiple terabytes of data on multiple partitions/drives)

Virtual Server 1 (VS1): Windows 2016 (target server Day 1)

Virtual Server 2 (VS2): Windows 2016 (target server Day 2; once all data is synced with VS1, PS1 will be decommissioned)

In the past I have created clusters, but with new shared drives holding no data on day 1.

Now the challenge is to create a cluster without creating a new drive: use an existing drive on PS1, replicate the data to VS1 and later to VS2 (in shared mode), and then decommission PS1.

All this without losing PS1 service availability and without erasing data on the source server.
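
For what it's worth, one common way to stage such a move while PS1 stays in service is an initial copy plus repeated delta passes, with a short cutover window at the end (a sketch; paths are placeholders, and this only copies the data - it does not by itself turn PS1/VS1 into a cluster):

# Initial seed while PS1 remains live (repeat until a delta pass completes quickly)
robocopy D:\Data \\VS1\D$\Data /MIR /COPYALL /R:1 /W:1 /MT:16 /LOG:C:\Logs\seed.log

# Final delta pass during the cutover window, once users are disconnected from PS1
robocopy D:\Data \\VS1\D$\Data /MIR /COPYALL /R:1 /W:1 /MT:16 /LOG:C:\Logs\cutover.log

Note that /MIR mirrors deletions to the target, so it should only ever run in this direction (PS1 -> VS1).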


Luis M Astudillo Freelance Enterprise/Infrastructure Architect and Technology Strategic Planner LinkedIn: www.linkedin.com/in/luisma

Storage Spaces Direct / Cluster Virtual Disk goes offline when rebooting a node


Hello

We have several hyper-converged environments based on HP ProLiant DL360/DL380.
We have 3-node and 2-node clusters running Windows 2016 with current patches; firmware updates are done and a witness is configured.

The following issue occurs on at least one 3-node and one 2-node cluster:
When we put one node into maintenance mode (correctly, as described in the Microsoft docs, having checked that everything is fine) and reboot that node, it can happen that one of the Cluster Virtual Disks goes offline. It is always the disk "Performance", the SSD-only volume, in each environment. The issue occurs only sometimes: I can often reboot the nodes one after the other several times in a row and everything is fine, but sometimes the disk "Performance" goes offline. I cannot bring this disk back online until the rebooted node comes back online. Once the node that was down for maintenance is back online, the Virtual Disk can be brought online without any issues.

We have created 3 Cluster Virtual Disks & CSV Volumes on these clusters:
1x Volume with only SSD Storage, called Performance
1x Volume with Mixed Storage (SSD, HDD), called Mixed
1x Volume with Capacity Storage (HDD only), called Capacity

Disk Setup for Storage Spaces Direct (per Host):
- P440ar Raid Controller
- 2 x HP 800 GB NVME (803200-B21)
- 2 x HP 1.6 TB 6G SATA SSD (804631-B21)
- 4 x HP 2 TB 12G SAS HDD (765466-B21)
- No spare Disks
- Network Adapter for Storage: HP 10 GBit/s 546FLR-SFP+ (2 storage networks for redundancy)
- 3 Node Cluster Storage Network Switch: HPE FlexFabric 5700 40XG 2QSFP+ (JG896A), 2 Node Cluster directly connected with each other

The Cluster Events log shows the following errors when the issue occurs:

Error 1069 FailoverClustering
Cluster resource 'Cluster Virtual Disk (Performance)' of type 'Physical Disk' in clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Warning 5120 FailoverClustering
Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') has entered a paused state because of 'STATUS_NO_SUCH_DEVICE(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Error 5150 FailoverClustering
Cluster physical disk resource 'Cluster Virtual Disk (Performance)' failed.  The Cluster Shared Volume was put in failed state with the following error: 'Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk10\ClusterPartition2\ (error 2)'

Error 1205 FailoverClustering
The Cluster service failed to bring clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Error 1254 FailoverClustering
Clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

Error 5142 FailoverClustering
Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
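
For reference, the pre-reboot state checks amount to something like this (a sketch; FailoverClusters and Storage modules on a cluster node - an in-flight repair job at drain time is a usual suspect for exactly this symptom):

# No storage job should be running before a node is drained
Get-StorageJob | Format-Table Name, JobState, PercentComplete

# All virtual disks should be healthy and all CSVs online
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State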

Any hints / inputs appreciated. Has anyone had something similar?

Thanks in advance

Philippe



Simulate simultaneous node failures


Is there any way I can simulate simultaneous node failures? I used Task Scheduler to shut the nodes down at the same time, but I can't seem to get my expected results - the dynamic vote seems to change too fast.
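
For reference, one way to get closer to a truly simultaneous failure than scheduled shutdowns (a sketch; node names are placeholders):

# Invoke-Command fans out to the listed nodes in parallel, so both go down at nearly the same moment
Invoke-Command -ComputerName Node1, Node2 -ScriptBlock { Stop-Computer -Force }

Note that any graceful shutdown gives the cluster service time to leave membership cleanly and re-weight votes, which is likely why the dynamic vote "changes too fast"; powering the VMs off at the hypervisor level is the closest thing to a real simultaneous failure.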

Windows NLB takes too long to recover/failback


I've built a fresh Windows Server 2016 NLB cluster of two Web Application Proxy servers.

They are both Hyper-V VMs on the same host and with MACAddressSpoofing set to ON.

Both servers are dual-NIC and the cluster is using the second dedicated NIC in each case.

The Cluster is set to Unicast.

The cluster constantly responds to pings regardless of which node is down - this is good. 

BUT... the cluster fails to respond to HTTPS traffic for about a minute while either node is restarted, whether at the VM or NLB control level. As mentioned above, ping stays steady throughout.

Is this expected behaviour?
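
For reference, for planned restarts the gap can usually be shortened by draining the node out of the NLB cluster before rebooting it, instead of letting its connections die with it (a sketch; host names are placeholders, NetworkLoadBalancingClusters module):

# Drain existing connections and stop taking new ones, then stop the node (timeout in minutes)
Stop-NlbClusterNode -HostName WAP1 -Drain -Timeout 5

# After the restart, rejoin the node
Start-NlbClusterNode -HostName WAP1

For an unplanned node loss, some delay is expected while the surviving host re-converges and takes over the affected port rules; steady pings only show that the cluster IP answers, not that the port rule has failed over.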

WSFC broken, please help diagnose


I have a 2016 WSFC with the file server role: 2 nodes in the cluster with shared storage. We lost power to Node2, which died; when bringing it back up, it won't join the cluster (it shows 'Down' in Failover Cluster Manager). If I shut the entire cluster down completely and start it on Node2 first, Node2 runs the cluster fine, but Node1 now won't join the cluster (shows 'Down').

As far as I can tell, all connectivity seems fine: I've turned off Windows Firewall, the network between the two servers is working fine, and there are no firewalls between the two nodes. Other clusters are running on the same infrastructure.

The only hint in Failover Cluster Manager is that the network connection for Node2 shows as offline (the network is up and working, has "allow traffic and management" ticked, and I can ping, RDP, etc.).

When I shut down and then restart the entire cluster Node2-first, the roles become reversed: Node1 now shows its network as offline. The information details and critical events for the network have no entries.

Critical events for Node2 itself, when it is in the down state, show: Error 1653 "Cluster node 'Node2' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls." However, I'm not convinced this is actually the issue, because of the error messages below:

The failover clustering log is as follows:

00000774.00001c4c::2018/05/15-16:48:50.659 INFO  [Schannel] Server: Negotiation is done, protocol: 10, security level: Sign
00000774.00001c4c::2018/05/15-16:48:50.663 DBG   [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 161
00000774.00001c4c::2018/05/15-16:48:50.712 DBG   [Schannel] Server: ASC, sec: 90312, buf: 2059
00000774.00001c4c::2018/05/15-16:48:50.728 DBG   [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 1992
00000774.00001c4c::2018/05/15-16:48:50.730 DBG   [Schannel] Server: ASC, sec: 0, buf: 51
00000774.00001c4c::2018/05/15-16:48:50.730 DBG   [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Synchronize, buf: 0
00000774.00001c4c::2018/05/15-16:48:50.730 INFO  [Schannel] Server: Security context exchanged for cluster
00000774.00001c4c::2018/05/15-16:48:50.735 DBG   [Schannel] Client: ISC, sec: 90312, buf: 178
00000774.00001c4c::2018/05/15-16:48:50.736 DBG   [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 60
00000774.00001c4c::2018/05/15-16:48:50.736 DBG   [Schannel] Client: ISC, sec: 90312, buf: 210
00000774.00001c4c::2018/05/15-16:48:50.749 DBG   [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 2133
00000774.00001c4c::2018/05/15-16:48:50.752 DBG   [Schannel] Client: ISC, sec: 90364, buf: 58
00000774.00001c4c::2018/05/15-16:48:50.753 DBG   [Schannel] Client: ISC, sec: 90364, buf: 14
00000774.00001c4c::2018/05/15-16:48:50.753 DBG   [Schannel] Client: ISC, sec: 90312, buf: 61
00000774.00001c4c::2018/05/15-16:48:50.754 DBG   [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 75
00000774.00001c4c::2018/05/15-16:48:50.754 DBG   [Schannel] Client: ISC, sec: 0, buf: 0
00000774.00001c4c::2018/05/15-16:48:50.754 INFO  [Schannel] Client: Security context exchanged for netft
00000774.00001c4c::2018/05/15-16:48:50.756 WARN  [ClRtl] Cannot open crypto container (error 2148073494). Giving up.
00000774.00001c4c::2018/05/15-16:48:50.756 ERR   mscs_security::SchannelSecurityContext::AuthenticateAndAuthorize: (-2146893802)' because of 'ClRtlRetrieveServiceSecret(&secretBLOB)'
00000774.00001c4c::2018/05/15-16:48:50.756 WARN  mscs::ListenerWorker::operator (): HrError(0x80090016)' because of '[SV] Schannel Authentication or Authorization Failed'
00000774.00001c4c::2018/05/15-16:48:50.756 DBG   [CHANNEL 172.23.1.15:~56287~] Close().

specifically:

Server: Negotiation is done (aka they talked to each other?)

[ClRtl] Cannot open crypto container (error 2148073494). Giving up.
mscs_security::SchannelSecurityContext::AuthenticateAndAuthorize: (-2146893802)' because of 'ClRtlRetrieveServiceSecret(&secretBLOB)'
mscs::ListenerWorker::operator (): HrError(0x80090016)' because of '[SV] Schannel Authentication or Authorization Failed'

I can't find many (if any) articles dealing with these messages; the ones I can find say to make sure permissions are correct on %SystemRoot%\Users\All Users\Microsoft\Crypto\RSA\MachineKeys

I did have to change some of the permissions on these files, but I still couldn't join the cluster. Other than that, I'm struggling to find any actual issues (SMB access from Node1 to Node2 appears to be fine, SMB access from Node2 to Node1 appears to be fine, DNS appears to be working fine, and the file share witness seems to be fine).
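
For reference, a quick way to dump those ACLs for comparison against a healthy node (a sketch; on most systems the folder resolves to C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys, and error 2148073494 is 0x80090016, the same NTE_BAD_KEYSET / "Keyset does not exist" family as the join error below):

# List ACLs recursively, continuing past errors; run on both nodes and diff the output
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /T /C > C:\Temp\machinekeys-acl.txt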

Finally, the cluster validation report shows the following as the only errors with the cluster:

Validate disk Arbitration: Failed to release SCSI reservation on Test Disk 0 from node Node2.domain: Element not found.

Validate CSV Settings: Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node Node1.domain to the share on node Node2.domain. The network path was not found.

Validate CSV Settings: Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node Node2.domain to the share on node Node1.domain. The network path was not found.

other errors from the event logs

ID 5398: Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.
Votes required to start cluster: 2
Votes available: 1
Nodes with votes: Node1 Node2
Guidance: Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. The cluster will be able to start and the nodes will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the 'Start-ClusterNode -FQ' Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node's copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.

ID 4350: Cluster API call failed with error code: 0x80070046. Cluster API function: ClusterResourceTypeOpenEnum. Arguments: hCluster: 4a398760, lpszResourceTypeName: Distributed Transaction Coordinator, lpcchNodeName: 2

Lastly, I built another server, Node3, to see if I could join it to the cluster, but this fails:

* The server 'Node3.domain' could not be added to the cluster. An error occurred while adding node 'Node3.domain' to cluster 'CLUS1'. Keyset does not exist

I've done the steps here with no joy: http://chrishayward.co.uk/2015/07/02/windows-server-2012-r2-add-cluster-node-cluster-service-keyset-does-not-exist/



SOFS and load balancing


Hey

Just created a cluster running Scale-Out File Server.

I need to have an active/active system due to the many connections.

When looking into open files, it seems users only connect to one server in the cluster (I have enabled continuous availability).

Why?
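
For reference, the client-to-node distribution can be inspected and even nudged manually with the SMB witness cmdlets (a sketch; names are placeholders - SOFS balances per client, not per connection, so a small client count can look lopsided):

# See which file server node each client is attached to
Get-SmbWitnessClient | Format-Table ClientName, FileServerNodeName, WitnessNodeName

# Move one client's SMB sessions to another node
Move-SmbWitnessClient -ClientName CLIENT01 -DestinationNode NODE2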

Mike


Sysadm


Unable to connect to cluster using Failover Cluster Manager


In the last day or so I've discovered that I'm no longer able to connect to my cluster using Failover Cluster Manager.  When I try to connect it gives me the error "An error occurred connecting to the cluster <cluster name>.  An error occurred trying to display the cluster information.  One or more errors occurred."

If I log onto the node that's hosting the "Cluster Group" cluster resource, and open Failover Cluster Manager and connect to <Cluster on this server...> it gives me a similar message.

As far as I can see, the Cluster Powershell commands still work, so I've moved the "Cluster Group" group to another node, but I am still unable to connect to the cluster using Failover Cluster Manager.

The Cluster Service is running, and I'm reluctant to restart it because I know my virtual machines are going to start failing or failing over when I do.

I have verified that I am logged in as a super-duper user, and I even tried logging in as the Domain Administrator with no luck. I have also tried the WMI mofcomp suggestion as described in http://www.yusufozturk.info/windows-server/failover-cluster-manager-an-error-occurred-to-display-cluster-information.html with no luck.
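
For reference, Failover Cluster Manager talks to the cluster through the root\MSCluster WMI namespace, so testing that namespace directly on each node can isolate which side is broken (a sketch):

# If this fails on a node, that node's cluster WMI provider is the problem, not the cluster itself
Get-WmiObject -Namespace root\MSCluster -Class MSCluster_Cluster -ComputerName NODE1

# The mofcomp fix from the linked article recompiles that namespace from the shipped MOF
mofcomp $env:windir\System32\wbem\ClusWMI.mof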

Any suggestions?

Windows AD user password reset details through PowerShell script


Hello All,

How can I get the password change/reset logs from the AD server through a PowerShell command, running automatically on a daily basis? Currently I am using the command below to get the last-reset details:


# Last password set time and never-expires flag for every user
Get-ADUser -Filter * -Properties PasswordLastSet, PasswordNeverExpires | Format-Table Name, PasswordLastSet, PasswordNeverExpires
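
If the goal is a daily record of who changed or reset which account (rather than just the last-set timestamp), one option is to pull the audit events from the domain controller's Security log each day (a sketch; assumes "Audit User Account Management" auditing is enabled, and the DC name is a placeholder - event 4723 is a password change attempt, 4724 an administrative reset):

# Password change (4723) and reset (4724) events from the last day
Get-WinEvent -ComputerName DC01 -FilterHashtable @{
    LogName   = 'Security'
    Id        = 4723, 4724
    StartTime = (Get-Date).AddDays(-1)
} | Select-Object TimeCreated, Id, Message

Registering that as a daily scheduled task (for example with Register-ScheduledTask) covers the automatic daily part.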



dinesh kumar

Stretch Cluster / Storage Replica / Log volume VSS snapshots due to built-in cluster config backup?


Hi fellow Engineers.

I am currently investigating an annoying issue on a virtual (!) WSFC (2016) that is being used as a 4-node HA file server (a 2-node HA pair in each datacenter). Storage (2x 5 TB) is replicated successfully (synchronous/write-ordered) between the 2 datacenters.

The annoying issue is that every 4 hours (randomized within a few minutes), all connected users experience a short freeze of a few seconds up to a minute when accessing the file server. Looking at the logs and the Storage Replica known issues, it is clear this is due to something trying to create a VSS snapshot of the replica LOG volume (which you should not do!!!), and the culprit seems to be an internal mechanism trying to create a cluster config backup, including a VSS snapshot of all local volumes of the role owner.

If I switch the role to another node, the issue just follows it, so it is tied to the role owner, not the cluster owner!

There is no backup scheduled at that time, and I have no idea what would create an automatic VSS snapshot of all connected volumes.

Before going into details... I have troubleshot the hell out of this thing and cannot find it. I do have some ideas, but the timestamps do not match.
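
For reference, one way to pin down who is requesting the snapshots is to correlate the freeze times with VSS activity and the registered writers (a sketch; provider and log names are the standard ones, exact event IDs vary by cause):

# VSS events from the Application log around the last freeze window
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'VSS' } -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message

# Registered writers - the 'Cluster Database' writer being involved would point at a cluster config backup
vssadmin list writers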

Environment:

Lenovo blades (x240) with VMware 6.5u1 (the issue was also present on 6.5)

Dell Compellent Storage

3x 1Gbit uplinks (VMXNET3) per Node 

Veeam Backup & Replication (9.5u2) using the latest Veeam Agent, so we are not using the VMware API for backup. When the Veeam Agent backup schedule runs, the issue is not present, as only the data volumes are backed up (using VSS).

Is anyone else having the same issue, or has anyone seen it before?

Windows 2016 cluster QuarantineThreshold
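
For reference, the quarantine behaviour this title refers to is controlled by cluster common properties that can be read and changed in place (a sketch; the defaults in the comments are the documented 2016 values):

# QuarantineThreshold: node failures within an hour before the node is quarantined (default 3)
# QuarantineDuration: seconds a quarantined node stays out of membership (default 7200)
(Get-Cluster) | Format-List QuarantineThreshold, QuarantineDuration

# Example: raise the threshold
(Get-Cluster).QuarantineThreshold = 5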

Storage Spaces Direct is the most unreliable, non highly-available high-availability solution Microsoft have ever released!


Storage Spaces Direct is the most unreliable, non-highly-available high-availability solution Microsoft have ever released! We purchased a fully validated S2D solution from Dell, and in the two years since we put it in we have had:
> Total cluster failure with complete loss of data on CSVs, requiring a full rebuild from scratch and a complete DR of over 200 virtual machines!
> Hosts crashing during routine patching, due to differences in the binaries, causing CSVs to go offline and VMs to crash.
> Bug after bug.
> To fix the issues introduced in the May 2018 UR you have to install the October 2018 UR, which is itself a massive risk: during the patching cycle to install the Oct 2018 UR, expect to see loss of storage.
> There are still some serious bugs in S2D that Microsoft have been unable to fix, and still no news.

> We have been advised by Dell that, at present, for all customers they do not recommend patching S2D online: either shut down all VM roles on the cluster and patch, or migrate all roles off the cluster to patch.

WTF!



Microsoft Partner

Windows 2016 Datacenter: Cannot create Storage Replica for stretch cluster deployment.


I have 4 virtual machines running Windows 2016 with failover clustering.

These are spread across 2 data centers (2 VMs in each data center).

Each pair of VMs has 2 shared VHD Sets for the log and replica volumes.

The cluster is configured and passes checks.

The file server role is configured on site A, and the proposed replica volume is selected.

When I try to replicate the volume with the file server, I get the following error when the wizard tries to complete:

* Failed to create replication.

ERROR CODE: 0x80131500

NATIVE ERROR CODE: 3

Invalid namespace

There is no firewall on these servers at the moment, and there is also no firewall between the data centers.

WMI seems to be running on the nodes as well.

The cluster passes validation checks, with some minor warnings about a single network adapter etc.
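
For reference, an "Invalid namespace" error from the replication wizard is usually WMI-speak for the Storage Replica provider not being registered on one of the nodes, which is worth ruling out by checking the feature everywhere (a sketch; node names are placeholders):

# The Storage Replica feature (and its WMI provider) must be installed on every node
Invoke-Command -ComputerName NodeA1, NodeA2, NodeB1, NodeB2 -ScriptBlock {
    Get-WindowsFeature -Name Storage-Replica
}

# If it is missing on a node:
# Install-WindowsFeature -Name Storage-Replica -Restart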


Clustering Windows Server 2016 Datacenter


Hello,

I have been asked to cluster 2 HP ProLiant 380 Gen 10 servers.

On both servers, RAID 1 (the first 2 disks) and RAID 5 (the last three disks) have been configured.

I have already installed Windows Server 2016 Datacenter on both of them, and Hyper-V is also installed.

The iLOs have been configured.

I need help configuring the cluster with Server 1 and Server 2, and connecting them physically for full redundancy on these 2 Cisco 3850 switches.

I need the heartbeat configured.

Can someone help me, or direct me to a forum or blog where I can follow a step-by-step process to do the clustering and physically connect these devices?

Just a side note that later there will be a SAN added and it will be connected to both of those switches.
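
For reference, once the networking is cabled, the cluster creation itself usually comes down to validation plus New-Cluster (a sketch; the name and static address are placeholders, and -NoStorage fits here since the SAN comes later):

# Validate first - the report flags network and redundancy problems before they bite
Test-Cluster -Node Server1, Server2

# Create the cluster without storage for now; add the SAN disks when they arrive
New-Cluster -Name HVCLU01 -Node Server1, Server2 -StaticAddress 192.168.1.50 -NoStorage

There is no separate heartbeat network to configure on 2016: the cluster heartbeats over every enabled cluster network, so giving the nodes two independent network paths (e.g. one NIC to each Cisco 3850) is what provides the redundancy.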

Thanks a lot

Abhi



Firewall block traffic of heart beat


I want to do some tests in the cluster simulating heartbeat loss, and I want to use Windows Firewall to perform this.

Which ports should I block?

I tried blocking UDP and TCP 3343, but it still doesn't seem to work.
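
For reference, cluster heartbeats run over UDP port 3343 (the cluster service also uses TCP 3343 for join and intra-cluster traffic), so a test rule needs to cover the right direction(s) (a sketch):

# Block cluster traffic in both directions on this node
New-NetFirewallRule -DisplayName 'Block cluster UDP out' -Direction Outbound -Protocol UDP -RemotePort 3343 -Action Block
New-NetFirewallRule -DisplayName 'Block cluster UDP in' -Direction Inbound -Protocol UDP -LocalPort 3343 -Action Block

Block rules take precedence over the built-in Failover Clusters allow rules, so these two should be enough; if the node still is not detected as down, a packet capture confirming that 3343 traffic is actually being dropped would be the next step.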

Guys, can we get the RegKey back for passing WSSD on the Windows Server S2D cluster please...


As per the official Microsoft position on Windows Server 2019 Datacenter:

"...When can I deploy Storage Spaces Direct in Windows Server 2019 into production?

Microsoft recommends deploying Storage Spaces Direct on hardware validated by the WSSD program. For Windows Server 2019, the first wave of WSSD offers will launch in February 2019, in about three months.

If you choose instead to build your own with components from the Windows Server 2019 catalog with the SDDC AQs, you may be able to assemble eligible parts sooner. In this case, you can absolutely deploy into production – you’ll just need to contact Microsoft Support for instructions to work around the advisory message. ..."


regards,

Alex 


S2D IO TIMEOUT when rebooting node


I am building a 6-node cluster: 12 x 6 TB drives, 2 x 4 TB Intel P4600 PCIe NVMe drives, Xeon Platinum 8168, 768 GB RAM, LSI 9008 HBA.

The cluster passes all tests, the switches are properly configured, and the cluster works well, exceeding 1.1 million IOPS with VMFleet. However, at the current patch level (as of April 18, 2018) I am experiencing the following scenario:

When no storage job is running and all vdisks are listed as healthy, I can pause and drain a node and all is well - until the server actually reboots or goes offline. At that point a repair job is initiated and IO suffers badly, and can even stop altogether, causing vdisks to go into a paused state due to IO timeout (listed as the reason in cluster events).

Exacerbating this issue, when the paused node reboots and rejoins, it causes the repair job to suspend, stop, then restart (it seems - tracking this is hard, as all storage commands become unresponsive while the node is joining). At this point IO is guaranteed to stop on all vdisks at some point, for long enough to cause problems, including VM reboots.

The cluster was initially formed using VMM 2016. I have tried manually creating the vdisks, using single resiliency (3-way mirror) and multi-tier resiliency, with the same effect. This behavior was not observed when I did my POC testing last year. It's frankly a deal breaker and unusable: if I cannot reboot a single node without stopping my workload entirely, I cannot deploy. I'm hoping someone has some info. I'm going to re-install with Server 2016 RTM media, keep it unpatched, and see if the problem remains; however, it would be desirable to at least start the cluster fully patched. Any help appreciated. Thanks
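
For reference, the drain/verify sequence generally used before taking an S2D node down - the key part being that no repair job should still be running when the next node is touched (a sketch):

# Drain roles off the node and wait for the drain to complete
Suspend-ClusterNode -Name Node1 -Drain -Wait

# ... reboot / maintain the node, then bring it back ...
Resume-ClusterNode -Name Node1

# Wait for repair jobs to finish and disks to return to healthy before touching the next node
while (Get-StorageJob | Where-Object JobState -eq 'Running') { Start-Sleep -Seconds 60 }
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus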


How can we move the Quorum Disk from Node1 to Node2? - Windows 2012 R2 - Hyper-V Clustering


Hello,

We have created a cluster with 2 nodes and created a role for a file share. There are 3 disks in total in the cluster; among the three, we have allocated 1 disk as the quorum disk.

When Node1 is powered off, all 3 disks move to Node2 automatically. But I would like to know how we can move the quorum disk from Node1 to Node2 when both nodes are active.

We can move the 2 data disks from Node1 to Node2 while both nodes are powered on, but through the same option (right-click on the disk -> Move -> Select Node) I am unable to move the quorum disk from Node1 to Node2.

Kindly suggest on this!
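
For reference, the witness disk lives in the core resource group ("Cluster Group"), which Failover Cluster Manager does not move via the disk's own context menu; moving the group moves the disk with it (a sketch):

# Move the core cluster group (which holds the quorum/witness disk) to Node2
Move-ClusterGroup -Name 'Cluster Group' -Node Node2

# Confirm where it landed
Get-ClusterGroup -Name 'Cluster Group' | Format-Table Name, OwnerNode, State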

Thanks & Regards,

Anoop Nair.


Anoop Nair

Windows 2019 S2D cluster failed to start, event ID 1809


Hi, I have a lab with an Insider Windows 2019 cluster that I in-place upgraded to the RTM version of Server 2019. The cluster is shut down after a while, and event ID 1809 is listed:

This node has been joined to a cluster that has Storage Spaces Direct enabled, which is not validated on the current build. The node will be quarantined.
Microsoft recommends deploying SDDC on WSSD [https://www.microsoft.com/en-us/cloud-platform/software-defined-datacenter] certified hardware offerings for production environments. The WSSD offerings will be pre-validated on Windows Server 2019 in the coming months. In the meantime, we are making the SDDC bits available early to Windows Server 2019 Insiders to allow for testing and evaluation in preparation for WSSD certified hardware becoming available.

Customers interested in upgrading existing WSSD environments to Windows Server 2019 should contact Microsoft for recommendations on how to proceed. Please call Microsoft support [https://support.microsoft.com/en-us/help/4051701/global-customer-service-phone-numbers].

It's kind of weird, because my S2D cluster is running in VMs. Is there some registry switch to disable this stupid lock???
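
For reference, the quarantine state itself can be cleared so the node rejoins, although this does not remove the underlying build advisory - per the event text, that part goes through Microsoft support (a sketch; the node name is a placeholder):

# Clear quarantine and start the cluster service on the node
Start-ClusterNode -Name Node1 -ClearQuarantine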

