Channel: High Availability (Clustering) forum

Create Failover Cluster/New-Cluster fails to complete on Windows Server 2016


Good afternoon,

Need help with what seems like a simple task but keeps failing. We’re trying to build a Windows Server 2016 failover cluster, and it fails every time, while a Windows Server 2012 R2 failover cluster builds successfully in the same domain with the same accounts. Here are the details of each configuration. I'll be glad to provide any additional information that could help.

Thanks,  -jim

Windows 2016 Failover Cluster

AD – Windows 2016 domain

FFL – Windows 2012 R2 Forest Functional Level

DFL – Windows 2016 Domain Functional Level

2 servers, Windows 2016 Datacenter

Event Viewer – FailoverClustering DiagnosticVerbose log enabled

Results: Cluster validation passes, and we choose to create the cluster from the validated configuration. The build immediately fails, with very little detail in cluster.log (see below).

Same results via the GUI or the PowerShell New-Cluster cmdlet.

Windows 2012 R2 Failover Cluster

AD – Windows 2016 domain

FFL – Windows 2012 R2 Forest Functional Level

DFL – Windows 2016 Domain Functional Level

2 servers, Windows 2012 R2 Standard

Results: The build completes successfully, with a ton of detail in cluster.log.

Some additional points/details…

- The Create Cluster Wizard report shows the 'bind to domain controller . more data is available.' error (see details below).

- Prestaged the CNO; no difference with or without it.

- We've also tried the build with and without the 'Deny Access to this computer from the Network' policy set. Still fails.

- The cluster DiagnosticVerbose logs are not showing much detail or any errors.

- Tried an alternate pair of Windows 2016 servers in two domains of the forest; same error.

- It seems to be an AD permissions error, since the failure happens right after the cluster build dialog that states 'Find a suitable domain controller for node <nodename>'.
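For reference, the PowerShell attempt looks roughly like this (cluster name, node names, and static address are placeholders for ours), along with how we pull the verbose log right after the failure:

# Build attempt -- fails the same way as the wizard (-NoStorage just to rule storage out):
New-Cluster -Name CLUS16 -Node NODE1,NODE2 -StaticAddress 10.0.0.50 -NoStorage

# Collect the last 15 minutes of cluster log from both nodes:
Get-ClusterLog -Node NODE1,NODE2 -TimeSpan 15 -Destination C:\Temp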

Cluster.log from failed Windows 2016 build…

00002a78.00002b2c::2018/03/20-14:54:06.249 DBG   Cluster node cleanup thread started.

00002a78.00002b2c::2018/03/20-14:54:06.249 DBG   Starting cluster node cleanup...

00002a78.00002b2c::2018/03/20-14:54:06.249 DBG   Disabling the cluster service...

00002a78.00002b2c::2018/03/20-14:54:06.251 DBG   Releasing clustered storages...

00002a78.00002b2c::2018/03/20-14:54:06.252 DBG   Getting clustered disks...

00002a78.00002b2c::2018/03/20-14:54:06.252 DBG   Waiting for clusdsk to finish its cleanup...

00002a78.00002b2c::2018/03/20-14:54:06.253 DBG   Clearing the clusdisk database...

00002a78.00002b2c::2018/03/20-14:54:06.254 DBG   Waiting for clusdsk to finish its cleanup...

00002a78.00002b2c::2018/03/20-14:54:06.255 DBG   Relinquishing clustered disks...

00002a78.00002b2c::2018/03/20-14:54:06.255 DBG   Opening disk handle by index...

00002a78.00002b2c::2018/03/20-14:54:06.258 DBG   Getting disk ID from layout...

00002a78.00002b2c::2018/03/20-14:54:06.258 DBG   Reset CSV state ...

00002a78.00002b2c::2018/03/20-14:54:06.259 DBG   Relinquish disk if clustered...

00002a78.00002b2c::2018/03/20-14:54:06.261 DBG   Opening disk handle by index...

00002a78.00002b2c::2018/03/20-14:54:06.263 DBG   Getting disk ID from layout...

00002a78.00002b2c::2018/03/20-14:54:06.264 DBG   Reset CSV state ...

00002a78.00002b2c::2018/03/20-14:54:06.264 DBG   Relinquish disk if clustered...

00002a78.00002b2c::2018/03/20-14:54:06.266 DBG   Opening disk handle by index...

00002a78.00002b2c::2018/03/20-14:54:06.271 DBG   Resetting cluster registry entries...

00002a78.00002b2c::2018/03/20-14:54:06.273 DBG   Resetting NLBSFlags value ...

00002a78.00002b2c::2018/03/20-14:54:06.278 DBG   Unloading the cluster Windows registry hive...

00002a78.00002b2c::2018/03/20-14:54:06.279 DBG   Getting the cluster Windows registry hive file path...

00002a78.00002b2c::2018/03/20-14:54:06.280 DBG   Getting the cluster Windows registry hive file path...

00002a78.00002b2c::2018/03/20-14:54:06.281 DBG   Getting the cluster Windows registry hive file path...


Hyper-V on different server versions


Good day,

I am in the process of upgrading my Hyper-V servers from 2012 R2 to 2016 and came across a question I had not thought about. Can I run a clustered Hyper-V system with hosts on two different Server versions, two on 2012 R2 and one on 2016? I was going to just create a new server and add it in, but I have not been able to find any documentation on the best way to remove a node from the cluster and add one to it. I would like to just build three 2016 servers, add them to the cluster, and remove the others.
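The sequence I'm picturing for each node, based on the cluster OS rolling upgrade approach, is roughly the following (node name is a placeholder); is this the supported way?

Suspend-ClusterNode -Name HV01 -Drain   # drain the roles off the node
Remove-ClusterNode -Name HV01           # evict it so it can be rebuilt on 2016
# ...rebuild HV01 (or bring up a new server) on Windows Server 2016, then:
Add-ClusterNode -Name HV01              # cluster runs in mixed mode at the 2012 R2 level
# Only after every node is on 2016 (this step is irreversible):
Update-ClusterFunctionalLevel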

Really, any help would be appreciated. I ran across a couple of step-by-step guides for other tasks, and those seemed to help the staff be successful; since we are in the medical field, everyone wants everything to be perfect.

Thank you in advance for any suggestion or answer,

Michael Deininger

Southwest Transplant Alliance

new-cluster static address was not found on any cluster network


Hi guys,

Recently my 2-node cluster ran into an issue: one of my nodes was not able to start up. I tried running Start-ClusterNode -ForceQuorum,

but I got this error:

Start-ClusterNode : The system cannot find the file specified.

No solution found, so I went to the other node, which was still in the cluster, and removed the cluster.

Removing the cluster went fine, no issue. However, I ran into another problem the moment I tried to recreate the cluster.

The error shown was -

New-Cluster: Static address 'x.x.x.x' was not found on any cluster network.

If anyone knows what's going on, please let me know. Thanks.
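For reference, the recreate attempt looks roughly like this (cluster name, nodes, and address are placeholders); as far as I understand, the static address has to fall inside a subnet that already exists on an enabled NIC of every node:

# Check which IPv4 subnets the nodes actually have:
Get-NetIPAddress -AddressFamily IPv4 | Format-Table IPAddress,InterfaceAlias,PrefixLength

# Recreate the cluster with a static address inside one of those subnets:
New-Cluster -Name CLUS01 -Node NODE1,NODE2 -StaticAddress 192.168.1.50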


Powershell command to add "IP Address" to a "Network name" resource.


Hi,

Within Windows Failover Cluster Manager I can see my SQL 2012 Always On availability group under Roles.
Within the availability group, I can add a Client Access Point (AG listener), either manually in the GUI or via the following command:

Add-ClusterResource -Name "crag3" -ResourceType "Network Name" -Group "Contoso-ag1"

Question is, how do I add the "IP Addresses" to this Network Name Resource using PowerShell?
I can do it via the GUI.

Thanks in advance for any help,
HM

Crag2 - was created manually in the GUI and works as expected.

Crag3 - I've created the first part using the PS command above; however, I can't figure out how to add the "IP Address" now.
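The closest pattern I've been able to piece together is below (the address, subnet mask, and cluster network name are guesses for my environment): create an "IP Address" resource in the same group, set its parameters, then make the network name depend on it.

# Create the IP Address resource in the same group as the network name:
Add-ClusterResource -Name "crag3_ip" -ResourceType "IP Address" -Group "Contoso-ag1"

# Set its address, subnet mask, and cluster network:
Get-ClusterResource "crag3_ip" | Set-ClusterParameter -Multiple @{
    "Address"    = "10.0.0.60"
    "SubnetMask" = "255.255.255.0"
    "Network"    = "Cluster Network 1"
}

# Make the network name depend on the new IP resource:
Set-ClusterResourceDependency -Resource "crag3" -Dependency "[crag3_ip]"

Can anyone confirm whether this is the right approach, or whether something is missing before the name will come online?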


got event: LocalEndpoint xxxxxx:~3343~ has missed two-fifth consecutive heartbeats from xxxxxx:~3343~ for 2016 server


Hello Team,

We are getting missed-heartbeat alerts. Below is an excerpt from the cluster log:

Node 1: Executing locally gumId: 11221, updates: 1, first action: /dm/update
00000b48.00002aa0::2019/08/18-10:08:56.164 INFO  [GUM] Node 1: Executing locally gumId: 11222, updates: 1, first action: /dm/update
00000b48.00002aa0::2019/08/18-10:08:56.183 INFO  [GUM] Node 1: Executing locally gumId: 11223, updates: 1, first action: /dm/update
00000b48.00002f98::2019/08/18-10:08:56.322 INFO  [GUM] Node 1: Executing locally gumId: 11224, updates: 1, first action: /dm/update
00000b48.00002f98::2019/08/18-10:08:56.352 INFO  [GUM] Node 1: Executing locally gumId: 11225, updates: 1, first action: /dm/update
00000b48.00001994::2019/08/18-10:09:02.319 INFO  [IM] got event: LocalEndpoint 137.201.104.205:~3343~ has missed two-fifth consecutive heartbeats from 137.201.104.206:~3343~

Cluster validation report:

The cluster network name xxxxxxx does not have Create Computer Objects permissions on the Organizational Unit OU=Servers,OU=Boise,OU=AMER,DC=na,DC=micron,DC=com. This can result in issues during the creation of additional network names in this OU.

    The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
    xxxxxxx
    xxxxx

00000b48.00001994::2019/08/18-10:09:02.319 INFO  [CHM] Received notification for two-fifth consecutive missed HBs to the remote endpoint 137.201.104.206:~3343~ from 137.201.104.205:~3343~

00000b48.00002aa0::2019/08/18-10:09:02.334 INFO  [CHM] My weights have changed from 0 to 0 0 110

Second server:

00002030.00002bc4::2019/08/18-10:09:08.931 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'io_subsystem' health state has been changed from 'clean' to 'warning' at 2019-08-18 04:09:08.577

00002030.00002bc4::2019/08/18-10:09:18.927 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'io_subsystem' health state has been changed from 'warning' to 'clean' at 2019-08-18 04:09:18.577

0000301c.00002e54::2019/08/18-10:09:26.324 INFO  [IM] Changing the state of adapters according to result: <class mscs::InterfaceResult>

Please let us know what kinds of situations cause these missed-heartbeat errors. The customer says it is not a network issue; the nodes run on VMware.
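For reference, this is how we have been inspecting the heartbeat settings on the cluster (the values in the comments are just the documented Windows Server 2016 defaults, for comparison, not a recommendation):

# Current heartbeat tuning:
Get-Cluster | Format-List *SubnetDelay,*SubnetThreshold
# 2016 defaults: SameSubnetDelay=1000 ms, SameSubnetThreshold=10,
#                CrossSubnetDelay=1000 ms, CrossSubnetThreshold=20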


swathi

Error validating cluster computer resource name (Server 2016 Datacenter Cluster)


    An error occurred while executing the test.
    The operation has failed. An error occurred while checking the Active Directory organizational unit for the cluster name resource.

    The parameter is incorrect

Interestingly enough, the cluster name was created successfully in the Computers OU, and the cluster can be taken offline and brought back online with no problem. The DNS entry is correct and the cluster name pings to the correct IP. Changing the name of the cluster updates the cluster computer name in AD with no errors.


VM on CSV goes Pause-Critical

Hi guys,

I hope some of you can enlighten some questions.

How many hosts/nodes do you recommend in a Hyper-V Cluster?

We are having some CSV issues: in some cases, moving a CSV to another host takes about 30 to 60 seconds, and the VMs on "that" CSV therefore go into "Paused/Critical" mode. The hosts are connected to our NetApp all-flash SSD array via Fibre Channel.
We have looked into the number of hosts in the clusters (the biggest cluster has 15 hosts, and the smallest cluster has 4). All hosts in a cluster have the same OS installed, the same patch level, and so on; some clusters are Server 2016 and others Server 2019.
It looks like it is something that develops over time (1+ month), and a reboot of all of the hosts (one at a time) seems to fix the Pause/Critical issue.
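Next time the pauses start, I plan to capture the CSV state on each host with something like the following, to see whether a volume has dropped into redirected access before the move:

# Shows, per node, whether each CSV is in direct or redirected mode and why:
Get-ClusterSharedVolumeState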

Please let me know if any of you have experienced something like this.

Best and kind regards.

Please remember to mark the replies as answers if they help and unmark them if they provide no help.

Scale out Cluster


I have a clustered application. Currently it writes a log on whichever host is active; if the cluster fails over, the log is created and updated on the new host.

Can I use a scale out cluster to host the log file?

I'm concerned about the comments warning against changing file metadata rapidly; the log will be written several times a second.

Thanks


Cluster quorum best practice for SQL AlwaysON


Hello,

I have a SQL Always On setup between an on-prem SQL Server and an Azure SQL Server (IaaS). It's a 2-node cluster with a file share witness configured for quorum (on one of the Azure servers). A site-to-site VPN is configured to facilitate all of this.

Of late, we see our primary DB going down due to instability in the cluster. We noticed that when our network goes down, the secondary node and the file share become inaccessible and the cluster is affected; as a result, all the primary DBs become inaccessible and the application has downtime.

This setup was primarily built for DR purposes, and accordingly we placed the quorum witness on the Azure side.

How can we avoid this situation? I assume it happens because the Azure node and the file share witness (also in Azure) become inaccessible at the same time whenever the network goes down.

Are there any configuration changes I can make so the on-prem node stays stable irrespective of the status of the secondary node?

What quorum method should I adopt for this kind of situation? Please help.
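One direction I am considering (node name is a placeholder) is taking the vote away from the Azure side so the on-prem node no longer depends on it for quorum, or replacing the file share witness; does this make sense?

# Remove the Azure node's quorum vote so on-prem survives a VPN outage
# (trade-off: the Azure side can then never form quorum on its own for DR):
(Get-ClusterNode -Name "AZNODE1").NodeWeight = 0
Get-ClusterNode | Format-Table Name,NodeWeight,State

# Or swap the file share witness for a cloud witness (2016+); account/key are placeholders:
Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<storage-key>"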

Windows Server 2016 Hyper-V cluster : Microsoft-Windows-Hyper-V-VMMS Event ID 20501


I found this warning on all Hyper-V cluster nodes.

The warning is generated every minute.

The description for Event ID 20501 from source Microsoft-Windows-Hyper-V-VMMS cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

%%2147943788
0x8007056C

The locale specific resource for the desired message is not present


CSV Autopause - Single client contification start.


Hi,

I've just got a warning from my cluster that one of my CSVs was stopped, but I just don't get what was going on.

From the FailoverClustering-CsvFs event log I get this message:

"Volume {44179469-89e8-4971-b9ff-057c4579c647} is autopaused. Status 0xC00000C4. Source: Single client contification start."

What does that even mean? 'Single client contification'?

Best Regards

Daniel

NLB Multicast Error


Hi all,

We have two servers running in a VMware (ESXi) environment. Each VM runs Windows 2016 Standard and has one static IP address. I installed and configured NLB on each server as per the Network Load Balancing documentation, using "Multicast" mode during the NLB configuration. I get the warning below when I open the NLB Manager.

Warning
---------------------------
Running NLB Manager on a system with all networks bound to NLB might not work as expected.
If all interfaces are set to run NLB in "unicast" mode, NLB Manager will fail to connect to hosts.
See "Help and Support Center" for unicast communication limitations.
---------------------------
OK   
---------------------------

I don't know why I get the above warning even though I set "multicast" in the NLB configuration. We would like to use "Multicast" mode instead of "Unicast". Do I need to configure something on the VMware ESXi side, or on the switch or router, for NLB to work in multicast mode? We are using a Cisco switch.

Best Regards,

Yukon


Make Simple & Easy

Cross subnet communication to Windows failover cluster


Hello. Hopefully this is the right place to post this. Our team is setting up a Windows failover cluster to be accessed from across subnets; initially this is in a test environment. The issue we are seeing is that clients on subnet A cannot talk to the cluster network object or the IP for the cluster name, which is on subnet B. The clients can communicate fine with each of the nodes directly, just not the cluster name/IP. Clients that are on the same subnet B can reach the cluster name/IP fine, just not cross-subnet. The switch connecting both subnets is a Cisco 2960G, and Windows Firewall is disabled as well.

Below is a brief diagram of the setup

Client (subnet A) <=> Cisco 2960G <=> Cluster (subnet B)

Since the clients can reach each host individually, and clients on the same subnet can access the cluster resources fine, I am leaning towards this requiring a feature/capability in the network to handle communication with a MAC/IP address that is managed by the cluster.

Can anyone here point me in the right direction, any assistance most appreciated.

Problem with virtual disk on 4 node cluster.

Hi Guys



I am going out of my mind. I've been struggling with this for days, unable to find anything that puts me on the right path.

My cluster was powered down while starting up, and that resulted in a virtual disk being stuck in an "Online Pending" -> "Failed" -> "Online Pending" loop. The cluster then tries to start it on another server, so it keeps bouncing around all 4 servers.



I have tried almost every article I could find. When running Get-StorageJob I have one job that keeps running:

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair True             00:01:25    Running  0               0              45097156608



It seems that every 2-3 minutes the job restarts. I am getting this info in the event log (sorry for the missing pics; I was not allowed to post them):

EventID: 1069

Cluster resource 'Cluster Virtual Disk (HyperVDisk1)' of type 'Physical Disk' in clustered role '96fd0e69-9c2d-41c0-92e3-09bdcd126686' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.



EventID: 5142

Cluster Shared Volume 'HyperVdisk1' ('Cluster Virtual Disk (HyperVDisk1)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.



EventID: 5142

Cluster Shared Volume 'HyperVdisk1' ('Cluster Virtual Disk (HyperVDisk1)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.



EventID: 1793

Cluster physical disk resource online failed.

Physical Disk resource name: Cluster Virtual Disk (HyperVDisk1)
Device Number: 5
Device Guid: {a75e8b5d-a226-4b0e-b6d4-cde8fffa4d1b}
Error Code: 5008
Additional reason: WaitForVolumeArrivalsFailure



EventID: 1795

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: Cluster Virtual Disk (HyperVDisk1)
Device Number: 5
Device Guid: {a75e8b5d-a226-4b0e-b6d4-cde8fffa4d1b}
Error Code: 1168



What I have tried:

This article from kreelbits: storage-spaces-direct-storage-jobs-hung



Tried Optimize-StoragePool and Repair-VirtualDisk with no success.



Found a great article from JTpedersen on troubleshooting-failed-virtualdisk-on-a-storage-spaces-direct-cluster



Every time, I tried to run:
Remove-ClusterSharedVolume -Name "Cluster Virtual Disk (HyperVDisk1)"

One time I got a message that the job failed because the disk was moving to another server (not the exact wording).

The normal response is that it just hangs on the command, and it has been doing that for 24+ hours.



To me it seems that the problem is that before any command can get hold of the disk, the cluster restarts the storage job and moves the disk to another server, restarting the loop.
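What I keep attempting, for reference (resource name exactly as in my events), is to hold the resource offline long enough for the repair job to finish instead of fighting the failover loop:

# Hold the disk resource offline so it stops bouncing between nodes:
Stop-ClusterResource "Cluster Virtual Disk (HyperVDisk1)"

# Then just watch the repair job:
Get-StorageJob | Where-Object JobState -eq "Running"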



Thanks in advance.



/Peter





Server 2016 cluster network traffic coming from host IP rather than role IP


Hello

I have two clustered 2016 VMs in a Hyper-V environment. Each VM is on a separate physical host.

Each VM has only one NIC. My cluster IPs are as follows:

172.18.1.113 ProductionIP - Role IP
172.18.1.114 Cluster IP
172.18.1.115 VM Host A
172.18.1.116 VM Host B

I've added the role IP address (172.18.1.113) to an IPsec tunnel on my firewall, but my firewall sees the traffic as coming from one of the two host IP addresses (.115 or .116). If I ping the remote end of the IPsec tunnel from either host A or B and source the ping from .113, the ping works, but by default it always takes the host IP and fails.

How do I get the cluster nodes to always send traffic from the role IP, no matter which node is active?
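The only lead I have found so far is the SkipAsSource flag: as I understand it, Windows adds cluster-managed IPs with SkipAsSource set, so they are never picked as an outbound source address, and clearing the flag should make the role IP usable as a source. I am not sure how well this survives failover, since the cluster re-registers the address on whichever node takes the role:

# On the node currently owning the role:
Get-NetIPAddress -IPAddress 172.18.1.113 | Set-NetIPAddress -SkipAsSource $false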

Thanks

Dan


Unable to add the node in multi subnet cluster


This is a 3-node production cluster (multi-subnet). It was working fine. I evicted the DR node and tried to add it back, and it throws an error.

I am having issues only with adding the DR node to this cluster; I am able to add the same DR node to my existing non-prod/dev cluster.

Failover Clustering Event Viewer entries:

Log Name:      Microsoft-Windows-FailoverClustering/Operational
Source:        Microsoft-Windows-FailoverClustering
Date:          5/09/2019 3:39:06 PM
Event ID:      1281
Task Category: Security Manager
Level:         Information
Keywords:      
User:          SYSTEM
Computer:      DRNODE.orionhealth.saas
Description:
Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='0' and Timeout ='40000' for the target = 'akl-shrd-pdb1'


Log Name:      Microsoft-Windows-FailoverClustering/Operational
Source:        Microsoft-Windows-FailoverClustering
Date:          5/09/2019 3:38:53 PM
Event ID:      1650
Task Category: Cluster Virtual Adapter
Level:         Information
Keywords:      
User:          SYSTEM
Computer:      DRNODE.orionhealth.saas
Description:
Microsoft Failover Cluster Virtual Adapter (NetFT) has missed more than 40 percent of consecutive heartbeats.

Local endpoint: 10.13.6.200:~3343~
Remote endpoint: 10.10.6.190:~3343~


Error in PowerShell:

The clustered role was not successfully created. For more information view the report file below.
Report file location: C:\Windows\cluster\Reports\Add Node Wizard 76cd451a-538a-4fbe-9c52-2f9498396d17 on 2019.09.05 At 15.35.42.htm
Add-ClusterNode : An error occurred while performing the operation.
    An error occurred while adding nodes to the cluster 'CLUST'.
    An error occurred while adding node 'NODE3' to cluster 'CLUST'.
    This operation returned because the timeout period expired



DR Cluster log:

00000918.00001528::2019/09/05-00:59:08.498 DBG   [NETFTAPI] Signaled NetftLocalConnect event for fe80::14:a91:8f79:5d8f
00000918.00001528::2019/09/05-00:59:08.498 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.00001230::2019/09/05-00:59:08.498 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.00001230::2019/09/05-00:59:08.498 DBG   [FTI][Initiator] Got Netft event Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.00001528::2019/09/05-00:59:08.498 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.0000096c::2019/09/05-00:59:08.498 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.0000096c::2019/09/05-00:59:08.498 INFO  [IM] got event: Local endpoint fe80::14:a91:8f79:5d8f:~0~ connected
00000918.00001528::2019/09/05-00:59:08.498 DBG   [WM] Filtering event NETFT_LOCAL_CONNECT? 1
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [NODE] Node 1: New join with n4: stage: 'Send Current Membership Status for Join Policy'
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [MM] Node 1: Adding a stream to existing node 4
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [NODE] Node 1: n4 node object adding stream
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [NODE] Node 1: n4 node object got a channel
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [NODE] Node 1: Using new stream to n4, setting epoch to 1
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [NODE] Node 1: Done closing stream to n4
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [NODE] Node 1: My Fault Tolerant Session Id is now 8d14d294-1635-416f-9e7c-44450c2a9cce
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [NODE] Node 1: No reconnect in progress to n4, updating send queue based on new stream.
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [NODE] Node 1: Treating stream with n4 as new connection because epoch (1) is <= 1.
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [MQ-Node1] Clearing 0 unsent and 0 unacknowledged messages.
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [NODE] Node 1: Highest version with n4 = Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008, lowest = Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00080003
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [NODE] Node 1: Done processing new stream to n4.
00000918.0000155c::2019/09/05-00:59:08.503 DBG   [CHANNEL 10.10.6.190:~3343~] Close().
00000918.000012f0::2019/09/05-00:59:08.503 INFO  [RGP] node 1: Node Connected 4 00000000000000000000000000000000000000000000000000000000000010010
00000918.000012f0::2019/09/05-00:59:08.503 INFO  [RGP] sending to node(4) 1: 001(1) => 001(1) +() -() [()] , ()
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [PULLER NODE1] Just about to start reading from <refcounted count='3' typeid='.?AVSimpleSecureStream@mscs_security@@'/>
00000918.0000155c::2019/09/05-00:59:08.503 INFO  [RGP] node 1: received new information from 4 starting the timer
00000918.00001528::2019/09/05-00:59:08.798 INFO  [RGP] node 1: Tick
00000918.00001528::2019/09/05-00:59:08.798 INFO  [RGP] node 1: selected partition 10903(3 4) as node 4 has quorum
00000918.00001528::2019/09/05-00:59:08.798 INFO  [RGP] node 1: selected partition 10903(3 4) to join [using info from 4]
00000918.00001528::2019/09/05-00:59:08.798 INFO  [RGP] node 1: cannot join yet. no connection to (3)
00000918.00001528::2019/09/05-00:59:08.798 INFO  [RGP] sending to all nodes 1: 001(1) => 001(1) +() -() [()] , ()
00000918.00001528::2019/09/05-00:59:08.798 DBG   [NODE] Node 1: eating message sent to the dead node 3
00000918.00000dbc::2019/09/05-00:59:08.798 INFO  [RGP] node 1: received new information from 1 starting the timer
00000918.00001528::2019/09/05-00:59:09.111 INFO  [RGP] node 1: Tick
00000918.00001528::2019/09/05-00:59:09.111 INFO  [RGP] node 1: selected partition 10903(3 4) as node 4 has quorum
00000918.00001528::2019/09/05-00:59:09.111 INFO  [RGP] node 1: selected partition 10903(3 4) to join [using info from 4]
00000918.00001528::2019/09/05-00:59:09.111 INFO  [RGP] node 1: cannot join yet. no connection to (3)
00000918.00001528::2019/09/05-00:59:09.111 INFO  [RGP] sending to all nodes 1: 001(1) => 001(1) +() -() [()] , ()
00000918.00001528::2019/09/05-00:59:09.111 DBG   [NODE] Node 1: eating message sent to the dead node 3
00000918.00001524::2019/09/05-00:59:10.507 DBG   [NETFTAPI] received NsiParameterNotification for 169.254.93.143 (IpDadStateInvalid)
00000918.000015a4::2019/09/05-00:59:10.507 DBG   [NETFTAPI] received NsiDeleteInstance for 169.254.93.143
00000918.000015a4::2019/09/05-00:59:10.507 WARN  [NETFTAPI] Failed to query parameters for 169.254.93.143 (status 0x80070490)
00000918.000015a4::2019/09/05-00:59:10.507 DBG   [NETFTAPI] Signaled NetftLocalAdd event for 169.254.93.143
00000918.000015a4::2019/09/05-00:59:10.507 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.93.143:~0~
00000918.000015a4::2019/09/05-00:59:10.507 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.93.143:~0~
00000918.000015a4::2019/09/05-00:59:10.507 DBG   [WM] Filtering event NETFT_LOCAL_ADD? 1
00000918.000015a4::2019/09/05-00:59:10.509 WARN  [NETFTAPI] Failed to query parameters for 169.254.93.143 (status 0x80070490)
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTAPI] Signaled NetftLocalRemove event for 169.254.93.143
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP remove event for IPv4 LinkLocal address 169.254.93.143:~0~
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP remove event for IPv4 LinkLocal address 169.254.93.143:~0~
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [WM] Filtering event NETFT_LOCAL_REMOVE? 1
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTAPI] received NsiParameterNotification for 169.254.1.68 (IpDadStatePreferred)
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTAPI] Signaled NetftLocalAdd event for 169.254.1.68
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] FTI NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.1.68:~0~
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] TM NetFT event handler ignoring PnP add event for IPv4 LinkLocal address 169.254.1.68:~0~
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [WM] Filtering event NETFT_LOCAL_ADD? 1
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTAPI] Signaled NetftLocalConnect event for 169.254.1.68
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] FTI NetFT event handler got event: Local endpoint 169.254.1.68:~0~ connected
00000918.00001230::2019/09/05-00:59:10.509 DBG   [NETFTEVM] FTI NetFT event dispatcher pushing event: Local endpoint 169.254.1.68:~0~ connected
00000918.00001230::2019/09/05-00:59:10.509 DBG   [FTI][Initiator] Got Netft event Local endpoint 169.254.1.68:~0~ connected
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [NETFTEVM] TM NetFT event handler got event: Local endpoint 169.254.1.68:~0~ connected
00000918.0000096c::2019/09/05-00:59:10.509 DBG   [NETFTEVM] TM NetFT event dispatcher pushing event: Local endpoint 169.254.1.68:~0~ connected
00000918.0000096c::2019/09/05-00:59:10.509 INFO  [IM] got event: Local endpoint 169.254.1.68:~0~ connected
00000918.000015a4::2019/09/05-00:59:10.509 DBG   [WM] Filtering event NETFT_LOCAL_CONNECT? 1
00000918.000015a4::2019/09/05-00:59:10.510 DBG   [NETFTAPI] received NsiAddInstance for fe80::5efe:169.254.1.68
00000918.000015a4::2019/09/05-00:59:10.510 DBG   [NETFTAPI] received NsiParameterNotification for fe80::5efe:169.254.1.68 (IpDadStateDeprecated)
00000918.0000152c::2019/09/05-00:59:17.174 DBG   [CORE] WriteVersionFunctor: beginning write attempts
00000918.00001530::2019/09/05-00:59:37.222 DBG   [NETFT] FTI NetFT event handler deregistration successful.
00000918.00001530::2019/09/05-00:59:37.222 INFO  [NODE] Node 1: New join with n3: stage: 'Wait for Heartbeats on Initial NetFT Route' status (1460) reason: '[FTI][Initiator] Aborting connection because NetFT route to node NODE2 on virtual IP fe80::35c4:f902:cbd4:33ef:~3343~ has failed to come up.'
00000918.00001530::2019/09/05-00:59:37.276 INFO  [CORE] Node 1: Clearing cookie e7920b13-f4cf-46bb-84ba-79562d7745d8
00000918.00001530::2019/09/05-00:59:37.276 INFO  [CORE] Node 1: Cookie Cache 465e1aa8-175f-4879-a473-0ad991998962 [NODE1]
00000918.00001530::2019/09/05-00:59:37.276 DBG   [CHANNEL 10.10.6.191:~3343~] Close().
00000918.00001530::2019/09/05-00:59:37.329 WARN  cxl::ConnectWorker::operator (): (1460)' because of '[FTI][Initiator] Aborting connection because NetFT route to node NODE2 on virtual IP fe80::35c4:f902:cbd4:33ef:~3343~ has failed to come up.'

00000918.00001530::2019/09/05-01:00:07.531 DBG   [JPM] Node 1: contacts size for node NODE2 is 1, current index 0
00000918.00001530::2019/09/05-01:00:07.531 DBG   [JPM] Node 1: Trying to connect to node NODE2 (IP: 10.10.6.191:~0~)
00000918.00001530::2019/09/05-01:00:07.531 DBG   [HM] Trying to connect to NODE2 at 10.10.6.191:~3343~
00000918.00001524::2019/09/05-01:00:07.547 INFO  [CONNECT] 10.10.6.191:~3343~: Established connection to remote endpoint 10.10.6.191:~3343~.
00000918.00001524::2019/09/05-01:00:07.547 INFO  [SV] New real route: local (10.13.6.200:~49794~) to remote NODE2 (10.10.6.191:~3343~).
00000918.00001524::2019/09/05-01:00:07.547 INFO  [SV] Got a new outgoing stream to NODE2 at 10.10.6.191:~3343~
00000918.00001524::2019/09/05-01:00:07.547 DBG   [SM] Joiner: Initialized with SPN = NODE2, RequiredCtxAttrib = 0, HandShakeTimeout = 40000
00000918.0000154c::2019/09/05-01:00:07.547 DBG   [SM] Handling auth handshake posted by thread id 5412
00000918.0000154c::2019/09/05-01:00:07.547 DBG   [SM] Joiner: Versions: 1-10
00000918.0000154c::2019/09/05-01:00:07.547 DBG   [SM] Joiner: ISC returned status = 590610 output Blob size 1723, service principal name HOST/NODE2, auth type MSG_AUTH_PACKAGE::KerberosAuth, attr: 83998
00000918.0000154c::2019/09/05-01:00:07.547 DBG   [SM] Joiner: Sending SSPI blob of size 1723 to Sponsor
00000918.0000154c::2019/09/05-01:00:07.563 DBG   [SM] Joiner: Switching to Schannel
00000918.00001524::2019/09/05-01:00:07.578 DBG   [Schannel] Client: Chosen Cert's version = 2, serialNo = <vector len='16'>
00000918.00001524::2019/09/05-01:00:07.735 INFO  [SV] Authentication and authorization were successful
00000918.00001524::2019/09/05-01:00:07.735 INFO  [VER] Got new TCP connection. Exchanging version data.
00000918.00001524::2019/09/05-01:00:07.735 DBG   [VER] Calculated cluster versions: highest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008], lowest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00080003] with exclude node list: (3)
00000918.00001524::2019/09/05-01:00:07.735 INFO  [VER] Checking version compatibility for node NODE2 id 3 with following versions: highest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008], lowest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00080003].
00000918.00001524::2019/09/05-01:00:07.735 INFO  [VER] Version check passed: node and cluster highest supported versions match. Other node still supports lower level, so joining in downlevel mode.
00000918.00001524::2019/09/05-01:00:07.735 INFO  mscs::VersionManagerAgent::IsCompatible: First run: setting CFL to 8.3 manually instead of looking for value in database
00000918.00001524::2019/09/05-01:00:07.735 DBG   [CORE-Dbg] IsCompatible: setting operating version to 8.3 on first run
00000918.00001524::2019/09/05-01:00:07.750 INFO  [SV] Negotiating message security level.
00000918.00001524::2019/09/05-01:00:07.750 INFO  [SV] Already protecting connection with message security level 'Sign'.
00000918.00001524::2019/09/05-01:00:07.750 INFO  [FTI] Got new raw TCP/IP connection.
00000918.00001524::2019/09/05-01:00:07.765 INFO  [FTI][Initiator] This node (1) is initiator
00000918.00001524::2019/09/05-01:00:07.765 DBG   [FTI][Initiator] Cookie for remote node is e7920b13-f4cf-46bb-84ba-79562d7745d8
00000918.00001524::2019/09/05-01:00:07.765 DBG   [FTI] Stream already exists to node 3: false
00000918.00001524::2019/09/05-01:00:07.783 INFO  [FTI][Initiator] Trying to select best endpoints among 169.254.1.68:~3343~, fe80::14:a91:8f79:5d8f:~3343~ (first pair) and 169.254.3.177:~3343~, fe80::35c4:f902:cbd4:33ef:~3343~ (second pair)
00000918.00001524::2019/09/05-01:00:07.785 INFO  [HM] Marking route from realLocal 10.13.6.200:~49794~ -> realRemote 10.10.6.191:~3343~ as a cross-subnet route
00000918.00001524::2019/09/05-01:00:07.785 INFO  [RouteDb] Route virtual fe80::14:a91:8f79:5d8f:~0~ to virtual fe80::35c4:f902:cbd4:33ef:~0~ added
00000918.00001524::2019/09/05-01:00:07.785 DBG   [NETFT] Removing route <struct mscs::FaultTolerantRoute>
00000918.00001524::2019/09/05-01:00:07.785 DBG     <realLocal>10.13.6.200:~3343~</realLocal>
00000918.00001524::2019/09/05-01:00:07.785 DBG     <realRemote>10.10.6.191:~3343~</realRemote>
00000918.00001524::2019/09/05-01:00:07.785 DBG     <virtualLocal>fe80::14:a91:8f79:5d8f:~0~</virtualLocal>
00000918.00001524::2019/09/05-01:00:07.785 DBG     <virtualRemote>fe80::35c4:f902:cbd4:33ef:~0~</virtualRemote>
00000918.00001524::2019/09/05-01:00:07.785 DBG     <Delay>1000</Delay>
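A sanity check I can run from the DR node before retrying the join (node name as in the log) is below; note the NetFT heartbeats themselves use UDP 3343, which Test-NetConnection cannot probe, so a clean TCP result does not rule out a UDP block on the VPN:

# TCP 3343 (cluster service) and TCP 445 (SMB) from the DR node to an existing node:
Test-NetConnection -ComputerName NODE2 -Port 3343
Test-NetConnection -ComputerName NODE2 -Port 445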


Charles Peter

Storage Spaces Direct NIC setup


We have a 3-node failover cluster, each node having 2 x 1 Gbps NICs that are teamed and used for management and VM traffic on the front end (the team is also a trunk carrying traffic for multiple VLANs). The team is a switch-dependent LACP setup and works fine.

The back-end storage is Storage Spaces Direct, and the NICs there were not set up by us. I am struggling to understand how they have been set up and how they are meant to work. The setup is:

2 x 10Gbps NICs per node but not teamed as far as I can tell

There are 2 storage vlans and each NIC is a trunk allowing both vlans.  

Each vlan is associated with a different IP subnet and NIC1 has an IP from one of the subnets with NIC2 having an IP from the other subnet. 

So basically  - 

NIC1 - allows vlans 10 and 11 with an IP of 192.168.3.1/27  from vlan 10

NIC2 - allows vlans 10 and 11 with an IP of 192.168.4.1/27  from vlan 11

Does this sound right to anybody? Coming from a network background, if it is a trunk I would expect a vSwitch somewhere, i.e. as on the front end. And I just don't understand how this is meant to work, as we have been told both storage NICs are active on each node.
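From what I can gather so far, each storage NIC is meant to be an independent SMB endpoint (no vSwitch or team), with SMB Multichannel using both paths in parallel; this is how I have been trying to verify that from one of the nodes:

# Should show two active connections per remote node, one per storage subnet:
Get-SmbMultichannelConnection

# Which interfaces SMB serves on, and what the cluster makes of each network:
Get-SmbServerNetworkInterface
Get-ClusterNetwork | Format-Table Name,Role,Address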

Any insights would be very much appreciated. 



Cluster Disk Drive Letter Fiasco

I have a 4-node cluster, and my cluster disk on node1 is assigned drive letter D:.

If I fail over from node1 to node2, the drive letter from node1 gets assigned to node2, and node2's own D: drive letter disappears.

I am a bit concerned: if I have something running on node2 on drive D:, will that get lost?

I know that in clustering you have to reserve drive letters so that no other node uses them, but what's the solution to this in 2019? I don't want to use CSV.
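The only alternative I can think of, short of CSV, is to mount the clustered disk into an empty NTFS folder instead of giving it a letter, so no letter can collide (disk/partition numbers and the path below are made up):

# Give the clustered partition a folder path instead of a drive letter:
Add-PartitionAccessPath -DiskNumber 5 -PartitionNumber 2 -AccessPath "C:\Mounts\Data"
Remove-PartitionAccessPath -DiskNumber 5 -PartitionNumber 2 -AccessPath "D:\"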



Thanks

SV


logon failure when accessing cluster file resources


I'm running a 4-node, nested-VM, hyper-converged failover cluster. On one of my nodes, I cannot access the shared cluster resources.

Logon failure: the user has not been granted the requested logon type at this computer

I have 2 volumes created with S2D across these four nodes; I get this error every time I try to access them by navigating File Explorer to c:\clusterstorage\volum.....


IT guy

Cluster Aware Update (CAU) on Storage Spaces Direct (S2D) with Pre-staged Virtual Cluster Object (VCO)

I have been running into a bug with CAU in RS1-14393 where it doesn’t accept the pre-staged AD object (it fails both as a PowerShell parameter and in the GUI config), and instead tries to generate/submit a new randomized AD object (example: CAU-81ea8e) to the domain controller to run the CAU from. The problem is, it doesn’t have permission on the AD domain controller (this is not my domain controller), and so it fails, but it still tries to use the CAU object even though it was not correctly created in AD.

Here’s the part I’m stuck on:

https://docs.microsoft.com/en-us/windows-server/failover-clustering/cluster-aware-updating-requirements#additional-recommendations

“To configure CAU in self-updating mode, a virtual computer object (VCO) for the CAU clustered role must be created in Active Directory. CAU can create this object automatically at the time that the CAU clustered role is added, if the failover cluster has sufficient permissions. However, because of the security policies in certain organizations, it may be necessary to prestage the object in Active Directory. For a procedure to do this, see Steps for prestaging an account for a clustered role.”

The cluster object and cluster group both have Full Control permissions on the VCO, but the cluster still insists on trying to create a new randomized cluster object when I try to set up CAU.
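For reference, this is the shape of the invocation that fails (cluster name, VCO name, and schedule are placeholders); per the docs, -VirtualComputerObjectName is how the prestaged object is supposed to be passed in:

Add-CauClusterRole -ClusterName S2DCLUS -VirtualComputerObjectName "CAU-S2D" `
    -DaysOfWeek Saturday -WeeksOfMonth 2 -Force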


I found the following technet article regarding CAU: https://social.technet.microsoft.com/Forums/windowsserver/en-US/a7a0d434-cd37-4592-a1f5-6d85ae4e1797/storage-spaces-direct-cluster-aware-updating-behaviour?forum=winserverfiles

This is the current procedure we use to run Windows Updates, which is all manual, per node: https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/maintain-servers. This procedure can take up to 2-3 weeks of manual work to patch the full 8-node cluster, waiting for CSV disk regeneration between each node reboot.

I’m still waiting for my IT organization to certify Server 2019 (RS5-17763) for production use, which is why I’m still using Server 2016 (RS1-14393) on all my S2D clusters that I am deploying, or I would upgrade to Server 2019 already.

If you have any additional data points you can share, or if you know of another forum who might use CAU and have some insight, I would be thankful for the assistance.

