SAN HPE SV3200 iScsiIPrt errors crashing VMs and after cascade also Fail Over Server Nodes?

January 9, 2019, 1:37 am

≫ Next: Migrate Sql 2008 Cluster

≪ Previous: Error Code: 0x8007054f when rename a SQL Server client access point of a SQL Server cluster

We have a three node W2012 R2 Fail over cluster that has been running spotless for years with the HPE P4300 SAN but after adding the HPE Storevirtual SV3200 as a new SAN we are having iScsiPrt errors that HPE Support cannot fix, crashing VMs and also two of the three fail over nodes.

At first everything seemed to work, but after adding additional disks on the SAN a SAN controller crashed. That has been replaced under warranty but now when moving our servers and especially SQL 2008 Servers to the SAN, problems start to occur. The VHDX volumes of the SQL servers are thin provisioned.

Live moving of the storage worked fine for none SQL servers. For some SQL servers the servers frooze and operation was halted, so we needed to perform an offline move. Then during high disk IO and especially during backups W2012 R2 FOC started to behave erratic eventually crashing VMs and in one instance rebooting two fail over nodes, as a result of a flood of iScsciPrt errors in the eventlog:

System iScsiPrt event ID 27 error Initiator could not find a match for the initiator task tag in the received PDU. Dump data contains the entire iSCSI header.
System iScsiPrt event 129 warning The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\RaidPort4

the message resource is present but the message is not found in the string/message table

System iScsiPrt event ID 39 error Initiator sent a task management command to reset the target. The target name is given in the dump data.
System iScsiPrt event ID 9 error Target did not respond in time for a SCSI request. The CDB is given in the dump data.
System iScsiPrt event 129 warning The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\RaidPort4

the message resource is present but the message is not found in the string/message table
System iScsiPrt event ID 27 error Initiator could not find a match for the initiator task tag in the received PDU. Dump data contains the entire iSCSI header.
System FailOverClustering event id 5121 Information Cluster Shared Volume 'Volume4' ('NEMCL01_CSV04') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

After a 2 hour period of these events the FailOver Cluster services started to give errors, VMs failed and finally 2 nodes of our 3 node failover cluster rebooted because of a crash.

Sofar HPE has not been able to fix this. The SV3200 logs has occasional ISCSI controller errors but the error logging in the SVMC is minimal.

HPE support blamed using a VIP and using Sites (a label). Both are supported according to the HPE product documentation. This has been removed and ISCSI initiator has been set to the Eth0 bond IP adresses directly. As problems persist they blamed that we are using the Lefthand DSM MPIO driver on the initiator connections to the SV3200 which is not the case. Standard MS DSM. Yes the Lefthand driver is on the system for our old SAN but not configured for the SV3200 initiator sessions, which is round robin with supset.

We are currently facing a legal warranty standoff.

Any pointers or other comparable experiences with the HPE Storevirtual SV3200 SAN?

TIA,

Fred

↧

Migrate Sql 2008 Cluster

September 26, 2019, 12:46 pm

≫ Next: Cluster network name resource failed registration of one or more associated DNS name(s) for the following reason: DNS bad key. Event ID 1196

≪ Previous: SAN HPE SV3200 iScsiIPrt errors crashing VMs and after cascade also Fail Over Server Nodes?

Hi Folks

I am kind of stuck with the below,

1) Procedure to migrate 2008 SQL Cluster VM (connected to Dell Equallogic 4120E ISCSI 1TB LUN) to SQL 2017 Hyper-v Cluster.

2) Procedure to migrate 2008 VM cluster to 2019 Hyper-v Cluster will be use Microsoft migration tool to migrate to 2012 and then from there perform online migration to 2019?

Appreciate if some one can help on this.

↧

Cluster network name resource failed registration of one or more associated DNS name(s) for the following reason: DNS bad key. Event ID 1196

August 13, 2018, 11:56 am

≫ Next: server 2012 Hyper-V CSV volume reporting incorrect space

≪ Previous: Migrate Sql 2008 Cluster

Hello,

We have a server 2016 failover cluster with 1 clustered role on it. In the event log of the cluster we are getting the following:

"Cluster network name resource "SQL Network Name (SQLSRVIT)" failed registration of one or more associated DNS name(s) for the following reason: DNS bad key."

I have checked the DNS server settings in the NIC and they are pointing to a valid working DNS server.

This error happens every 15 minutes.

What else can I check to try to fix this?

Thanks

James

↧

server 2012 Hyper-V CSV volume reporting incorrect space

September 1, 2016, 9:24 am

≫ Next: Drain role Failed

≪ Previous: Cluster network name resource failed registration of one or more associated DNS name(s) for the following reason: DNS bad key. Event ID 1196

I have a 3-node server 2012 Hyper-V cluster with 10 CSV's.

Our main CSV (3.5TB in size holding the majority of our VM's) seems to be increasing in used space even though there have been no VM's added or any changes to it.

A month ago this same CSV ran out of space so all the VM's ground to a halt.

This was most unexpected because, as I said above, there have been no changes to the CSV or the VM's contained within.
During this space outage I moved some VM's around and re-booted one off the nodes. suddenly the CSV free space jumped to 1.7TB free, which is what I expect but strange as what caused the free space to be used up.

now a month later the free space is decreasing again (now down from 1.7TB to 970GB). again there has been no changes to the VM's on the CSV

windirstat puts used pace for the volume at 1.8TB but looking at cluster properties in windows explorer the used space is 2.58TB.

I do not know why this is.

Anyone have any ideas other than reboot the nodes to see if this fixes?

thanks

↧

Drain role Failed

March 10, 2019, 4:01 am

≫ Next: netft.sys is the cause for the bugchk blue screen on the server Windows 2008 R2 Datacenter

≪ Previous: server 2012 Hyper-V CSV volume reporting incorrect space

We have three Node N-1, N-2, N-3. I drain role from N-2 and there10 VM's Moved out of 14. 4 VM are not moving getting error . Tried to move manually but the error same. Please assists All the Node in are WIN-2012 R2

Error Message : "operation did not complete on resource virtual machine live migration"

↧

netft.sys is the cause for the bugchk blue screen on the server Windows 2008 R2 Datacenter

March 8, 2013, 11:59 am

≫ Next: Different amounts of RAM in Hyper-V Hosts

≪ Previous: Drain role Failed

we have the server geting rebooted by a bugchk error for netft.sysPlease let me know if we have any fix for this issue. i am not sure wht is causing the issue on the server

the server is windows 2008 R2 Datacenter and it is on the HyperV cluster

Thanks in advance

↧

Different amounts of RAM in Hyper-V Hosts

October 2, 2019, 1:22 am

≫ Next: Cluster Aware Update

≪ Previous: netft.sys is the cause for the bugchk blue screen on the server Windows 2008 R2 Datacenter

Hi there,

I have a client with a Windows Server 2016 Hyper-V failover cluster consisting of 2 DL 380's with 1024 GB RAM on each.

The client is running out of CPU resources and is considering buying a new server (another DL 380) to join to the cluster. Is it necessary to have the same amount of RAM (1024 GB) on the new host or can we install less RAM?

Will installing less RAM on the additional cause the cluster validation wizard to fail and will this configuration be supported by Microsoft? I cannot seem to find any official guidance.

Thanks.

↧

Cluster Aware Update

October 2, 2019, 5:29 am

≫ Next: Error applying Replication Configuration Windows Server 2019 Hyper-V Replica Broker

≪ Previous: Different amounts of RAM in Hyper-V Hosts

Hi,

I have Windows Server 2012 R2 Cluster having 3 nodes and 15 to 15 VMs over Hyper-V Cluster, Normally for Windows update we use local WSUS. Firstly we download update on each cluster machine, install updates and reboot if required and repeat same procedure step by step for each cluster node.

Can i use CLUSTER AWARE UPDATE mechanism to update my Cluster Nodes, please note that i install security updates, update roll-ups and etc.

Please comments

↧

Error applying Replication Configuration Windows Server 2019 Hyper-V Replica Broker

August 9, 2019, 2:48 am

≫ Next: if resources fails, attempt restart on current node

≪ Previous: Cluster Aware Update

Hello,

Recently we started replacing our Windows Server 2016 Hyper-V Clusters for Server 2019. On each cluster we have a Hyper-V Replica broker that allows replication from any authenticated server and stores the Replica Files to a default location of one of the Cluster Shared Volumes.

With WS2019 we run into the issue where we get an error applying the Replication Configuration settings. The error is as follows:
Error applying Replication Configuration changes. Unable to open specified location for replication storage. Failed to add authorization entry. Unable to open specified location to store Replica files 'C:\ClusterStorage\volume1\'. Error: 0x80070057 (One or more arguments are invalid).

When we target the default location to a CSV where the owner node is the same as the owner node for the Broker role we don't get this error. However I don't expect this to work in production (moving roles to other nodes).

Did anyone ran into the same issue, and what might be a solution for this? Did anything changed between WS2016 & WS2019 what might cause this?

Kind regards,

Malcolm

↧

if resources fails, attempt restart on current node

October 3, 2019, 5:38 am

≫ Next: WSSD vs. Azure Stack HCI certification

≪ Previous: Error applying Replication Configuration Windows Server 2019 Hyper-V Replica Broker

Period For Restarts

Maximum Restarts in the specified period

I am struggling to find anything that explains what this functionality means.

If I set the maximum restarts to 3, then does the cluster try to start the affected service 3 times before failing over? Do these 3 restarts happen immediately after each other, or is there some wait time built in?

How does the Period for restarts impact on the activities?

↧

WSSD vs. Azure Stack HCI certification

October 4, 2019, 11:28 am

≫ Next: Rebooting Server 2016 SQL Failover Cluster Node results in Blue Screen 0x0D1 after trying to recover cluster state upon booting up

≪ Previous: if resources fails, attempt restart on current node

A team member and I are having a debate. We want to know if it is "safe" to use the very recently released Lenovo SR635 or SR655 EPYC based servers, for building our own Win2019 Storage Spaces Direct cluster (all cluster components will be Windows certified).

The servers are listed in the Windows Server Catalog as Win2019 with Software-Defined Data Center (SDDC) Premium certification (SR635, SR655).

They are not listed in in the Azure Stack HCI Catalog.

He firmly believes that the systems needs to be in Azure Stack HCI catalog, in order to proceed

based on this PDF from Lenovo Certified Config for Microsoft S2D

I believe that we can use the servers

The S2D Hardware Requirements page, used to state of that only Software-Defined Data Center (SDDC) certification is required(this changed in August ;-[).
I look at the Lenovo doc as a list of configs that Lenovo will support (FYI, these servers were they released after the PDF was published)
The PDF is not a list of systems that can used for S2D, if we are the one supporting the cluster/solution.

So, which of us I "right"?

Regardless of "right", would you proceed anyways?

↧

Rebooting Server 2016 SQL Failover Cluster Node results in Blue Screen 0x0D1 after trying to recover cluster state upon booting up

October 5, 2019, 10:16 am

≫ Next: SQL Server Virtual Machine Live Migration Too Slow

≪ Previous: WSSD vs. Azure Stack HCI certification

Hello,

I have an odd one. While a node is live, without draining or removing from the cluster we do the following:

1. Reboot it

2. Upon coming back up, sign in

3. Within a minute itll bluescreen

4. Boot back up, sign in, everything is fine

The dump shows ntoskrnl.exe DRIVER_IRQL_LESS_THAN_OR_EQUAL 0x000000D1

If you check the cluster operational, youll see it start some GUM Process with GrantLock, Process Request lock. This happens over and over until it bluescreens. Subsequent reboot from bluescreen shows GUM but it only does the "processing locally". Events below:

Preceeding Bluescreen(these repeated over and over and were even suppressed per application log):

[GUM] Node 2: Processing RequestLock 4:595
[GUM] Node 2: Processing GrantLock to 4 (sent by 5 gumid: 20121)

Post Bluescreen (note these still showed pre-bluescreen above but rarely):

[GUM]Node 2: Executing locally gumId: 20121, updates: 1, first action: /dm/update

Before the bluescreen in the event viewer the following happens with the NIC. Keep in mind this NIC is apart of a team. 2 of the 4 team members are down (waiting to be plugged in if the others die) and 2 are live. This team is handled by the OS in Server Manager. We are using Intel drivers not system drivers. Latest.

Reboot - 9:51
Kernel Power Hardware Notifications upon boot up
Connectivity state in standby: Disconnected, Reason: NIC compliance - 9:54

both adapters come online - 9:54
Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.

and

Intel® Ethernet 10G 2P X520 Adapter #2
Network link has been established at 10Gbps full duplex.

===============================

NIC report disconnected

Intel® Ethernet 10G 2P X520 Adapter
Network link is disconnected.

Intel® Ethernet 10G 4P X520/I350 rNDC #2
Network link is disconnected.

MsLbfoSys

Member Nic {30793b81-07bd-4afe-85f6-6dd873581384} Connected.

NIC Disconnects again

Intel® Ethernet 10G 4P X520/I350 rNDC
Network link is disconnected.

NICs reconnect

Intel® Ethernet 10G 4P X520/I350 rNDC
Network link has been established at 10Gbps full duplex.

MsLbfoSys

Member Nic {7947a925-563e-4bf8-b3c6-73c46ef2d4ed} Connected.

DNS Resolution and Domain Resolution fail - 9:55

lphplsvc reports that network is coming up - 9:55

At this point you can sign into the server and shortly there after itll RDP. I have not yet tested it but I believe it will also bluescreen without signing in, im just relaying the recent event. This doesnt happen everytime but is a 50/50. Ill test in my lab this coming week to reproduce. Anything additional I should capture?

I see a hotfix for this issue 0x0D1 for server 2012 but this 2016. I have a feeling that the Network coming up causes Windows or the Cluster to grab the address space for the driver and then the opposite one tries for it upon network recovery above but it fails to release the address space. I am assuming the cluster is snagging it then windows is trying after, thus ntokrnl.exe being at fault.

Any input would be great, this is an odd one and im hoping to track it down. I understand that delaying the startup of SQL services might be a suggestion but I mixed reviews on doing that and being that it seems like cluster activity not so much SQL, im wondering if that is even an option here.

↧

SQL Server Virtual Machine Live Migration Too Slow

September 16, 2019, 12:32 pm

≫ Next: Firewall ports Failover Clustering in Server 2016

≪ Previous: Rebooting Server 2016 SQL Failover Cluster Node results in Blue Screen 0x0D1 after trying to recover cluster state upon booting up

Hi, I have a two node, Hyper-V Cluster with about a dozen VMs. As time went by migrating between nodes have slowed. Specifically a SQL Server with about 300 GB disk takes about four minutes to move from one node two the other, and during the process I see it's status as "shutting down" making it offline for a noticeable while.

The servers are Dell PowerEdge 530s with a Compellent (Dell) SC2020 with low latency. I recently connect to 10 GB NIS together peer to peer as a cluster only network in hopes to reduce migration time. It did not seem to make any difference.

Anyone have ideas on why the migration is slow and how to speed it up?

Thanks,

Ken

↧

Firewall ports Failover Clustering in Server 2016

May 30, 2017, 8:58 am

≫ Next: Multiple SQL Cluster (2008, 2012, 2016) on a Single Windows Cluster

≪ Previous: SQL Server Virtual Machine Live Migration Too Slow

Hello - I'm configuring MS Failover Cluster across two datacenters with different IP Ranges using server 2016. What firewall ports are needed to setup two nodes cluster and witness file share ?

Thanks

↧

Multiple SQL Cluster (2008, 2012, 2016) on a Single Windows Cluster

September 12, 2019, 2:19 am

≫ Next: Microsoft Network Load Balancing not working as expected

≪ Previous: Firewall ports Failover Clustering in Server 2016

Hi Experts,

Can I have multiple SQL Cluster configured on a single environment (Windows Cluster?

For example - I have a 5 Node Windows 2012 R2 Windows Failover Cluster on which I want to configure 2 nodes to run as SQL 2008 Cluster, 4 nodes as SQL 2012 Cluster & 2 Nodes with 2016 Cluster.

The combination of nodes can be in any number (2 - SQL 2012, 4 - SQL 2008 or 5 - SQL 2016) but underlying I have only 5 nodes that will form the cluster. So, a single CNO (of 5 nodes) under which there will be multiple SQL Clusters (of 2, 3, 4 or 5 Nodes).

Is that possible? Above all, is that a recommended & supported scenario?

Thanks!

↧

Microsoft Network Load Balancing not working as expected

September 18, 2019, 2:45 am

≫ Next: Storage Space Direct - Storage Jobs taking too long

≪ Previous: Multiple SQL Cluster (2008, 2012, 2016) on a Single Windows Cluster

I wish to have a failover cluster for an IIS site in my domain.
I have configured the cluster on port 80, however only once the network of that specific node is down will the cluster detect that node is down.
If I stop the site through IIS manager that node is still considered healthy.
What am I doing wrong? Is this what do the product supposed to do? If not what other product can help me?

↧

Storage Space Direct - Storage Jobs taking too long

September 16, 2019, 1:12 pm

≫ Next: NLB - only one hosts gets hit

≪ Previous: Microsoft Network Load Balancing not working as expected

Hi,

I have four node S2D cluster with 2019 OS. This is lab environment with SSD+HDD disks and 10Gbps network. Cluster is working fine, I have great performances but there is one issue that bothers me.

Every time I patch nodes, and I restart one of the nodes I have to wait for Storage Jobs to finish (so that data is synced). I know that I have to wait this out and I monitor this with get-storagejob command. When jobs finish, I continue updating other server nodes. What bothers me is time I have to wait for these jobs to finish. They can take more than 60-80min.

Is there any way to speed this process up? I see that network and disks are not utilized fully when storage jobs are being done. I found online that this is intentionally throttled by Microsoft not to mess with VM performances. However, since this is Lab environment, I would love to speed this process up if possible, because in this way, patching takes to much time...

BR,

Soko

↧

NLB - only one hosts gets hit

October 8, 2019, 3:07 am

≫ Next: Stuck with Redirected Access?

≪ Previous: Storage Space Direct - Storage Jobs taking too long

Hello,

I'm struggling very hard to set up a VPN solution with NLB enabled using 2x Server 2k16.

For some reason, after I create the cluster, the VPN Client can only connect to one of the servers. If I stop it, the connection is not possible. The cluster is in Multicast mode and I have confirmed with the network team that on the perimeter firewall the MAC address of the cluster is in the ARP table with the correct cluster IP.

I've already tried stripping down and re-creating the cluster but always get the same results.

Does anyone have any ideas what I need to check?

Kind regards,

Wojciech

↧

Stuck with Redirected Access?

April 1, 2016, 9:42 am

≫ Next: Cluster events not written to system event log

≪ Previous: NLB - only one hosts gets hit

I'm running a HyperV Cluster on 2012 R2.

Each node is connected to the "network" with 2 nics (LACP) and to the ISCSINetwork with 4 nics, using MPIO. (usually)
For one node, the 4-nic-card just died, so we temporary replaced it with a single-link-card.

The node is correctly connected to the ISCSI-Target again.

However, the node "refuses" to use that single-link to talk with the ISCSI-Target. Access is always redirected through the "network" and the node with 4 active nics.

Is it possible, that - due to the LACP connection says "2 Gbit" - the cluster is "thinking" that it's faster to redirect CSV-Traffig to node1 and then use the 4 nic ISCSI over there?

Is there a way to "force" node 2 utilizing the "single-Link" instead?

ps.: The CSV is reportind redirected Access as DISABLED.

↧

Cluster events not written to system event log

May 5, 2011, 11:42 am

≫ Next: SCOM2016 monitoring S2D General File Server

≪ Previous: Stuck with Redirected Access?

Hi,

I have a Windows Server 2008 R2 cluster that is not writing cluster events to the system event log. When I trigger a failover event, the failover happens successfully but nothing is logged to the system event log.

Is there a way that I can fix this?

Thanks,

Howard

↧