Channel: High Availability (Clustering) forum

Scale Out File Server SMB redirection locking up CSVs


Problem - Physical hosts run HyperV, with a vhdx located on a SOFS CSV (the HyperV hosts are not SOFS cluster nodes). The CSV locks up either during VM start up, when SMB redirection occurs, or when I try to move a CSV between cluster nodes while it has an active SMB connection.

All physical hosts and VMs are Windows 2012 R2 with updates to ~July 2016
All physical hosts are Cisco C220s with latest OS updates and 1 update behind on firmware
SOFS is a two physical node cluster with SAS connected JBOD
4 CSVs exist, all exhibiting the same issue
SOFS cluster nodes have the below networks:
Mgmt - teamed 10G - no cluster use
cluster0 - single 10G nic - cluster only
cluster1 - single 10G nic - cluster only
SOFS0 - single 10G nic - cluster/client
SOFS1 - single 10G nic - cluster/client (currently set to none for troubleshooting)
Backup - Teamed 10G - no cluster use
LiveMigration - Teamed 10G - no cluster use/only network for live migrations
Cluster validation runs clean
When nothing is connected to the CSV shares I can fail CSVs and SOFS role without any errors
Currently each CSV is used by a single HyperV server and has a single vhdx in it.

HyperV host networks
SOFS0 - single 10g nic
SOFS1 - single 10g nic
Backup Team
Mgmt Team
Customer Network Team

I believe the two problems are related.
Problem 1)
CSV share is owned by SOFSA
When I boot a VM with a secondary vhdx located on the SOFS (the OS is on a local RAID disk), the SMBClient logs on the HyperV host and the SMBServer logs on the SOFS hosts show:
HyperV host hits SOFSB.  
HyperV host connects and share is seen as asymmetric/continuous availability transfer.  Witness registration completes.  
SOFSB issues redirect to SOFSA.  
HyperV host gets redirection request and establishes connection to SOFSA (4 event log messages, SMB client reconnect, session reconnect, share reconnect and witness registration). 
In the same second as those 4 SMB reconnect messages, but last in sequence (so the 5th message), a message arrives to redirect to another cluster node.
The HyperV host loses the session and share during the reconnect. The SMB client logs a successful move, but there are no messages for the session or share reconnect.
After 59 seconds, SOFSA logs errors that the re-open failed (event ID 1016) because the client session expired.
After 60 seconds, the HyperV host registers a request timeout due to no response from the server (event ID 30809). The server is responding to TCP but not SMB.
The HyperV host then immediately registers a connection to SOFSB for the share and goes through the same redirection sequence to SOFSA (which owns the share). SMB client reconnect, session reconnect, share reconnect and witness registration are all successful.
2 seconds later, SOFSA logs a re-open failure, the file is temporarily unavailable (event ID 1016). The source, destination and share in the event match what is occurring. The error then repeats every 5 seconds.
If I try to 'inspect' the drive from HyperV it times out, and on SOFSA I get a warning (event ID 30805) that the client lost its session - Error {Network Name Not Found} - The specified share name cannot be found, share name \\SOFSClusterName\IPC$
From then on the same errors repeat: client established session to server, then client lost session to server with network name not found for server \\SOFSClusterName. Each connect/disconnect pair carries the same session ID.
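
Since the sequence above comes from reading two sets of logs side by side (SMBClient on the HyperV host, SMBServer on the SOFS nodes), it can help to merge them into a single timeline. Below is a minimal Python sketch of that idea. The record layout, the `merge_timelines` name and the base timestamp are assumptions made for illustration; only the event IDs and the 59/60 second offsets come from the description above.

```python
import heapq
from datetime import datetime, timedelta


def merge_timelines(*logs):
    """Merge per-host event lists (each already sorted by time) into one timeline.

    Each record is a tuple: (timestamp, host, event_id, message).
    """
    # Sort on the timestamp only, so records with a missing event ID still merge.
    return list(heapq.merge(*logs, key=lambda rec: rec[0]))


# Placeholder base time; only the relative offsets are taken from the post.
t0 = datetime(2016, 1, 1, 12, 0, 0)

client_log = [
    (t0, "HyperV", None, "redirect to another cluster node received"),
    (t0 + timedelta(seconds=60), "HyperV", 30809, "request timed out"),
]
server_log = [
    (t0 + timedelta(seconds=59), "SOFSA", 1016, "re-open failed, client session expired"),
]

timeline = merge_timelines(client_log, server_log)
for ts, host, event_id, message in timeline:
    offset = (ts - t0).total_seconds()
    print(f"+{offset:.0f}s {host} {event_id} {message}")
```

In practice the input lists would be built from exported event logs rather than typed in by hand, and the hosts' clocks would need to be in sync for the ordering to mean anything.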

Now the great part - 
If I go into failover cluster (FOC) and try to move the CSV to the other node, the CSV gets stuck in pending offline. After a few minutes any other CSVs owned by the same node go into pending offline and hang. I can reboot and wait 10 minutes for it to finally die and fail over, or wait 20 for FOC to completely die on both nodes of the cluster. In the cluster logs, the SOFS node never fully releases the CSV to move. The last messages you will see related to the volume are:
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 4 to 2.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 2. Reson 7; Status 0x0.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 2 to 1.

Normally you see:
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 4 to 2.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 2. Reson 7; Status 0x0.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 2 to 1.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 1. Reson 5; Status 0x0.
Volume4; Volume target path \??\GLOBALROOT\Device\Harddisk39\ClusterPartition1; File System target path \??\GLOBALROOT\Device\Harddisk39\ClusterPartition1.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 1 to SetDownlevel. Local true; Flags 0x1; CountersName
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 3. Reson 3; Status 0x0.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 3 to 4.
Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 4. Reson 4; Status 0x0.
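
The difference between the two excerpts is that the hung case ends on a "transitioning from 2 to 1" line with no "moved to state" line after it. A rough Python sketch for spotting that in a cluster log follows. The regular expressions are based only on the lines quoted above, and `find_pending_transitions` is a made-up helper name, not part of any cluster tooling.

```python
import re

# Patterns modelled on the log lines quoted in this post.
TRANSITION = re.compile(r"Volume (\{[0-9a-fA-F-]+\}) transitioning from (\w+) to (\w+)")
MOVED = re.compile(r"Volume (\{[0-9a-fA-F-]+\}) moved to state (\w+)")


def find_pending_transitions(lines):
    """Return {volume_id: (from_state, to_state)} for transitions never confirmed.

    Any later "moved to state" line for the same volume clears the pending
    entry, because the confirmed state is not always named like the target
    (for example "1 to SetDownlevel" is followed by "moved to state 3").
    """
    pending = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            volume, src, dst = match.groups()
            pending[volume] = (src, dst)
            continue
        match = MOVED.search(line)
        if match:
            pending.pop(match.group(1), None)
    return pending


hung = [
    "Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 4 to 2.",
    "Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved to state 2. Reson 7; Status 0x0.",
    "Volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning from 2 to 1.",
]
print(find_pending_transitions(hung))
```

Run against the hung excerpt, this should report the 2 to 1 transition as still pending; run against the normal sequence, it should report nothing.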

The issue is consistent across all 4 CSVs I have. I believe it has always existed. If I get the HyperV hosts lined up right so they initially hit the SOFS server that owns the CSV, everything boots up fine. When they don't, the VMs and FOC hang, I have to go through reboots, the VMs lose their drives and I have to reboot those as well. It is only when a host gets redirected to a different SOFS server that the issue comes up, which leads me to the next problem.

Problem 2)
Assume all the VMs connected to the right SOFS CSV owner on boot and everything has been running fine for days/weeks/months (yes, this has been sitting around for a while as an unresolved problem). If I try to move a CSV for SOFS maintenance purposes, the CSV hangs in offline pending. Eventually the FOC hangs and I have to spend 2 hours getting things lined up right (after I do whatever I was planning on doing) so the VMs boot.

Things done/verified
Windows firewall is off
I've turned off IPv6
Removed teaming from all nodes on the SOFS0/1 and cluster0/1 networks (these used to be Windows teams rather than individual networks)
Turned off client/network access from SOFS1 network
Turned off the CSV balancer - in hindsight, things don't work without it because of CSV redirection on asymmetric storage
Updated permissions for the SOFS share to include the HyperV host and SOFS cluster nodes - didn't make any difference/never see access denied errors

One item I see and don't understand: on the SOFS cluster nodes, the SMBClient/Connectivity logs show network connection failures to the cluster addresses:

The network connection failed.
Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.
Server name: fe80::98f9:c138:xxxxx%32
Server address: x.x.x.x:445
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.

The server name is the 'Tunnel adapter Local Area Connection* 12:' on the other SOFS cluster node. So SOFSA is generating errors connecting to SOFSB, and SOFSB is generating errors connecting to SOFSA. This was occurring before and after the cluster0/1 network interfaces were teamed
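
Because the event guidance points at the transport and at port 445, one quick check is whether a plain TCP connection to that port succeeds from each node. Here is a small Python sketch for that. `tcp_port_open` is a hypothetical helper name, and the check only proves TCP reachability: it says nothing about whether SMB itself answers, which is exactly the "responding to TCP but not SMB" case described earlier.

```python
import socket


def tcp_port_open(host, port=445, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, calling `tcp_port_open("x.x.x.x")` from SOFSA with SOFSB's cluster address filled in (and the reverse from SOFSB) would show whether the failures are at the TCP level at all.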



Thanks-








