I followed the instructions on the Microsoft Blog - Failover Clustering Sets for Start Ordering, and it works great for starting the VMs in the correct order. However, it causes problems with live migration.
The problem is best described with an easy-to-reproduce example:
Setup
- Create 3 VMs (DC-Server, DB-Server, Web-Server). No OS needs to be installed on them for this test.
- Create 3 ClusterGroupSets: DC-Set, DB-Set, Web-Set
- Add the appropriate VM to each set
- Make the Web-Set dependent on the DB-Set.
- Make the DB-Set dependent on the DC-Set (a PowerShell sketch of this setup follows the list).
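For reference, the setup above is roughly the following in PowerShell (a minimal sketch, run from one of the cluster nodes; the set and VM role names match the list, everything else is left at defaults):

    # Assumes the FailoverClusters module and that the three VM roles already exist in the cluster.
    Import-Module FailoverClusters

    # One set per tier.
    New-ClusterGroupSet -Name "DC-Set"
    New-ClusterGroupSet -Name "DB-Set"
    New-ClusterGroupSet -Name "Web-Set"

    # Put each VM role into its set.
    Add-ClusterGroupToSet -Name "DC-Set"  -Group "DC-Server"
    Add-ClusterGroupToSet -Name "DB-Set"  -Group "DB-Server"
    Add-ClusterGroupToSet -Name "Web-Set" -Group "Web-Server"

    # Dependencies: Web-Set depends on DB-Set, which depends on DC-Set.
    Add-ClusterGroupSetDependency -Name "DB-Set"  -Provider "DC-Set"
    Add-ClusterGroupSetDependency -Name "Web-Set" -Provider "DB-Set"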
Test
- Start the Web-Server VM (see the sketch after this list).
- Web-Server will change to "Starting", and DC-Server will start.
- 20 seconds later, DB-Server will start.
- 20 seconds later, Web-Server will start.
- Good!
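Scripted, the test is roughly this: start only the top-level VM role, then poll the three roles to watch the start order and the ~20-second gaps (the timings above came from watching output like this):

    # Start only the top-level VM role; the sets should bring up DC-Server first,
    # then DB-Server, then Web-Server.
    Start-ClusterGroup -Name "Web-Server"

    # Poll the three roles for a couple of minutes to watch the start order.
    1..12 | ForEach-Object {
        Get-ClusterGroup |
            Where-Object { $_.Name -in "DC-Server", "DB-Server", "Web-Server" } |
            Select-Object Name, State |
            Format-Table -AutoSize
        Start-Sleep -Seconds 10
    }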
Problem
- Select the three running VMs and live migrate them (see the sketch after this list).
- DC-Server will migrate immediately and correctly.
- DB-Server and Web-Server will get stuck at 50-80%. While stuck, the VMs are not on the network.
- Sometimes, after 5+ minutes, the migration will complete. But it usually errors out.
- You can manually cancel the live migration, but there is still an outage.
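Kicking the three migrations off from PowerShell reproduces it the same way as the Failover Cluster Manager multi-select; something like the sketch below, where "Node2" is just a placeholder for the target node (and, as far as I know, -Wait 0 just queues the moves without blocking):

    # Live migrate all three roles at (roughly) the same time, like multi-selecting them in the GUI.
    "DC-Server", "DB-Server", "Web-Server" | ForEach-Object {
        Move-ClusterVirtualMachineRole -Name $_ -Node "Node2" -MigrationType Live -Wait 0
    }

    # Cancelling a stuck migration by hand (still causes an outage):
    Move-ClusterVirtualMachineRole -Name "Web-Server" -Cancel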
We first noticed this problem when we paused a cluster node to perform maintenance: the pause (drain) failed. The same issue causes CAU to fail when it drains nodes.
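Concretely, this is the maintenance step that fails (the node name is a placeholder):

    # Pause the node and drain its roles off; with the group sets in place the drain gets stuck
    # on the dependent VMs, so the pause never completes cleanly.
    Suspend-ClusterNode -Name "Node1" -Drain

    # After maintenance:
    Resume-ClusterNode -Name "Node1"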
If you manually live migrate Web-Server, then DB-Server, then DC-Server, it works.
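That is, migrating one VM at a time in reverse dependency order (dependents first, providers last) works; roughly:

    # Workaround: sequential live migrations, dependents before providers. "Node2" is a placeholder.
    Move-ClusterVirtualMachineRole -Name "Web-Server" -Node "Node2" -MigrationType Live
    Move-ClusterVirtualMachineRole -Name "DB-Server"  -Node "Node2" -MigrationType Live
    Move-ClusterVirtualMachineRole -Name "DC-Server"  -Node "Node2" -MigrationType Live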
My Best Guess
When the DC-Server VM is live migrated, it appears to be treated as a "fresh startup", and the StartupDelay causes the dependent sets to hold their VMs for 20 seconds. Hence DB-Server and Web-Server get stuck near the end of their migrations. I don't know why they don't simply resume migrating after the 20 seconds elapse, though.
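The 20 seconds matches the sets' delay settings on my cluster. You can at least see (and presumably lower) them, though that would only shorten the stall and doesn't explain the migrations that error out:

    # Show each set's startup delay settings; on my cluster StartupDelay is 20 and
    # StartupDelayTrigger is "Online", which I believe are the defaults.
    Get-ClusterGroupSet | Select-Object Name, ProviderNames, StartupDelayTrigger, StartupDelay

    # Presumably the stall can be shortened (not fixed) by lowering the delay, e.g.:
    Set-ClusterGroupSet -Name "DC-Set" -StartupDelay 5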
Summary
Cluster Group Sets are a must-have feature, but so is live migration. I need both to work.
-Tony