Microsoft Clustering Services (MSCS) is one of the oldest HA solutions in our IT world and one of the hardest to configure. Although I don't have deep personal experience with MS Failover Clustering, I know the severe pain of deploying, testing and troubleshooting this solution. Microsoft has developed this solution considerably since its first version. The versions available now are MS Clustering Service on Windows 2008 R2, Windows 2012 and Windows 2012 R2. With vSphere 5.x, MSCS can now be virtualized, and it's fully supported by Microsoft.
In this part, we'll talk about MSCS on Windows 2012 and the best practices for deploying it in vSphere 5.x environments. These best practices are collected from the published VMware best practices guides and the Microsoft best practices guide for MSCS on Windows 2012. I followed the same style as the previous post and divided them into six categories, i.e. the design qualifiers (AMPRS: Availability, Manageability, Performance, Recoverability and Security) plus Scalability.
Availability:
1-) Use vSphere HA with MSCS to provide an additional level of availability to your protected application.
2-) Use vSphere DRS in Partially Automated mode with MSCS so that clustered VMs only get automatic placement when they are powered on. Clustered VMs use SCSI bus sharing, which means they must not be migrated with vMotion, so fully automated DRS load balancing can't be used for them. If the vSphere cluster hosting the clustered VMs is configured with Fully Automated DRS, change the VM-level DRS automation setting of those VMs to Partially Automated.
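For illustration, here is a minimal pyvmomi sketch of overriding the DRS automation level for a single clustered VM. The function and variable names are mine, and it assumes you have already connected to vCenter and looked up the cluster and VM objects:

```python
# Sketch: override the DRS automation level of one VM to Partially Automated.
# Assumes an existing pyvmomi connection and that `cluster` (vim.ClusterComputeResource)
# and `vm` (vim.VirtualMachine) were already looked up, e.g. via a container view.
from pyVmomi import vim


def set_vm_drs_partially_automated(cluster, vm):
    vm_override = vim.cluster.DrsVmConfigInfo(
        key=vm,
        enabled=True,
        behavior=vim.cluster.DrsConfigInfo.DrsBehavior.partiallyAutomated,
    )
    spec = vim.cluster.ConfigSpecEx(
        drsVmConfigSpec=[vim.cluster.DrsVmConfigSpec(operation="add", info=vm_override)]
    )
    # modify=True keeps the rest of the cluster configuration unchanged.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```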
3-) Affinity Rules:
With a cluster-in-a-box configuration, use a VM-VM affinity rule to keep all clustered VMs together on the same host. With cluster-across-boxes or physical-virtual clusters, use a VM-VM anti-affinity rule to separate the VMs across different hosts. vSphere HA doesn't respect VM affinity/anti-affinity rules, so when a host fails HA may violate them. In vSphere 5.1, configure the cluster with the DRS advanced option "ForceAffinePoweron" set to 1 to strictly enforce affinity rules. In vSphere 5.5, configure the cluster with both "ForceAffinePoweron" set to 1 and the HA advanced option "das.respectVmVmAntiAffinityRules" set to true, so that affinity and anti-affinity rules are respected respectively.
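A minimal pyvmomi sketch of creating the anti-affinity rule and the two advanced options (illustrative names of my own; it assumes the cluster and the two node VMs were already retrieved):

```python
# Sketch: keep clustered nodes on separate hosts (cluster-across-boxes) and ask
# DRS/HA to enforce the rule. Assumes `cluster`, `node1` and `node2` are pyvmomi
# managed objects that were already looked up.
from pyVmomi import vim


def separate_cluster_nodes(cluster, node1, node2):
    rule = vim.cluster.AntiAffinityRuleSpec(
        name="mscs-nodes-anti-affinity",
        enabled=True,
        vm=[node1, node2],
    )
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)],
        # DRS advanced option: strict enforcement of affinity rules at power-on.
        drsConfig=vim.cluster.DrsConfigInfo(
            option=[vim.option.OptionValue(key="ForceAffinePoweron", value="1")]
        ),
        # HA advanced option (vSphere 5.5): respect VM-VM anti-affinity on failover.
        dasConfig=vim.cluster.DasConfigInfo(
            option=[vim.option.OptionValue(key="das.respectVmVmAntiAffinityRules", value="true")]
        ),
    )
    # In production, merge these option lists with the existing cluster options
    # instead of overwriting them.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```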
4-) Use vSphere HA VM Monitoring to monitor the clustered VMs and restart them in case of a guest OS failure.
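A short pyvmomi sketch of enabling HA VM Monitoring at the cluster level (again an illustrative sketch that assumes an existing connection and cluster object):

```python
# Sketch: turn on vSphere HA VM Monitoring so guest OS hangs trigger a VM restart.
# Assumes `cluster` is a vim.ClusterComputeResource retrieved via pyvmomi.
from pyVmomi import vim


def enable_vm_monitoring(cluster):
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(vmMonitoring="vmMonitoringOnly")
    )
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```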
Performance:
1-) Don't use memory over-commitment on ESXi hosts running clustered VMs. Memory over-commitment may introduce small pauses in these VMs, which are very sensitive to any time delay, and may trigger false failovers.
2-) Storage:
a- SCSI Driver:
SCSI Driver Supported | OS (Windows)
LSI Logic Parallel | 2003 SP1 or SP2, 32/64-bit
LSI Logic SAS | 2008 SP2 or 2008 R2 SP1, 32/64-bit
LSI Logic SAS | 2012 (vSphere 5.5.x) or 2012 R2 (vSphere 5.5 U1 or later)
Keep in mind that you have to use separate virtual SCSI controllers for the guest OS disk and the shared quorum disk, i.e. the OS disk on SCSI (0:x) and the shared disk on SCSI (1:x).
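A minimal pyvmomi sketch of adding that second SCSI controller (illustrative names; it assumes the VM object is already retrieved and that you pick the bus-sharing mode matching your cluster configuration):

```python
# Sketch: add a dedicated SCSI controller (bus 1) for the shared/quorum disks.
# Use "virtualSharing" for cluster-in-a-box and "physicalSharing" for
# cluster-across-boxes / physical-virtual. Assumes `vm` is a vim.VirtualMachine.
from pyVmomi import vim


def add_shared_scsi_controller(vm, sharing="physicalSharing"):
    controller = vim.vm.device.VirtualLsiLogicSASController(
        busNumber=1,
        sharedBus=sharing,
    )
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        device=controller,
    )
    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
```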
b- Disk Types for OS Disks:
For the OS disks of clustered VMs, it's recommended to use thick-provisioned disks instead of thin-provisioned ones for maximum performance.
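For illustration, this pyvmomi sketch shows how the thick-provisioning flags are specified when adding a virtual disk (hypothetical function name and size; it assumes the VM object and the key of the existing SCSI 0 controller are already known):

```python
# Sketch: add a thick-provisioned (optionally eager-zeroed) .vmdk on SCSI(0:1).
# Assumes `vm` is a vim.VirtualMachine and `controller_key` is the key of the
# existing SCSI 0 controller.
from pyVmomi import vim


def add_thick_disk(vm, controller_key, size_gb=60, eager_zero=False):
    backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo(
        diskMode="persistent",
        thinProvisioned=False,   # thick provisioning
        eagerlyScrub=eager_zero,  # True = eager-zeroed thick
    )
    disk = vim.vm.device.VirtualDisk(
        backing=backing,
        controllerKey=controller_key,
        unitNumber=1,
        capacityInKB=size_gb * 1024 * 1024,
    )
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        fileOperation=vim.vm.device.VirtualDeviceSpec.FileOperation.create,
        device=disk,
    )
    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
```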
c- Disk Types Supported for Shared Quorum Disk:
vSphere Version | Cluster Configuration Type | OS (Windows) | Disk Type | SCSI Bus Sharing
vSphere 5.x | Cluster-in-a-box (recommended configuration) | 2003 SP1 or SP2, 2008 SP2 or 2008 R2 SP1 | Eager-zeroed thick-provisioned virtual disk (.vmdk), local or on FC SAN | Virtual
vSphere 5.x | Cluster-in-a-box | 2003 SP1 or SP2, 2008 SP2 or 2008 R2 SP1 | Virtual-mode RDM disk on FC SAN | Virtual
vSphere 5.x | Cluster-across-boxes (recommended configuration) | 2003 SP1 or SP2, 2008 SP2 or 2008 R2 SP1 | Physical-mode RDM disk on FC SAN | Physical
vSphere 5.x | Cluster-across-boxes | 2003 SP1 or SP2 | Virtual-mode RDM disk on FC SAN | Physical
vSphere 5.x | Physical-Virtual | 2003 SP1 or SP2, 2008 SP2 or 2008 R2 SP1 | Physical-mode RDM disk on FC SAN | Physical
vSphere 5.5 only | Cluster-in-a-box (recommended configuration) | 2008 SP2, 2008 R2 SP1, 2012 or 2012 R2 (2012 R2 requires vSphere 5.5 U1) | Eager-zeroed thick-provisioned virtual disk (.vmdk), local or on iSCSI/FCoE SAN | Virtual
vSphere 5.5 only | Cluster-in-a-box | 2008 SP2, 2008 R2 SP1, 2012 or 2012 R2 (2012 R2 requires vSphere 5.5 U1) | Virtual-mode RDM disk on iSCSI/FCoE SAN | Virtual
vSphere 5.5 only | Cluster-across-boxes (recommended configuration) | 2008 SP2, 2008 R2 SP1, 2012 or 2012 R2 (2012 R2 requires vSphere 5.5 U1) | Physical-mode RDM disk on iSCSI/FCoE SAN | Physical
vSphere 5.5 only | Physical-Virtual | 2008 SP2, 2008 R2 SP1, 2012 or 2012 R2 (2012 R2 requires vSphere 5.5 U1) | Physical-mode RDM disk on iSCSI/FCoE SAN | Physical
Keep in mind that:
– In-guest iSCSI target sharing for the quorum disk is supported with any clustering configuration and any OS.
– vSphere 5.5.x also supports in-guest FCoE target sharing for the quorum disk.
– Mixing cluster-across-boxes and cluster-in-a-box configurations isn't supported, and neither is mixing different vSphere versions within a single cluster.
– Mixing different storage protocols to connect to the quorum disk isn't supported, e.g. the first node connecting to the quorum disk over iSCSI while the second connects over FC.
– Mixing different initiator types for the same storage protocol is supported only on vSphere 5.5.x, e.g. host 1 connecting with the software iSCSI initiator and host 2 with a hardware iSCSI initiator; the same goes for FCoE.
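As an illustration of the RDM rows in the table above, here is a minimal pyvmomi sketch of attaching a LUN as a physical-mode RDM on the shared SCSI controller (hypothetical names; it assumes the VM object, the controller key, the LUN's device path and its size are already known):

```python
# Sketch: attach the shared quorum LUN as a physical-mode RDM on SCSI(1:0).
# Assumes `vm` is a vim.VirtualMachine, `controller_key` is the key of the
# shared SCSI controller, `device_name` is the LUN path (e.g.
# "/vmfs/devices/disks/naa.xxxxxxxx") and `size_kb` is the LUN capacity in KB.
from pyVmomi import vim


def add_physical_rdm(vm, controller_key, device_name, size_kb):
    backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo(
        compatibilityMode="physicalMode",
        deviceName=device_name,
        diskMode="independent_persistent",
    )
    disk = vim.vm.device.VirtualDisk(
        backing=backing,
        controllerKey=controller_key,
        unitNumber=0,
        capacityInKB=size_kb,
    )
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        fileOperation=vim.vm.device.VirtualDeviceSpec.FileOperation.create,
        device=disk,
    )
    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
```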
d- Set the shared RDM LUN on which the quorum disk is placed as perennially reserved on each ESXi host participating in or hosting a clustered VM, to prevent those hosts from taking a very long time to boot or to rescan storage. Check the following KB for more information:
VMware KB: ESXi/ESX hosts with visibility to RDM LUNs being used by MSCS nodes with RDMs may take a long time to sta…
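A small sketch of applying the perennial reservation with the esxcli command from the KB above, pushed to a host over SSH with paramiko (host name, credentials and naa ID are placeholders; repeat this on every ESXi host that has visibility to the LUN):

```python
# Sketch: mark an MSCS RDM LUN as perennially reserved on one ESXi host, using
# the esxcli command from the KB referenced above, executed over SSH.
import paramiko


def mark_perennially_reserved(host, user, password, naa_id):
    cmd = (
        "esxcli storage core device setconfig "
        "-d {} --perennially-reserved=true".format(naa_id)
    )
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    try:
        _, stdout, stderr = client.exec_command(cmd)
        return stdout.read().decode(), stderr.read().decode()
    finally:
        client.close()


# Example (placeholder values):
# mark_perennially_reserved("esxi01.lab.local", "root", "***", "naa.600601...")
```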
e- Storage Array Multi-pathing Policy:
For clustered-VM configurations on vSphere 5.1 that require an FC SAN, a specific multi-pathing policy must be set to control how the ESXi hosts connect to that FC SAN:
Multi-pathing Plugin | SAN Type | Path Selection Policy
NMP | Generic | Round Robin
NMP using SATP: ALUA_CX | EMC Clariion, EMC VNX | Fixed
NMP using SATP: ALUA | IBM 2810XIV | MRU
NMP using SATP: Default_AA | IBM 2810XIV, Hitachi, NetApp Data ONTAP 7-Mode | Fixed
NMP using SATP: SYMM | EMC Symmetrix | Fixed
In vSphere 5.5 or later, this limitation was lifted, as described in both VMware KB: Using the PSP_RR path selection policy with MSCS results in quorum disk problems and VMware KB: MSCS support enhancements in vSphere 5.5.
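If you need to script the policy change, here is a sketch that sets the PSP on the RDM device with esxcli over SSH (placeholder host, credentials and device ID; pick the PSP that matches your array's SATP from the table above):

```python
# Sketch: set the path selection policy of the shared RDM device on one host.
import paramiko


def set_device_psp(host, user, password, naa_id, psp="VMW_PSP_FIXED"):
    cmd = "esxcli storage nmp device set --device {} --psp {}".format(naa_id, psp)
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    try:
        _, stdout, stderr = client.exec_command(cmd)
        return stdout.read().decode(), stderr.read().decode()
    finally:
        client.close()
```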
3-) Guest Disk IO Timeout:
In the guest OS, it's recommended to increase the disk I/O timeout to 60 seconds or more via the following registry value:
"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue".
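A small sketch of setting that value from inside the guest with Python's winreg module (run it in the guest with administrative rights; 60 is shown here, adjust upward as your storage vendor recommends):

```python
# Sketch: raise the guest disk I/O timeout on a Windows cluster node.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Disk"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    # TimeOutValue is a REG_DWORD expressed in seconds.
    winreg.SetValueEx(key, "TimeOutValue", 0, winreg.REG_DWORD, 60)
```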
4-) Network:
a- Choose the latest vNIC type available to the guest OS. VMXNET3 is preferred for both the private and public networks, as it gives the highest throughput with the lowest latency and CPU overhead.
b- Back each port group with at least two physical NICs for redundancy and NIC teaming. Connect each physical NIC to a different physical switch for maximum redundancy.
c- Consider separating the different types of traffic, such as vMotion, management, production and Fault Tolerance. Separation can be physical or logical using VLANs.
d- Clustered VMs should have two vNICs: one for the public network and one for the heartbeat network. For cluster-across-boxes, configure the heartbeat port group with two physical NICs for redundancy.
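A minimal pyvmomi sketch of adding the second (heartbeat) VMXNET3 adapter (illustrative names; it assumes the VM and the heartbeat port group's network object are already retrieved, and it uses a standard vSwitch port group backing):

```python
# Sketch: add a VMXNET3 adapter connected to the heartbeat port group.
# Assumes `vm` is a vim.VirtualMachine and `network` is the vim.Network object
# of the heartbeat port group on a standard vSwitch.
from pyVmomi import vim


def add_heartbeat_nic(vm, network):
    nic = vim.vm.device.VirtualVmxnet3(
        backing=vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
            network=network,
            deviceName=network.name,
        ),
        connectable=vim.vm.device.VirtualDevice.ConnectInfo(
            startConnected=True, allowGuestControl=True, connected=True
        ),
    )
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add, device=nic
    )
    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
```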
Manageability:
1-) VMware's supported configurations for all Microsoft clustering solutions are documented in: Microsoft Clustering on VMware vSphere: Guidelines for supported configurations.
2-) Time Sync:
Time synchronization is one of the most important things in clustered environments. It's recommended to do the following:
a- Let all your clustered VMs sync their time with the domain controllers only, not with VMware Tools.
b- Completely disable time sync between the clustered VMs and their hosts through VMware Tools (even after unchecking the box in the VM settings, the VM can still sync with the host via VMware Tools on startup, resume, snapshot operations, etc.), as per the following KB: VMware KB: Disabling Time Synchronization.
c- Sync all ESXi hosts in the virtual infrastructure to the same Stratum 1 NTP server, which should be the same time source as your forest/domain.
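A minimal pyvmomi sketch that unchecks the sync box and adds the advanced settings listed in the Disabling Time Synchronization KB (illustrative function name; it assumes the VM object is already retrieved):

```python
# Sketch: fully disable VMware Tools time synchronization for a clustered VM
# by clearing the periodic-sync flag and adding the advanced settings from the
# Disabling Time Synchronization KB, each set to "0" (i.e. disabled).
from pyVmomi import vim


def disable_tools_time_sync(vm):
    keys = [
        "time.synchronize.continue",
        "time.synchronize.restore",
        "time.synchronize.resume.disk",
        "time.synchronize.shrink",
        "time.synchronize.tools.startup",
    ]
    spec = vim.vm.ConfigSpec(
        tools=vim.vm.ToolsConfigInfo(syncTimeWithHost=False),
        extraConfig=[vim.option.OptionValue(key=k, value="0") for k in keys],
    )
    return vm.ReconfigVM_Task(spec=spec)
```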
3-) Supported OS’s and Number of Nodes:
No. of Nodes | OS (Windows) / Configuration
2 nodes | 2003 SP1 or SP2, 32/64-bit, on vSphere 5.1 hosts
2 nodes | FCoE SAN hosting the quorum disk, with vSphere 5.1 U2 and Windows 2008/2012
5 nodes | 2008 SP2 or 2008 R2 SP1, 32/64-bit
5 nodes | 2012 (vSphere 5.5.x) or 2012 R2 (vSphere 5.5 U1 or later)
5 nodes | FC SAN hosting the quorum disk, with vSphere 5.1 U2 and Windows 2012
Recoverability:
1-) Try to maintain a proper backup/restore plan. This helps in case of total corruption of a cluster node, which would require a full restore on bare metal or a VM. Also remember to regularly test-restore your backup sets to verify their effectiveness.
2-) Try to maintain a proper DR/BC plan. Clustering configurations won't help much in a total data-center failure. Test your DR/BC plan from time to time, at least twice per year.
3-) You can use VMware Site Recovery Manager (SRM) when using MSCS VMs with array-based replication capabilities. For more information, check the limitation on vSphere 5.1 and vSphere 5.5.
Security:
1-) All security procedures used to harden physical Microsoft clusters should also be applied to the clustered VMs, such as role-based access policies.
2-) Follow the VMware Hardening Guide (v5.1/v5.5) for more procedures to secure both your VMs and your vCenter Server.
Scalability:
For greater scalability, try to upgrade your clustered VMs to Windows Server 2012. With vSphere 5.5.x and Windows Server 2012, the quorum disk can be hosted on an iSCSI or FCoE SAN, and the issue with the Round Robin PSP is solved (under certain conditions mentioned in this KB).
I know all of this can twist your mind, but it's MSCS as we know it, and unfortunately it carries the same configuration complexity with it into the virtual world on vSphere 5.1/5.5. I hope this guide makes it at least a little easier to configure your Microsoft cluster on vSphere. For more details or further explanation, refer to the References section.
References:
** Virtualizing MS Business Critical Applications by Matt Liebowitz and Alex Fontana.
** Virtualizing MS Clustering Services on vSphere 5.1.
** Virtualizing MS Clustering Services on vSphere 5.5.
** vSphere Design Sybex 2nd Edition by Scott Lowe, Kendrick Coleman and Forbes Guthrie.
Update Log:
** 03/04/2015: Added Point 1 in Manageability Section.