VMware launched vSphere 6.0 yesterday, and many of us were waiting to see whether the rumors, and the features seen in beta versions, would turn out to be true. For me, one of the most anticipated features was the new Fault Tolerance (FT) feature.
The new FT feature now supports up to 4 vCPUs and 64 GB of RAM on non-latency-sensitive VMs. As stated by VMware: “It can handle nearly 90% of total workloads”. So, is that the only difference between FT in vSphere 5.x and vSphere 6.0? The answer is absolutely NO.
In vSphere 5.x, FT was based on a technology called vLockstep, which keeps the Primary and Secondary VMs in constant synchronization. When an x86 instruction is sent to the Primary VM, it’s copied and sent to the Secondary VM as well, through the Logging Network, to be executed there. Only the results and output of the Primary VM are published to the outside world, while the outputs of the Secondary VM, which are identical, are discarded. For this to work, the two copies have to share the same vmdk disk, with the Primary VM being the only one that writes to it.
When the host of the Primary VM goes down, the Secondary becomes active seamlessly and almost instantly, without any VM network disconnection, and then another Secondary is spawned. Until now, this approach has proved successful only for single-vCPU VMs.
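The lockstep idea described above can be sketched as a toy model. This is only an illustration under heavy assumptions, not how vLockstep is implemented: the ToyVM class, the "add"/"mul" operations and the run_lockstep and fail_over helpers are all hypothetical names, and a whole VM is reduced to a single integer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM: its whole state is one integer."""
    role: str
    counter: int = 0
    published: list = field(default_factory=list)

    def execute(self, op, value):
        # Deterministic "instructions": the same inputs give the same state.
        if op == "add":
            self.counter += value
        elif op == "mul":
            self.counter *= value
        else:
            raise ValueError(f"unknown op: {op}")
        # Only the Primary's output reaches the outside world.
        if self.role == "primary":
            self.published.append(self.counter)


def run_lockstep(ops):
    """Feed every operation to both VMs, as a stand-in for the Logging Network."""
    primary = ToyVM("primary")
    secondary = ToyVM("secondary")
    for op, value in ops:
        primary.execute(op, value)
        secondary.execute(op, value)  # same instruction, output suppressed
    return primary, secondary


def fail_over(secondary):
    """The Secondary already holds identical state, so promotion is a role flip."""
    secondary.role = "primary"
    return secondary


primary, secondary = run_lockstep([("add", 2), ("mul", 5), ("add", 1)])
print(primary.counter, secondary.counter, secondary.published)
```

Both toy VMs end up with the same counter, while only the Primary has published anything. After fail_over, the former Secondary simply starts publishing from the state it already has.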
Another point: the initial creation of any Secondary VM is done over the vMotion network, while all subsequent logging goes through the Logging NIC. When a Secondary VM is to be created, a vMotion process copies the computing state of the Primary VM (CPU & memory) to another host, but instead of terminating the original after the copy completes successfully, both are preserved.
This technology is network-intensive, though not as much as what you’ll see below: a single 1 Gb/s Logging NIC on the host was enough in many cases where only one or two FT VMs ran on a single host. Another 1 Gb/s vMotion NIC is required for fast initial creation.
Now let’s move to vSphere 6.0. Here, VMware changed the underlying technology of FT to Fast Checkpointing technology, which is “heavily modified XvMotion code” as stated by VMware. This technology creates multiple checkpoints per second of the Primary VM (I think they mean snapshots!), continuously, until the Primary goes down. And as this technology is based on XvMotion code, it removes the requirement for a shared vmdk and shared storage. The new FT technology allows completely separate Primary and Secondary VMs, each on a different host and different storage.
I think (confirmation is needed) that they couldn’t use the old vLockstep because the disk isn’t shared between the two VMs, i.e. they have to use a technology that copies both the computing state (instructions) and the reads and writes on the disk. The existing technology that does that is Snapshot, which captures a point in time of both the computing state (CPU and memory), by selecting the “Snapshot the VM’s memory & Quiesce Guest File System” options, and the vmdk file.
Here’s how I imagine this technology actually works:
1-) Initial creation of the Secondary VM is done through an XvMotion of a copy of the Primary VM to another host and another storage. For more info about XvMotion, refer to Frank Denneman’s article about it.
2-) After the first copy is created, a constant stream of checkpoints (snapshots) is taken from the Primary VM and sent over the Logging Network, using XvMotion, to be applied to the Secondary VM, while any output from the Secondary VM to the outside world is suppressed. This stream of checkpoints is seamlessly consolidated on the Secondary VM, so that it is ready when any failure happens to the host of the Primary VM.
3-) When the host of the Primary VM fails, the Secondary VM is promoted to be the Primary one, and the process is repeated to create another Secondary.
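The three steps above can be sketched as a toy model. To be clear, this only illustrates my guess, not VMware's actual Fast Checkpointing code: ToyPrimary, ToySecondary and their methods are hypothetical names, and memory and disk are reduced to plain dictionaries. The point is that each checkpoint carries only what changed since the previous one, for both memory and disk, so the two VMs never need a shared vmdk.

```python
class ToyPrimary:
    """Hypothetical Primary VM: memory pages and disk blocks are plain dicts."""

    def __init__(self, memory, disk):
        self.memory = dict(memory)
        self.disk = dict(disk)
        self._dirty_pages = set()
        self._dirty_blocks = set()

    def write_memory(self, page, value):
        self.memory[page] = value
        self._dirty_pages.add(page)

    def write_disk(self, block, value):
        self.disk[block] = value
        self._dirty_blocks.add(block)

    def full_copy(self):
        # Step 1: the initial full copy (the XvMotion-like phase).
        return dict(self.memory), dict(self.disk)

    def take_checkpoint(self):
        # Step 2: ship only what changed since the last checkpoint.
        checkpoint = {
            "memory": {p: self.memory[p] for p in self._dirty_pages},
            "disk": {b: self.disk[b] for b in self._dirty_blocks},
        }
        self._dirty_pages.clear()
        self._dirty_blocks.clear()
        return checkpoint


class ToySecondary:
    """Hypothetical Secondary VM with its own, separate memory and disk."""

    def __init__(self, memory, disk):
        self.memory = dict(memory)
        self.disk = dict(disk)

    def apply_checkpoint(self, checkpoint):
        # Consolidate the incoming changes into the Secondary's own state.
        self.memory.update(checkpoint["memory"])
        self.disk.update(checkpoint["disk"])


primary = ToyPrimary({0: "a", 1: "b"}, {0: "x"})
secondary = ToySecondary(*primary.full_copy())
primary.write_memory(1, "c")
primary.write_disk(5, "y")
secondary.apply_checkpoint(primary.take_checkpoint())
print(secondary.memory == primary.memory, secondary.disk == primary.disk)
```

In this model, step 3 is trivial: after the last applied checkpoint the Secondary holds a complete copy of its own, so it can be promoted and seed a new Secondary with full_copy.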
For 4-vCPU VMs, this constant stream of checkpoints is so network-intensive that VMware stated that a dedicated 10 Gb/s Logging NIC is highly recommended.
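To get a feel for why a 1 Gb/s link may not be enough, here is a rough back-of-the-envelope sketch. The dirty-page rate used in the example is a made-up number, not a measurement or a VMware figure, and the function names are hypothetical. It also assumes every dirtied 4 KiB memory page is sent in full, ignoring disk writes, compression and protocol overhead.

```python
PAGE_SIZE_BYTES = 4096  # standard small page size on x86


def checkpoint_traffic_gbps(dirty_pages_per_second, page_size_bytes=PAGE_SIZE_BYTES):
    """Estimate checkpoint traffic in Gb/s if every dirty page is sent in full."""
    bits_per_second = dirty_pages_per_second * page_size_bytes * 8
    return bits_per_second / 1e9


def fits_on_link(dirty_pages_per_second, link_gbps):
    """True if the estimated traffic fits within the given link speed."""
    return checkpoint_traffic_gbps(dirty_pages_per_second) <= link_gbps


# Hypothetical workload that dirties 100,000 pages per second.
rate = 100_000
print(checkpoint_traffic_gbps(rate))
print(fits_on_link(rate, 1), fits_on_link(rate, 10))
```

With that assumed rate the estimate is about 3.3 Gb/s, which overflows a 1 Gb/s Logging NIC but fits on a 10 Gb/s one. Real numbers will depend entirely on the workload.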
Now, after a quick peek at the changes, some questions remain:
1-) How many FT VMs are allowed per host after these huge enhancements?
2-) Will I be able to choose whether to separate the Primary and Secondary VMs’ disks or make them share the same disk? In other words, will I be able to choose between the new Fast Checkpointing technology and the old vLockstep?
3-) What if the storage holding the disk of either the Primary or the Secondary VM fails? What will happen then? Will FT capabilities extend to protect against storage failures too?
Unfortunately, we can’t test the new FT in the HOL released by VMware yesterday. We’ll have to wait until further information is published. Until then, I can say that this huge enhancement will add a new level of availability for many more workloads than vSphere 5.x, as well as a new challenge in designing your VI. Let’s wait and see 🙂