Lets Talk About Protecting and Recovering VMs with #VCDR #vExpert


So now we have deployed and configured VCDR, now we can go into all the bits that allow you to protect your VMs and failover!

Protection Groups

So now we can get into the nitty-gritty of starting to create Protection Groups and Recovery Plans. This is the meat of it all, where you define what is going to be protected and how it should be recovered.

Creating a Protection Group

As you can see you can define how VMs are added to the protection group by using Naming Patterns, Tags or VM Folders.

In this example I have used vSphere Tags:

Using vSphere Tags

In my vCenter you can see I have created a tag specifically for VCDR to use:

My On Prem vSphere Tags

Then we can look at the various schedules:

Snapshot Schedules

By default, the lowest RPO you can have is 4 hours, then its 6, 8, 12, daily, weekly, monthly

For retention periods you have hours, days, weeks, months, years

Now if you have vSphere 7.0u2c+ at the source you can use the High-Frequency Snapshot Feature which gets you down to 30mins RPO.

With On-Prem vSphere 7.2u2c+ you can have 30 min snapshots

Once that has been created it will be listed under Protection Groups:

Newly Created Protection Group Listed

Regardless of what way you go about grouping VMs into protection groups, as long as they meet the criteria they will be added automatically, every time a snapshot is taken, as it queries the vCenter inventory each time.

Now it will show as empty, normally I would expect to see the VMs listed or something like that, but nope. Until it takes a snapshot you won’t know what’s protected:

Nothing Listed Till Snapshots Taken

After the schedule snapshot has been taken:

Snapshots Taken and Listed

It is just a normal snapshot with vCenter as seen in the VM tasks/events:

VM Tasks
VM Events

You can take additional snapshots on demand if you like and define their retention period:

On-Demand Snapshots

This is quite handy if you know you have some big changes coming along and you want to give yourself that extra safety net.

Recovery Plans

Just like with SRM once you have configured your Protection Groups, you must create your Recovery Plans. A Recovery Plan uses the protection groups and recovers them as you see fit.

Some customers have one master recovery plan, that recovers everything, and then it has more precise recovery plans to keep things more controlled.

For example, you could have a recovery plan for Test and another one for Prod and then have a third plan that recovered both, all depending on your requirements.

Recovery Plans are where resource mappings are configured. This is where you tell VCDR, where VMs in Cluster A should end up in VMConAWS Cluster B, same for datastores, networks, folders etc.

One key thing to note is you CAN’T have a many to one style of mapping. So, for example, you can’t have 2 source clusters mapped to one single SDDC cluster. The way around this is that I have used is to have multiple resource pools as shown below:

If you try to do multiple to one mapping you will get the already mapped error

As you can see my HCX cluster can’t be mapped to the Compute-ResourcePool because it has already been mapped for another source cluster. So, to get around this:

Created another Resource Pool in VMConAWS

I created a VCDR-Testing Recovery Pool and used that for the HCX Cluster.

In this example, I have 2 Protection Groups and I am configuring a new Recovery Plan to encompass both:

Multiple Protection Groups

For this Recovery Plan, I can do a totally different set of resource mapping as well, giving me a vast amount of flexibility.

Since this is for Dev VMs, I have made the choice to recover them all but NOT to power them on:

Different Recovery Options per Recovery Plan

There are various other options available:

Various Reocvery Options 1
Various Recovery Options 2
Various Recovery Options 3
Various Recovery Options 4

I can even set email alerts for this plan:

Sending Email Alerts

Continuous Compliance

If you’ve used SRM before you will know all about the compliance reports, you can pull to show the business and auditors that you are doing regular tests and that everything is configured correctly.

VCDR allows for these but also checks your Recovery Plans for configuration compliance to make sure everything is correct, and it does this every 30 mins, and you can force sync it as well.

I have a continuous compliance report below that is showing an error:

Continous Compliance Example

When you create a Protection Group that relies on vSphere tags, one of the key things you must remember is that whatever tags are being used, you must create them manually yourself at the destination. If you don’t do this, you will have issues and it warns you as such.

So, after I created the Tags in the SDDC, I could either wait 30 mins for it to run again or force the check manually and it passed:

Passing Compliance

You can also download a compliance report for audit purposes:

Downloading the Compliance Report

So now that is all done, we can now actually run a test!

Click inside the DR plan and you will be presented with these options:

Failover/Test/Deactivate Plan

Test Plan

This does exactly as you might expect, it recovers the VMs are per your Recovery Plan and brings them up but not attached to any network. So you can test for compliance and if things don’t go to plan, you can clean up and try again.

You can select what snapshot you want to use, so if you have a requirement to use a specific time and day of snapshot you are covered:

You can select various snapshots to use for the test depending on your requirements

You can then decide how it handles errors and whether the VMs should remain on the CFS or be moved over to the SDDC VSAN datastore:

Where to put the VMs for the test

You then confirm you want to run the test and away it goes:

Confirm you wish to run the test

Once the test has been completed you are presented with:

Completed test

Now when we look in the SDDC we see the VMs have been recovered:

Recovered VM in a powered-off state as per the recovery plan settings

As we can see it got the required tag, it is configured in the correct network, but it is NOT connected to it. My Recovery Plan stated that if the source VM was powered off, in a DR event then VM would be recovered and not powered on as well.

I did a test where I did not have the tags created in the SDDC to see what it did, and it recovered the VM the same but obviously, no tags were configured for it.

Once you are happy with your test, you can then clear it down and that’s done with a few clicks:

Clean up

It will then remove the VMs and go back to a ready state

Failover

Failover is very much the same process:

It confirms the compliance checks:

Compliance Checks

It confirms the snapshot is valid and no further replication is needed:

Confirm the snapshot you want to use

How it should handle any errors:

How will it handle errors for the failover

In a DR event, you would most likely just ignore errors of any kind as you just want to get up and running asap.

Then you confirm that you want to Failover:

Time to failover!

What is interesting is that it will downgrade the hardware level of v18/19 VMs down to v17, in normal circumstances downgrading VM hardware levels has been unsupported.

Once it’s done you have the final choice to Commit or Rollback, which is a nice feature:

Do you commit or rollback?!

Now one thing to keep note of, it assumes the Protected Site is dead, so if VMs are on etc, you need to turn them off yourself etc.

When you select Commit you are given the option to create a failback plan:

Datastore selection

If the VMs have come from different datastores, it will try and map them but if it can’t you can select a general datastore that will cover any straggler VMs.

The DR Plan you used to failover will be put into a deactivated state:

Deactivated

Then we activate the Failback DR Plan:

Failback plan activated

Interestingly the failback plan is not testable.

When you failover:

Failback Plan

So out of interest I left one of my VMs powered on at the source as I wanted to see how it would handle this scenario:

What VCDR does it failback the VM if it’s powered on it will force power it off, and reconfigure it and then power it back on with the recovered data:

Also when you failback from VMC to On-Prem the VMConAWS copy of the VMs are deleted and not left behind, which is the opposite of when you failover from On-Prem to VMConAWS.

Once it is done you can commit, and this time there is no rollback option:

It is time to commit!

You can create a failback plan again, so a failback of your failback, but in my experience customers in this scenario will probably want to take stock of everything and start fresh.

If you do create a failback of your failback Plan, you can test it and everything like you normally would for an On-Prem to Cloud DR plan:

Failback plan for the failback plan!

Be the first to comment

Leave a Reply

Your email address will not be published.


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.