Introduction
I was recently asked to produce a presentation for an Xtravirt internal working group session, where I would be covering a “refresher” of VMware’s Site Recovery Manager (SRM). Naturally I jumped at the opportunity because it has been a few years since I have worked with the Site Recovery Manager. That being said, I also had to run myself through somewhat of a refresher beforehand.
As part of the presentation, I put together a five-minute demo video where I cover the process of performing a controlled failover within the Site Recovery Manager from the production site to the recovery site.
What is Site Recovery Manager?
To quote VMware “SRM is the industry-leading disaster recovery (DR) management solution, designed to minimize downtime in case of a disaster. It provides policy-based management, automated orchestration, and non-disruptive testing of centralized recovery plans. It is designed for virtual machines and scalable to manage all applications in a vSphere environment.”
What has changed?
Since I last worked with SRM there have been some major changes, specifically around the operating platform. Historically, the Site Recovery Manager was an application that had to be installed on a Microsoft Windows Server, with a reliance on an external SQL database.
This changed back in 2019, and Site Recovery Manager is now packaged by VMware as an appliance running on Photon OS, with an embedded deployment of vPostgres for the database.
When we start to look at the logical architecture of SRM, there are several different components that come into the scope of a deployment:
- Site Recovery Manager Server – This integrates with an underlying replication technology to provide policy-based management, non-disruptive testing and automated orchestration of recovery plans.
- vSphere Replication Appliance – vSphere Replication is a proprietary, host-based replication engine to replicate VMware virtual machines to the recovery site.
- Storage Replication Adapters – Integrate with third-party array-based replication products for data replication.
- VMware vCenter Server – vCenter Server is a centralised platform for managing your VMware vSphere environment.
- Platform Services Controller – Provides infrastructure services for the environment including Single Sign-On, Licensing, and Certificates.
For the purpose of the demonstration video above, I utilised vSphere Replication as I didn’t have the hardware available within my lab environment to facilitate array-based replication.
Resource Mappings
There are several key concepts to mention regarding Site Recovery Manager, specifically around resource mappings, which are established following the creation of a site pairing. We have three different resources that we map to ensure that when a failover is initiated the virtual machine is brought up in the recovery site correctly. The mappings consist of:
- Virtual networks – ensuring that the recovered virtual machine is connected to the correct network in the recovery site.
- Folders – ensuring that recovered virtual machines are in the correct folder structure in the recovery site. This could be particularly key for organisations that have a permission structure in place within vCenter based on the virtual machine folder structure.
- Compute resources – ensuring that the recovered virtual machine is powered up on the correct compute at the recovery site. This could be a top-level vSphere Cluster or a Resource Pool contained within a cluster.
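To make the idea of the three mappings concrete, here is a minimal Python sketch. It is not the SRM API; the class names, the `resolve_placement` helper and the inventory names such as `Prod-VLAN10` are all hypothetical and exist only to illustrate how a protected virtual machine's network, folder and compute could be translated to their recovery-site equivalents.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InventoryMappings:
    """Hypothetical model of the three mapping tables (not the SRM API)."""
    networks: Dict[str, str] = field(default_factory=dict)
    folders: Dict[str, str] = field(default_factory=dict)
    compute: Dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualMachine:
    """Where a protected VM lives on the production site."""
    name: str
    network: str
    folder: str
    compute: str


def resolve_placement(vm: VirtualMachine, mappings: InventoryMappings) -> Dict[str, str]:
    """Translate a VM's production placement to the recovery site.

    Raises LookupError listing every resource type that has no mapping.
    """
    lookups = (
        ("network", mappings.networks, vm.network),
        ("folder", mappings.folders, vm.folder),
        ("compute", mappings.compute, vm.compute),
    )
    missing = [kind for kind, table, key in lookups if key not in table]
    if missing:
        raise LookupError(f"{vm.name}: no mapping for {', '.join(missing)}")
    return {kind: table[key] for kind, table, key in lookups}


# Illustrative inventory names only.
mappings = InventoryMappings(
    networks={"Prod-VLAN10": "DR-VLAN110"},
    folders={"Prod/Web": "DR/Web"},
    compute={"Prod-Cluster": "DR-Cluster"},
)
web01 = VirtualMachine("web01", "Prod-VLAN10", "Prod/Web", "Prod-Cluster")
placement = resolve_placement(web01, mappings)
```

In this sketch a virtual machine with an unmapped network, folder or compute resource raises an error instead of being placed somewhere unexpected, which mirrors why the mappings need to be complete before a failover is initiated.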
Defining a Placeholder Datastore
Additionally, we must define a “placeholder datastore”. In Site Recovery Manager’s context, a placeholder is a subset of virtual machine files; these are very small and do not represent a full copy of the virtual machine that you are protecting.
There are no VMDKs attached to the placeholder virtual machine object; it serves as a reservation, if you will, for the compute on the recovery site. You may have noticed that we haven’t discussed a mapping for the datastore where the virtual machine will be located on the recovery site.
That’s because this isn’t a mapping that is configured in Site Recovery Manager. In the case of this demo it is handled by vSphere Replication: the target datastore is configured within the replication task, along with the replication RPO requirements. If you are using array-based replication and have an SRA installed in Site Recovery Manager, then this is configured at the array level between the LUNs/devices.
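As a rough illustration of the point that the target datastore and RPO belong to the replication task and not to an SRM mapping, the sketch below models a per-VM replication setting in Python. The `ReplicationSettings` class, the `rpo_violated` helper and the datastore name are hypothetical and do not correspond to any vSphere Replication API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ReplicationSettings:
    """Hypothetical per-VM replication settings (not a vSphere Replication API)."""
    vm_name: str
    target_datastore: str  # chosen in the replication task, not in an SRM mapping
    rpo: timedelta         # maximum tolerated age of the latest replica


def rpo_violated(settings: ReplicationSettings, last_sync: datetime, now: datetime) -> bool:
    """Return True when the newest replica is older than the configured RPO."""
    return now - last_sync > settings.rpo


# Illustrative values only.
settings = ReplicationSettings("web01", "DR-Datastore-01", timedelta(minutes=15))
now = datetime(2024, 1, 1, 12, 0)
on_time = rpo_violated(settings, datetime(2024, 1, 1, 11, 50), now)
too_old = rpo_violated(settings, datetime(2024, 1, 1, 11, 30), now)
```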
Protection Groups and Recovery Plans
Finally, within SRM we have protection groups and recovery plans. A protection group is a grouping of virtual machines that you want to fail over together. For example, in the classic three-tier application of a web server, an application server and a database server, failing these three virtual machines over together ensures that no element of the application is missed.
A recovery plan builds on protection groups: it allows you to pull multiple protection groups together and fail them over within the same task. This is also where we define and build our orchestration. Within the recovery plan we can set virtual machine power-up priorities and power-up dependencies, as we wouldn’t want the web and application tiers of our application being powered up before the database is ready.
We can also define any pre- or post-power-on steps, along with any IP customisation that may need to take place within the guest operating system if we are not fortunate enough to have the underlying networking capabilities to stretch layer two networks across multiple sites.
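The way priorities and dependencies combine into a power-up sequence can be sketched in a few lines of Python. This is an illustration and not how SRM is implemented: the `power_on_order` function and the VM names are hypothetical, and the sketch assumes that lower priority numbers power on first and that dependencies are only ordered between virtual machines in the same priority group.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def power_on_order(vms):
    """Return VM names in power-on order.

    vms maps a VM name to {"priority": int, "depends_on": [names]}.
    Priority groups are processed in ascending order; inside a group,
    a VM is placed after the VMs it depends on.
    """
    order = []
    for priority in sorted({spec["priority"] for spec in vms.values()}):
        group = {name for name, spec in vms.items() if spec["priority"] == priority}
        # Keep only dependencies that live in the same priority group;
        # cross-group ordering is already handled by the priority itself.
        graph = {
            name: [dep for dep in vms[name].get("depends_on", []) if dep in group]
            for name in sorted(group)
        }
        order.extend(TopologicalSorter(graph).static_order())
    return order


# Classic three-tier example with illustrative names.
three_tier = {
    "db01": {"priority": 1, "depends_on": []},
    "app01": {"priority": 2, "depends_on": ["db01"]},
    "web01": {"priority": 2, "depends_on": ["app01"]},
}
sequence = power_on_order(three_tier)
```

Here the database sits in the first priority group, so it is started before the application and web tiers, and the dependency inside the second group keeps the web server behind the application server.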