Core Concepts: Define RTO and RPO (Veeam KB link)
Defining the protection scope:
- First, find out how many machines need protecting and how much disk space they currently use
- Combine that information with the calculated daily change rate
- This information is critical for ensuring that there is enough space available to store the backups. Veeam creates a full backup file on the first run, and each job run after that is an incremental. An incremental backup only backs up the changed blocks. If only a small number of blocks have changed, the backup will be short and small; if a large number of blocks have changed, the backup will take significantly longer and use additional space.
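The sizing described above can be sketched as a rough calculation. This is a minimal estimate only: it assumes one backup per day, a single chain of one full backup followed by incrementals, and no compression or deduplication. The function name and parameters are hypothetical and are not part of any Veeam tool.

```python
def estimate_repository_gb(used_gb, daily_change_rate, restore_points):
    """Rough repository size for one full backup plus daily incrementals.

    used_gb           -- disk space currently used by the protected machines
    daily_change_rate -- fraction of the data that changes per day (0.05 = 5%)
    restore_points    -- number of restore points kept, including the full
    """
    if restore_points < 1:
        raise ValueError("at least one restore point (the full backup) is needed")
    full_gb = used_gb                           # first run: full backup
    incremental_gb = used_gb * daily_change_rate  # each later run: changed blocks only
    return full_gb + incremental_gb * (restore_points - 1)


# Example: 5,000 GB in use, 5% daily change, 14 restore points kept
print(round(estimate_repository_gb(5000, 0.05, 14)))
```

Real repositories will usually differ from this figure, since compression, deduplication and synthetic or periodic full backups all change the result.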
RPO (Recovery Point Objective):
RPO and RTO are absolute requirements for a DR plan.
RPO is the point in time at which the latest backup is available. It represents the accepted risk: the amount of time, and therefore data, that may be lost since the last backup.
It also determines how many backups need to be taken to ensure that a copy is available within that window.
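As a small illustration of how the RPO drives the number of backups, the sketch below works out how many evenly spaced backups per day are needed so that the gap between them never exceeds the RPO. It ignores how long each job takes to run, and the function name is hypothetical.

```python
import math


def backups_per_day(rpo_hours):
    """Minimum number of evenly spaced backups per day so that the
    interval between two backups is no longer than the RPO."""
    if rpo_hours <= 0:
        raise ValueError("RPO must be greater than zero")
    return max(1, math.ceil(24 / rpo_hours))


print(backups_per_day(24))  # daily backup
print(backups_per_day(4))   # a backup every 4 hours
```

An RPO of 4 hours, for example, needs at least six backups a day under these assumptions.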
RTO (Recovery Time Objective):
RTO is the time between the incident and the point at which the environment or systems are available for use again. If you have an RTO of 24 hours in your Disaster Recovery plan, this is the agreed time within which the systems should be back online and available. This time can be estimated by running failover tests or recovery scenarios in which the recovery steps are tested and timed.
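One way to use the timings from a recovery test is to add up the duration of each step and compare the total with the agreed RTO. The sketch below assumes the steps run one after another; the step names and durations are made-up examples, not measurements.

```python
from datetime import timedelta


def total_recovery_time(step_durations):
    """Sum the timed recovery steps, assuming they run sequentially."""
    return sum(step_durations.values(), timedelta())


def meets_rto(step_durations, rto):
    """True when the measured recovery time is within the agreed RTO."""
    return total_recovery_time(step_durations) <= rto


# Hypothetical timings recorded during a recovery test
timed_steps = {
    "declare incident and start recovery": timedelta(minutes=30),
    "restore virtual machines": timedelta(hours=6),
    "verify applications": timedelta(hours=2),
}

print(total_recovery_time(timed_steps))
print(meets_rto(timed_steps, timedelta(hours=24)))
```

If the total is close to the RTO, it is worth repeating the test, as a single run may not reflect the conditions of a real incident.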
Planning RTO and RPO
There are different recovery strategies that can be used, all of which can range from a short downtime to a longer recovery. Achieving a short downtime may require additional features and services to be available.
To achieve an RTO and RPO within seconds, you can utilise Veeam CDP. However, this requires two separate sites, Prod and DR, so that data can be replicated live and failed over in near real time. The data is synchronised almost instantly to ensure the recovery point objective is met.
For an RPO of minutes but an RTO of seconds, Veeam Replication gives you an RPO within minutes, while your recovery time, from failing over to the environment starting up, can be within seconds. This is due to the way replicas are created: the job first creates a snapshot, as a backup would, and then applies the changes to the target. This can run every few minutes.
For a recovery time within minutes, Veeam Snapshot Orchestration (within VMware vSphere) creates a chain of application-consistent, array-based snapshots, which can then be mirrored/replicated to a secondary array.
Moving on to an RPO of less than 24 hours and a recovery time that is generally within minutes, or within a day, we have the straightforward backup. Generally, a backup is taken every 24 hours at the end of the day, although some may be configured to run every hour or every 2 hours depending on the requirement. A daily backup, however, creates a much longer RPO to go back to, and thus more changes made during the day may not be backed up.
If you find yourself in a situation where your backups are unavailable, you will be looking at Backup Copies, where your RPO is going to be in the 24-48 hour range and your RTO could be within minutes, but is generally a number of hours depending on where your backup copies are stored. Backup Copies allow you to keep a copy offsite, in another data centre. As this is a copy of the backup, the RPO calculation is the frequency of the backup plus the time the backup copy takes to complete its copy process to the next location.
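The Backup Copy calculation above can be written out as a short sketch. The function name and the example figures are assumptions for illustration only.

```python
def backup_copy_rpo_hours(backup_interval_hours, copy_duration_hours):
    """Worst-case RPO of a backup copy: the interval between source
    backups plus the time the copy takes to reach the next location."""
    if backup_interval_hours <= 0 or copy_duration_hours < 0:
        raise ValueError("interval must be positive and copy time non-negative")
    return backup_interval_hours + copy_duration_hours


# Example: daily backup, copy takes 6 hours to reach the offsite repository
print(backup_copy_rpo_hours(24, 6))
```

With a daily backup and a six-hour copy, the result falls inside the 24-48 hour range mentioned above.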
To really get into the 3-2-1 rule, tape is a great way to air-gap your backups and keep an offsite/disconnected copy, although it puts both the RPO and RTO into the hours and needs significant recovery operator involvement. The reason this is the slowest solution is the speed at which tapes read and write. Tapes also generally only hold upwards of a few TB and thus require swapping, and if you are using the GFS (Grandfather-Father-Son) method, a recovery may require a number of tapes to be swapped depending on where in the chain it is being processed.
Summarising RTO and RPO planning:
Depending on the budget and resources available, the RTO and RPO can differ significantly, and with less equipment the RTO can be much higher. For instance, without a failover site, both the RTO and RPO could be significantly longer, as the best option, if supported by the array, is Snapshot Orchestration, where the data is already available on the array. If that is not supported and you do not have a second site, you would be recovering from a backup, and if those backups have been affected by the disaster as well, recovery from Backup Copies or tape is your only real remaining option.
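The fallback order described in this summary can be expressed as a simple decision sketch. The function and its parameters are hypothetical; it only encodes the order given in this article and is not a Veeam feature or API.

```python
def recovery_options(has_failover_site, array_supports_snapshots, backups_intact):
    """Return the recovery options to consider first, following the
    fallback order described in this article."""
    if has_failover_site:
        # A second site allows failover with the shortest RTO and RPO
        return ["Veeam CDP", "Veeam Replication"]
    if array_supports_snapshots:
        # Data is already available on the array
        return ["Snapshot Orchestration"]
    if backups_intact:
        return ["Backup"]
    # Primary backups were affected by the disaster as well
    return ["Backup Copy", "Tape"]


print(recovery_options(False, False, True))
```

In practice several of these layers are usually combined, so the function is best read as "what is the fastest option still available" rather than a choice of a single product feature.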
With planning and budget, RTOs and RPOs can be lowered, and recovery can need only a small amount of restore operator assistance to get production back up and running.
Keep in mind that the longer the RPO, the bigger the gap of data that isn't backed up and available when the environment is recovered.