What Are Backup and Disaster Recovery? Practical Guide to RPO, RTO, 3-2-1 Backup and Ransomware Recovery
A practical guide to backup and disaster recovery covering RPO, RTO, 3-2-1 backup, immutable and offline backups, database and Kubernetes backup, ransomware recovery, DR runbooks, restore testing, and a 90-day rollout roadmap.
Figure: Kubernetes backup and disaster recovery.1
Figure: open-source backup tooling.2
Figure: deduplicating backup.3
Quick summary
Backup is a copy of data, configuration or systems that can be restored after deletion, corruption, hardware failure, ransomware, deployment failure or infrastructure outage. Disaster Recovery, or DR, is the plan, architecture and operating process used to restore services after a major incident, not just restore files.
NIST SP 800-34 Rev. 1 provides contingency planning guidance that helps organizations understand the purpose, process and format of information system contingency plans, and evaluate systems and operations to determine recovery requirements and priorities.4 The AWS Well-Architected Reliability Pillar treats RTO and RPO as restoration objectives and says a DR strategy should be based on business needs, workload resources, disruption probability and recovery cost.5
Simple version: backup answers “do we still have the data?”; disaster recovery answers “can the service run again, how fast, and how much data did we lose?”
Why backup alone is not enough
A backup is useful only if you can restore the right data, from the right point in time, into the right system, within RTO/RPO limits.
Teams often have backups but still fail recovery because:
- restores were never tested;
- ransomware encrypted backup repositories;
- database transaction logs are missing;
- encryption keys are unavailable;
- app configuration or IaC is missing;
- backups are incompatible with new versions;
- restore is too slow;
- the clean restore point is unknown;
- backups are in the same compromised cloud account;
- no runbook defines who does what during an incident.
Backup must be paired with DR planning, restore tests, monitoring, access control and incident runbooks.
Backup, High Availability and Disaster Recovery
| Concept | Goal | Example |
|---|---|---|
| Backup | preserve recoverable copies | hourly database snapshots, object backups |
| High Availability | reduce downtime from component failure | load balancers, multi-node clusters, failover |
| Disaster Recovery | recover after major disruption | region failover, rebuild cluster, restore database |
| Business Continuity | keep business operations running | manual processes, customer communication |
AWS also distinguishes Availability and Disaster Recovery: both rely on practices such as monitoring, multi-location deployment and automatic failover, but DR focuses on copies of the entire workload and recovery time after disaster.5
What are RPO and RTO?
The two most important DR metrics are RPO and RTO.
| Metric | Meaning | Example |
|---|---|---|
| RPO | Recovery Point Objective, maximum acceptable data loss | lose at most 15 minutes of data |
| RTO | Recovery Time Objective, maximum acceptable restoration time | restore service within 2 hours |
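| MTD | Maximum Tolerable Downtime, the longest outage the business can absorb | back in full operation within 8 hours |
| WRT | Work Recovery Time, the time to verify data and resume operations after the technical restore | reconciliation, validation, reopening operations |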
Example:
Payment system:
RPO = 5 minutes
RTO = 30 minutes
Internal blog:
RPO = 24 hours
RTO = 2 days
Not every system needs very low RPO/RTO. Lower RPO/RTO usually costs more.
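The arithmetic behind these targets can be sketched in Python. This is a minimal sketch under stated assumptions, not a sizing tool: the function names are hypothetical, the worst case is assumed to be a failure just before the next backup run completes, and the check that RTO plus WRT should fit inside MTD is a commonly used rule of thumb rather than something the sources above prescribe.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst-case data loss for periodic backups.

    Assumption: a backup captures data as of the moment it starts, and the
    failure happens just before the next run finishes, so the newest usable
    restore point is one full interval plus one run duration old.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, backup_duration: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def fits_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True if technical restore (RTO) plus work recovery (WRT) fits inside MTD."""
    return rto + wrt <= mtd
```

For example, hourly backups that take 10 minutes give a worst case of 70 minutes of lost data, which does not satisfy a 15-minute RPO. A schedule like that needs continuous log archiving or replication on top of periodic backups.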
What is 3-2-1 backup?
Common rule:
3 copies of data
2 different storage/media types
1 offsite copy
For ransomware resilience, many teams extend this to:
3-2-1-1-0
3 copies
2 storage types
1 offsite copy
1 offline or immutable copy
0 errors after verification/restore testing
Practical meaning:
- production data is not the only copy;
- backup is not in the same permission boundary as production;
- at least one copy resists deletion/modification;
- restore is tested;
- backup job failures are monitored.
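The 3-2-1-1-0 rule can be expressed as a simple inventory check. The sketch below is illustrative only: the `Copy` record and its fields are hypothetical, and it assumes the production copy counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str                  # e.g. "disk", "object-storage", "tape"
    offsite: bool
    immutable_or_offline: bool
    verify_errors: int          # errors found by the last verification/restore test


def check_3_2_1_1_0(copies: list[Copy]) -> dict[str, bool]:
    """Evaluate each part of the 3-2-1-1-0 rule for one data set."""
    return {
        "3 copies": len(copies) >= 3,
        "2 storage types": len({c.media for c in copies}) >= 2,
        "1 offsite copy": any(c.offsite for c in copies),
        "1 offline or immutable copy": any(c.immutable_or_offline for c in copies),
        "0 errors": bool(copies) and all(c.verify_errors == 0 for c in copies),
    }
```

Returning one result per rule, rather than a single pass/fail flag, makes it obvious which part of the policy a given system is missing.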
Immutable and offline backups
Immutable backup means backup data cannot be modified or deleted during a retention period, even with normal credentials. Examples include object storage Object Lock, WORM storage or backup appliances with immutability.
Amazon S3 Object Lock stores objects using a write-once-read-many (WORM) model and can prevent object deletion or overwrite for a fixed retention period or indefinitely.6 Azure Blob immutable storage supports WORM retention through time-based retention or legal hold policies.7
Offline backup means a backup copy is not continuously connected, such as tape, offline disk, offline vault or air-gapped repository.
Recommendations:
- keep at least one immutable/offline copy;
- separate backup admin from production admin;
- do not let apps or CI/CD delete backups;
- enable MFA delete or equivalent where available;
- monitor deletion and retention changes;
- test restore from immutable/offline copies.
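To illustrate what a retention period means in practice, here is a toy in-memory model of WORM behavior. It is not the S3 or Azure API. The `WormStore` class and its methods are hypothetical and only show the rule that overwrite and delete are refused until the retention period has passed.

```python
from datetime import datetime, timedelta


class RetentionError(Exception):
    """Raised when a locked object would be modified or deleted."""


class WormStore:
    """Toy model of write-once-read-many retention (illustration only)."""

    def __init__(self) -> None:
        # key -> (data, retain_until)
        self._objects: dict[str, tuple[bytes, datetime]] = {}

    def _require_unlocked(self, key: str, now: datetime) -> None:
        existing = self._objects.get(key)
        if existing is not None and now < existing[1]:
            raise RetentionError(f"{key} is locked until {existing[1].isoformat()}")

    def put(self, key: str, data: bytes, retain_for: timedelta, now: datetime) -> None:
        """Store an object; overwriting a locked object is refused."""
        self._require_unlocked(key, now)
        self._objects[key] = (data, now + retain_for)

    def delete(self, key: str, now: datetime) -> None:
        """Delete an object; deleting a locked object is refused."""
        self._require_unlocked(key, now)
        self._objects.pop(key, None)

    def exists(self, key: str) -> bool:
        return key in self._objects
```

The important property is that the refusal does not depend on who is asking: in this model even the caller that wrote the object cannot remove it early, which is what protects the copy when normal credentials are compromised.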
Types of backup
| Type | Description | Strength | Weakness |
|---|---|---|---|
| Full backup | backs up everything | simple restore | time and storage heavy |
| Incremental | backs up changes since previous backup | efficient | restore depends on chain |
| Differential | backs up changes since last full backup | easier restore than incremental | grows over time |
| Snapshot | captures volume/storage state | fast | may not be app-consistent |
| Logical backup | exports data in logical format | portable | slow for large databases |
| Physical backup | copies data files/blocks | efficient for large DBs | version/config sensitive |
| Continuous backup | journals changes continuously | low RPO | more complex and costly |
| Replication | copies changes elsewhere | fast failover | mistakes/deletions may replicate |
There is no universal best backup type. Choose based on RPO, RTO, data, cost and operations.
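The "restore depends on chain" trade-off in the table can be made concrete. The following sketch uses a hypothetical `Backup` record and computes which backups are needed to restore to the newest one, using the table's definitions: an incremental depends on the previous backup of any type, and a differential depends only on the last full backup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    id: str
    kind: str  # "full", "incremental" or "differential"


def restore_chain(history: list[Backup]) -> list[str]:
    """Return the backup ids needed to restore the newest backup, oldest first.

    `history` must be ordered oldest to newest.
    """
    chain: list[str] = []
    skip_to_full = False
    for backup in reversed(history):
        if backup.kind == "full":
            chain.append(backup.id)
            return list(reversed(chain))
        if skip_to_full:
            continue  # a differential already covers everything back to the full
        if backup.kind not in ("incremental", "differential"):
            raise ValueError(f"unknown backup kind: {backup.kind}")
        chain.append(backup.id)
        if backup.kind == "differential":
            skip_to_full = True
    raise ValueError("no full backup in history")
```

With a Sunday full backup and three daily incrementals the chain has four members, and losing any one of them breaks the restore. With three daily differentials it has two: the full backup and the newest differential.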
Application-consistent vs crash-consistent
Storage snapshots are fast, but not always safe for databases.
| Type | Meaning |
|---|---|
| Crash-consistent | like the system after sudden power loss |
| Application-consistent | app/database flushed data before backup |
| Transaction-consistent | database can recover to a consistent transaction point |
For databases, understand:
- whether snapshots integrate with database freeze/hooks;
- whether logs/WAL/binlogs are backed up;
- whether restore needs log replay;
- whether queries are tested after restore;
- whether multi-volume consistency is guaranteed.
Database backup
PostgreSQL
PostgreSQL documentation covers SQL dump, file-system level backup and continuous archiving/point-in-time recovery.8
Common strategy:
Daily base backup
+ continuous WAL archiving
= point-in-time recovery
Checklist:
- use `pg_dump` for smaller logical backups;
- use physical base backup for larger systems;
- archive WAL;
- test PITR;
- back up roles/users/extensions;
- verify checksums;
- monitor replication lag;
- encrypt backups;
- record PostgreSQL version.
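The "base backup plus WAL" strategy implies a planning step before any restore: pick the base backup to start from and confirm the archived WAL reaches the target time. The sketch below models only that decision in plain Python. It does not call PostgreSQL or any backup tool, the `BaseBackup` record and `plan_pitr` function are hypothetical, and it assumes the WAL archive has no gaps between the chosen backup's start and `wal_archived_through`.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BaseBackup:
    label: str
    started: datetime
    finished: datetime


def plan_pitr(
    backups: list[BaseBackup],
    wal_archived_through: datetime,
    target: datetime,
) -> tuple[str, datetime, datetime]:
    """Return (base backup label, replay WAL from, replay WAL until).

    The target must be after the chosen base backup finished, because a
    base backup cannot be used to recover to a time when it was still running.
    """
    if target > wal_archived_through:
        raise ValueError("WAL archive does not reach the target time")
    usable = [b for b in backups if b.finished <= target]
    if not usable:
        raise ValueError("no base backup finished before the target time")
    base = max(usable, key=lambda b: b.finished)
    return base.label, base.started, target
```

Choosing the newest usable base backup keeps the amount of WAL to replay, and therefore the restore time, as small as possible.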
MySQL
MySQL documentation includes backup and recovery guidance.9 Consider:
- logical dumps with `mysqldump`;
- physical backups with suitable tools;
- binary logs for point-in-time recovery;
- replication;
- GTID;
- InnoDB consistency;
- users/grants backup.
MongoDB
MongoDB documentation covers backups.10 Consider:
- snapshots for replica sets or sharded clusters;
- `mongodump` where appropriate;
- oplog-based recovery strategies;
- version compatibility;
- restore tests.
Kubernetes backup
Kubernetes backup is not just backing up container images. Back up:
- Kubernetes resources: Deployments, Services, Ingresses, ConfigMaps, Secrets, CRDs;
- PersistentVolumes and PVC data;
- Helm values;
- namespace labels/annotations;
- RBAC;
- admission policies;
- cluster-level configuration;
- external dependencies;
- etcd for self-managed clusters.
Kubernetes docs discuss backing up clusters, including backing up etcd for cluster state.11 Velero provides tools for backing up and restoring Kubernetes cluster resources and persistent volumes; its docs say Velero can back up clusters, restore after loss, migrate resources to other clusters and replicate production clusters to development/testing clusters.12
Example:
velero backup create prod-backup --include-namespaces app-prod
velero restore create --from-backup prod-backup
For managed Kubernetes, control plane etcd may be provider-managed, but application resources and persistent volumes still need a backup strategy.
SaaS backup
Many teams assume SaaS backup is fully handled by the provider. Verify the shared responsibility model.
Check:
- Google Workspace/Microsoft 365 retention;
- GitHub/GitLab repository backup;
- Jira/Confluence/Notion export;
- CRM/billing/helpdesk export;
- IAM/SSO configuration;
- audit logs;
- admin account recovery;
- deletion retention;
- API export rate limits;
- legal/compliance retention.
A SaaS provider may protect its infrastructure, but not always protect you from accidental deletion, insider misuse, bad retention policy or account compromise.
Backup encryption and key management
Backups often contain the most sensitive data. Encryption and key management matter.
Checklist:
- encryption in transit;
- encryption at rest;
- keys stored outside the backup repository, never only alongside the backups;
- key rotation plan;
- tested restore with real keys;
- key recovery process for staff turnover;
- separate backup read permissions from production permissions;
- log backup access;
- back up required key/certificate/secret-manager metadata.
A backup without a decryption key is not recoverable.
Ransomware recovery
Ransomware makes backup harder because attackers may:
- delete backups before encrypting production;
- encrypt online backup repositories;
- steal backup data;
- wait until clean backups are overwritten;
- compromise domain or backup admins;
- disable monitoring and logs;
- attack the identity provider.
Strategy:
Immutable/offline backup
+ privileged access hardening
+ backup anomaly detection
+ clean restore point identification
+ isolated recovery environment
+ malware scan before reconnect
+ identity rebuild plan
Do not restore directly into a production environment that may still be compromised. Use an isolated recovery environment first.
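Clean restore point identification can be sketched as a filter over the restore point catalog. This is an assumption-laden illustration: the `RestorePoint` record, the `scan_clean` flag and the 24-hour safety margin are hypothetical, and a real investigation would rely on forensic evidence rather than a single timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RestorePoint:
    id: str
    created: datetime
    scan_clean: Optional[bool]  # None means the point has not been scanned yet


def pick_clean_restore_point(
    points: list[RestorePoint],
    earliest_compromise: datetime,
    safety_margin: timedelta = timedelta(hours=24),
) -> Optional[RestorePoint]:
    """Newest restore point that predates the compromise and passed a malware scan."""
    cutoff = earliest_compromise - safety_margin
    candidates = [p for p in points if p.created <= cutoff and p.scan_clean is True]
    return max(candidates, key=lambda p: p.created) if candidates else None
```

Unscanned points are excluded on purpose: treating "not scanned" as "clean" is how a compromised image gets restored into the recovery environment.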
DR strategies
AWS Well-Architected discusses DR strategies with different cost and recovery characteristics.5
| Strategy | Description | Typical RTO/RPO | Cost |
|---|---|---|---|
| Backup and Restore | restore from backup when needed | higher | lower |
| Pilot Light | keep minimal core resources in DR site | medium | medium |
| Warm Standby | run smaller live version in DR site | lower | higher |
| Multi-site Active/Active | multiple active sites | very low | very high |
Do not choose active/active just because it sounds best. It increases complexity, cost and operational risk.
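One way to apply the table is to pick the cheapest strategy whose achievable RTO still meets the requirement. In the sketch below the strategy order follows the table, but the RTO figures are placeholder assumptions for illustration only and should be replaced with values measured in your own drills.

```python
from datetime import timedelta
from typing import Optional

# Ordered from lowest to highest cost, as in the table above.
# The achievable-RTO values are illustrative placeholders, not vendor figures.
STRATEGIES: list[tuple[str, timedelta]] = [
    ("Backup and Restore", timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=4)),
    ("Warm Standby", timedelta(minutes=30)),
    ("Multi-site Active/Active", timedelta(minutes=1)),
]


def cheapest_strategy(required_rto: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed RTO meets the requirement."""
    for name, achievable_rto in STRATEGIES:
        if achievable_rto <= required_rto:
            return name
    return None  # no listed strategy is fast enough
```

With these placeholder numbers, the internal blog from the earlier example (RTO of 2 days) lands on backup and restore, while the payment system (RTO of 30 minutes) needs at least a warm standby.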
What a DR runbook should include
A recovery runbook should be clear enough for on-call staff to execute.
Include:
- DR activation criteria;
- system priority list;
- system owners;
- RPO/RTO;
- emergency contacts;
- backup locations;
- required credentials/keys;
- restore order;
- restore commands;
- data validation steps;
- DNS/traffic switch steps;
- customer communication;
- rollback steps;
- incident evidence logging;
- post-incident checklist.
Do not store the only copy of the runbook in a wiki that might be down. Keep an independent/offline copy.
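The restore order in a runbook can be derived from a dependency map instead of being maintained by hand. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The service names in the usage note are hypothetical examples.

```python
from graphlib import TopologicalSorter


def restore_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services in an order where each one comes after its dependencies.

    `depends_on` maps a service to the set of services that must be restored
    first. A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For instance, `restore_order({"api": {"database", "identity"}, "web": {"api"}, "database": set(), "identity": set()})` places `database` and `identity` before `api`, and `api` before `web`. The relative order of services that do not depend on each other is not significant.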
Restore testing
A backup that was never restored is not proven.
Test types:
| Test | Goal |
|---|---|
| File restore test | restore a single file |
| Database restore test | restore DB into test environment |
| PITR test | restore to a specific point in time |
| Full system restore | rebuild full application |
| Region failover drill | move traffic to DR site |
| Tabletop exercise | test decisions and process |
| Game day | controlled realistic incident simulation |
Measure:
- actual restore time;
- actual data loss;
- runbook errors;
- missing credentials/keys;
- forgotten dependencies;
- slow manual steps.
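The first two measurements can be computed directly from timestamps recorded during a drill. This is a minimal sketch with hypothetical field names. It assumes RTO is measured from the start of the incident to the moment the service is validated as restored, and data loss from the last recovered write to the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    incident_time: datetime         # when the (simulated) outage began
    service_restored: datetime      # when the restored service passed validation
    last_recovered_write: datetime  # newest data present after the restore


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured restore time and data loss against the targets."""
    actual_rto = result.service_restored - result.incident_time
    actual_data_loss = result.incident_time - result.last_recovered_write
    return {
        "actual_rto": actual_rto,
        "actual_data_loss": actual_data_loss,
        "rto_met": actual_rto <= rto,
        "rpo_met": actual_data_loss <= rpo,
    }
```

Recording these values for every drill shows whether the paper RTO and RPO match what the team can actually achieve.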
Backup monitoring
Monitor:
- backup job success/failure;
- backup duration;
- backup size;
- restore point count;
- age of latest restore point;
- replication lag;
- immutable retention;
- object lock status;
- verification errors;
- ransomware-like deletion patterns;
- storage cost;
- restore test status.
Useful alerts:
Backup job failed twice
No new restore point in 6 hours
Backup size dropped by 80%
Retention/object lock changed
WAL/binlog archiving stopped
Velero backup failed
Restore test exceeded RTO
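Three of these alerts can be evaluated from a plain list of backup runs. The sketch below is illustrative: the `BackupRun` record, the alert identifiers and the default thresholds (6 hours, 80%) mirror the examples above but are otherwise assumptions, and a real setup would usually express the same rules in its monitoring system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupRun:
    finished: datetime
    succeeded: bool
    size_bytes: int


def backup_alerts(
    runs: list[BackupRun],
    now: datetime,
    max_age: timedelta = timedelta(hours=6),
    drop_threshold: float = 0.8,
) -> list[str]:
    """Return alert identifiers for a job's run history (ordered oldest first)."""
    alerts: list[str] = []
    # Two most recent runs both failed.
    if len(runs) >= 2 and not runs[-1].succeeded and not runs[-2].succeeded:
        alerts.append("backup_failed_twice")
    good = [r for r in runs if r.succeeded]
    # Newest successful backup is missing or older than allowed.
    if not good or now - good[-1].finished > max_age:
        alerts.append("restore_point_too_old")
    # Newest successful backup shrank sharply compared with the previous one.
    if len(good) >= 2 and good[-2].size_bytes > 0:
        drop = 1 - good[-1].size_bytes / good[-2].size_bytes
        if drop >= drop_threshold:
            alerts.append("backup_size_dropped")
    return alerts
```

The size check is worth keeping even when jobs report success: a backup that suddenly shrinks may mean a source was deleted, encrypted or silently excluded.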
30/60/90-day rollout roadmap
Days 1–30: inventory and risk reduction
- Inventory systems, databases, SaaS, storage and Kubernetes clusters.
- Define owner, RPO and RTO for critical systems.
- Verify whether existing backups can restore.
- Enable backups for critical databases.
- Create offsite backup copies.
- Separate backup admin and production admin roles.
- Encrypt backups.
- Add backup failure alerts.
- Write basic restore runbooks.
- Test restore for a small database.
Days 31–60: standardization and ransomware resilience
- Implement 3-2-1 or 3-2-1-1-0.
- Enable immutable backup/object lock for critical data.
- Enable WAL/binlog/PITR where low RPO is required.
- Back up IaC, GitOps repos and secrets metadata.
- Add Velero or another Kubernetes backup tool if using K8s.
- Create isolated recovery environment.
- Monitor backup age, size and anomalies.
- Review backup deletion permissions.
- Schedule restore testing.
- Run ransomware tabletop exercise.
Days 61–90: full DR maturity
- Select DR strategy: backup/restore, pilot light, warm standby or multi-site.
- Test full application restore.
- Measure actual RTO/RPO.
- Improve runbooks based on test results.
- Automate repeatable restore steps.
- Test DNS/traffic failover.
- Back up critical SaaS data.
- Build DR readiness dashboard.
- Test recovery when identity provider is unavailable.
- Review storage cost and retention.
Quick backup checklist
Data
- Databases.
- Object storage.
- File uploads.
- Persistent volumes.
- Config files.
- Secrets metadata.
- IaC/GitOps repos.
- SaaS exports.
- Audit logs.
- Encryption keys or key recovery procedure.
Technical controls
- Full + incremental/differential strategy.
- PITR for critical databases.
- Offsite copy.
- Immutable/offline copy.
- Encryption.
- Access control.
- Monitoring.
- Restore testing.
- Retention policy.
- Documentation.
Process
- Owner.
- RPO/RTO.
- Runbook.
- Escalation.
- Communication plan.
- Restore approval.
- Evidence collection.
- Postmortem.
- Periodic DR drills.
Common mistakes
- Backups exist but were never restored.
- Backups live in the same compromised account.
- No immutable/offline copy.
- No transaction log backup.
- No config/IaC backup.
- No RPO/RTO.
- Paper RTO does not match real restore time.
- Lost backup key.
- Unencrypted backups with secrets.
- Database backed up but file uploads forgotten.
- Kubernetes manifests backed up but PV data missing.
- Restoring directly into a still-compromised environment.
- No backup failure monitoring.
- No backup storage cost review.
- No accountable owner.
Reference tooling
| Need | Tool/service |
|---|---|
| File/server backup | Restic, BorgBackup, Duplicity |
| Kubernetes backup | Velero, Kasten K10, cloud-native backup |
| Database backup | pgBackRest, WAL-G, Percona XtraBackup, native tools |
| Object immutability | S3 Object Lock, Azure Immutable Blob, GCS retention policy |
| Snapshots | cloud snapshots, storage array snapshots |
| Backup monitoring | Prometheus/Grafana, backup software alerts |
| DR orchestration | cloud DR services, Terraform/OpenTofu, runbooks |
| Config backup | Git, artifact registry, IaC repositories |
| Ransomware protection | immutable backup, EDR, identity hardening, segmentation |
Restic is an open-source backup program focused on secure, efficient backup;13 BorgBackup is a popular deduplicating backup program for Linux/Unix-like systems.14
FAQ
How are Backup and Disaster Recovery different?
Backup is a copy of data or systems. Disaster Recovery is the strategy and process for restoring service after a major incident according to defined RPO/RTO.
What are RPO and RTO?
RPO is maximum acceptable data loss. RTO is maximum acceptable time to restore service.
What is 3-2-1 backup?
3-2-1 backup means 3 copies of data, on 2 different storage/media types, with 1 offsite copy. For ransomware, add immutable/offline copies and restore verification.
Does immutable backup stop ransomware?
It reduces the risk of ransomware deleting or encrypting backups, but it is not enough alone. You still need access separation, monitoring, identity security, malware scanning and isolated recovery.
Do Kubernetes clusters need backup?
Yes. Back up resources, CRDs, RBAC, ConfigMaps, Secrets and PersistentVolumes. Velero is a common Kubernetes backup/restore tool.12
How often should restore testing happen?
It depends on criticality. Critical systems should be tested monthly or quarterly and after major changes. At minimum, every important backup path needs a real restore test.
Conclusion
Backup and Disaster Recovery are core IT resilience capabilities. A system without reliable backups may lose data permanently; a system with backups but no DR plan may still suffer unacceptable downtime or restore the wrong data. The core work is defining RPO/RTO, keeping offsite and immutable copies, protecting keys and access, testing restores and maintaining clear runbooks.
A practical rollout starts with inventory, checking current backups, testing restore, adding monitoring and creating offsite copies. Then mature into immutable backups, database PITR, Kubernetes backup, isolated recovery, DR drills and recovery automation. A backup has value only when you have proven it can be restored.
References
Footnotes
1. GitHub Open Graph preview image for `vmware-tanzu/velero`. https://opengraph.githubassets.com/backup-dr-guide/vmware-tanzu/velero
2. GitHub Open Graph preview image for `restic/restic`. https://opengraph.githubassets.com/backup-dr-guide/restic/restic
3. GitHub Open Graph preview image for `borgbackup/borg`. https://opengraph.githubassets.com/backup-dr-guide/borgbackup/borg
4. NIST SP 800-34 Rev. 1. “Contingency Planning Guide for Federal Information Systems.” https://csrc.nist.gov/pubs/sp/800/34/r1/final
5. AWS Well-Architected Reliability Pillar. “Plan for Disaster Recovery (DR).” https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-for-disaster-recovery-dr.html
6. AWS S3 User Guide. “Using S3 Object Lock.” https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html
7. Microsoft Learn. “Immutable storage for Azure Blob Storage.” https://learn.microsoft.com/en-us/azure/storage/blobs/immutable-storage-overview
8. PostgreSQL Docs. “Backup and Restore.” https://www.postgresql.org/docs/current/backup.html
9. MySQL Docs. “Backup and Recovery.” https://dev.mysql.com/doc/refman/8.4/en/backup-and-recovery.html
10. MongoDB Docs. “Backups.” https://www.mongodb.com/docs/manual/core/backups/
11. Kubernetes Docs. “Backing up a cluster.” https://kubernetes.io/docs/concepts/cluster-administration/backing-up/
12. Velero Docs. “Overview.” https://velero.io/docs/main/
13. Restic documentation. https://restic.readthedocs.io/en/stable/
14. BorgBackup official website. https://www.borgbackup.org/
Written by PixelRouter Editorial Team
We publish deep, authoritative guides on AI infrastructure, API gateway security, cloud financial management, and system optimizations for developers.