Disaster Recovery and Business Continuity Planning
Disaster recovery (DR) and business continuity (BC) ensure an organization can recover from disruptions. DR focuses on restoring IT systems after an incident; BC ensures critical business functions continue during and after a disaster.
Business Continuity vs Disaster Recovery
Business Continuity (BC) is the broader plan — how the organization maintains essential functions during and after a disaster. It covers people, processes, facilities, and technology.
Disaster Recovery (DR) is a subset of BC — specifically focused on restoring IT systems and data after a disruptive event. DR is about getting technology back online.
Both are documented in formal plans: BCP (Business Continuity Plan) and DRP (Disaster Recovery Plan). These plans are tested through tabletop exercises (discussion-based) and full-scale drills (operational).
- BC: keeps the business running during a disruption (broader)
- DR: restores IT systems after a disaster (technology-focused)
- BCP and DRP should be tested regularly through exercises
- Plans should be reviewed and updated at least annually, and after major changes to systems, staff, or facilities
RTO and RPO
Recovery Time Objective (RTO) — the maximum acceptable time a system can be unavailable after a disaster. For example, an RTO of 4 hours means the system must be restored within 4 hours. RTO drives decisions about redundancy and failover.
Recovery Point Objective (RPO) — the maximum acceptable data loss measured in time. For example, an RPO of 1 hour means up to 1 hour of data could be lost. RPO drives backup frequency.
Both are determined during the Business Impact Analysis (BIA). Shorter RTO/RPO mean higher cost (more infrastructure, more frequent backups). The BIA helps balance cost vs recovery requirements.
- RTO = time to restore (how long can you be down?)
- RPO = acceptable data loss (how much data can you lose?)
- RTO drives infrastructure decisions (hot site, cold site)
- RPO drives backup strategy (hourly, daily, real-time replication)
- Shorter RTO/RPO = higher cost
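As a rough illustration of how the two objectives are checked, the sketch below compares a backup schedule and an estimated restore time against the 4-hour RTO and 1-hour RPO used in the examples above. The class and function names are hypothetical, and the sketch assumes worst-case data loss equals the full interval between backups.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    rto_hours: float  # maximum tolerable downtime
    rpo_hours: float  # maximum tolerable data loss, expressed as time


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    # Assumption: a failure just before the next backup loses
    # everything written since the previous backup.
    return backup_interval_hours


def meets_targets(targets: RecoveryTargets,
                  backup_interval_hours: float,
                  estimated_restore_hours: float) -> dict:
    return {
        "rpo_met": worst_case_data_loss_hours(backup_interval_hours) <= targets.rpo_hours,
        "rto_met": estimated_restore_hours <= targets.rto_hours,
    }


targets = RecoveryTargets(rto_hours=4, rpo_hours=1)
nightly = meets_targets(targets, backup_interval_hours=24, estimated_restore_hours=3)
hourly = meets_targets(targets, backup_interval_hours=1, estimated_restore_hours=3)
print(nightly)  # {'rpo_met': False, 'rto_met': True}
print(hourly)   # {'rpo_met': True, 'rto_met': True}
```

A nightly backup can restore in time but still miss the RPO, which is why the two objectives are assessed separately.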
Backup Strategies
Full Backup — copies all data. Slowest to create, fastest to restore. Large storage requirement.
Incremental Backup — copies only data changed since the last backup (full or incremental). Fastest to create, but restore requires the last full backup + all incrementals in sequence.
Differential Backup — copies data changed since the last full backup. Larger than incremental but faster to restore (only need the full + latest differential).
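The restore difference between incremental and differential backups can be sketched as follows. This is a minimal illustration, not a real backup tool: the `restore_chain` function and the day labels are hypothetical, and the sketch assumes a schedule uses only one backup type after each full backup.

```python
def restore_chain(backups):
    """Return the labels of the backups needed to restore the latest state.

    `backups` is a chronological list of (label, kind) tuples, where kind
    is "full", "incremental", or "differential".
    """
    # Every restore starts from the most recent full backup.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    later = backups[last_full + 1:]
    kinds = {kind for _, kind in later}

    if kinds == {"incremental"}:
        # Each incremental holds only changes since the previous backup,
        # so all of them are needed, in order.
        chain += [label for label, _ in later]
    elif kinds == {"differential"}:
        # Each differential holds all changes since the full backup,
        # so only the latest one is needed.
        chain.append(later[-1][0])
    elif kinds:
        raise ValueError("mixed schedules are out of scope for this sketch")
    return chain


incremental_week = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]

print(restore_chain(incremental_week))   # ['sun', 'mon', 'tue', 'wed']
print(restore_chain(differential_week))  # ['sun', 'wed']
```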
The 3-2-1 Rule: keep at least 3 copies of data, on 2 different media types, with 1 copy offsite. Offsite backups protect against physical disasters like fire or flood.
Backup destinations: disk (local NAS), tape (long-term archival), cloud (S3, Azure Blob), or hybrid.
- Full: complete copy — slow backup, fast restore
- Incremental: changes since last backup — fast backup, slow restore
- Differential: changes since last full — medium backup, medium restore
- 3-2-1 Rule: 3 copies, 2 media, 1 offsite
- Test restores regularly — backups are useless if you can't restore
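The 3-2-1 rule is simple enough to express as a check. The sketch below is illustrative only: `BackupCopy` and `satisfies_3_2_1` are hypothetical names, and it assumes the list passed in contains every copy you count toward the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies):
    enough_copies = len(copies) >= 3                    # at least 3 copies
    enough_media = len({c.media for c in copies}) >= 2  # on 2 media types
    has_offsite = any(c.offsite for c in copies)        # 1 copy offsite
    return enough_copies and enough_media and has_offsite


plan = [
    BackupCopy("production data", media="disk", offsite=False),
    BackupCopy("local NAS backup", media="disk", offsite=False),
    BackupCopy("cloud backup", media="cloud", offsite=True),
]
print(satisfies_3_2_1(plan))      # True
print(satisfies_3_2_1(plan[:2]))  # False: two copies, one media type, none offsite
```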
Recovery Sites
Hot Site — a fully operational duplicate of the primary site. Has all equipment, data, and personnel ready to take over immediately. RTO of minutes to hours. Most expensive option.
Warm Site — partially configured with some equipment but not fully operational. Data may be slightly out of sync. RTO of hours to days. Balance of cost and speed.
Cold Site — just the facility (power, cooling, cabling) but no equipment. Everything must be procured and configured after a disaster. RTO of weeks. Cheapest option.
Other considerations: mobile sites (temporary facilities like a portable datacenter in a trailer) and cloud-based recovery (failover to cloud infrastructure).
- Hot site: fully operational, take over immediately — highest cost
- Warm site: partially configured, needs some setup — moderate cost
- Cold site: empty facility, full setup required — lowest cost
- Cloud DR: replicate to cloud provider, fail over when needed
- Site selection: the recovery site must be geographically separated from the primary site so a single regional disaster cannot affect both
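The cost versus recovery-time trade-off can be sketched as picking the cheapest site type that still meets a required RTO. The recovery times below are illustrative assumptions chosen from within the ranges above (hot: 1 hour, warm: 48 hours, cold: 2 weeks), not standard figures, and the function name is hypothetical.

```python
# (site type, assumed recovery time in hours, relative cost rank: 1 = cheapest)
SITES = [
    ("cold", 14 * 24, 1),
    ("warm", 48, 2),
    ("hot", 1, 3),
]


def cheapest_site_for(rto_hours):
    """Return the cheapest site type whose assumed recovery time fits the RTO."""
    candidates = [site for site in SITES if site[1] <= rto_hours]
    if not candidates:
        return None  # no site type in this table is fast enough
    return min(candidates, key=lambda site: site[2])[0]


print(cheapest_site_for(4))        # hot
print(cheapest_site_for(72))       # warm
print(cheapest_site_for(30 * 24))  # cold
```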
High Availability and Failover
High availability (HA) eliminates single points of failure through redundancy. Common HA techniques: load balancing (distributing traffic across multiple servers), clustering (multiple servers acting as one), redundant power supplies, RAID (disk redundancy), and redundant network paths.
Failover: automatic switching to a redundant system when the primary fails. In an active-passive setup, a standby system sits idle and takes over on failure. In an active-active setup, all systems handle traffic, and if one fails the others absorb its load.
Failback: returning operations to the primary system or site after it has been restored and verified.
- HA eliminates single points of failure through redundancy
- Load balancers distribute traffic and provide automatic failover
- RAID 1/5/6/10 provide disk redundancy; RAID 0 stripes data for speed and provides none
- Active-Passive: standby takes over on failure
- Active-Active: all systems active, failure means reduced capacity but no downtime, provided the remaining systems can absorb the load
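The two failover models can be contrasted in a few lines. This is a toy sketch with hypothetical names (`Node`, `active_passive_target`, `active_active_targets`); real load balancers and clusters detect failures through health checks rather than a flag set by hand.

```python
from itertools import cycle


class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


def active_passive_target(primary, standby):
    # The standby receives traffic only when the primary is down.
    if primary.healthy:
        return primary.name
    if standby.healthy:
        return standby.name
    raise RuntimeError("no healthy node available")


def active_active_targets(nodes, request_count):
    # All healthy nodes share traffic in round-robin order;
    # a failed node is skipped and the rest absorb its share.
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node available")
    rotation = cycle(healthy)
    return [next(rotation).name for _ in range(request_count)]


primary, standby = Node("primary"), Node("standby")
print(active_passive_target(primary, standby))  # primary
primary.healthy = False
print(active_passive_target(primary, standby))  # standby

pool = [Node("a", healthy=False), Node("b"), Node("c")]
print(active_active_targets(pool, 4))  # ['b', 'c', 'b', 'c']
```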
Exam Tip
RTO vs RPO is one of the most commonly tested distinctions. RTO = time to restore (how much downtime is acceptable), RPO = data loss (how far back the last usable backup can be). Hot/warm/cold sites: know the cost and recovery time trade-offs. The 3-2-1 backup rule and RAID levels are frequently tested.