Disaster Recovery and Business Continuity Planning

Disaster recovery (DR) and business continuity (BC) ensure an organization can recover from disruptions. DR focuses on restoring IT systems after an incident; BC ensures critical business functions continue during and after a disaster.

Business Continuity vs Disaster Recovery

Business Continuity (BC) is the broader plan — how the organization maintains essential functions during and after a disaster. It covers people, processes, facilities, and technology.

Disaster Recovery (DR) is a subset of BC — specifically focused on restoring IT systems and data after a disruptive event. DR is about getting technology back online.

Both are documented in formal plans: BCP (Business Continuity Plan) and DRP (Disaster Recovery Plan). These plans are tested through tabletop exercises (discussion-based) and full-scale drills (operational).

  • BC: keeps the business running during a disruption (broader)
  • DR: restores IT systems after a disaster (technology-focused)
  • BCP and DRP should be tested regularly through exercises
  • Plans should be reviewed and updated at least annually, and after any major change to systems, staff, or facilities

RTO and RPO

Recovery Time Objective (RTO) — the maximum acceptable time a system can be unavailable after a disaster. For example, an RTO of 4 hours means the system must be restored within 4 hours. RTO drives decisions about redundancy and failover.

Recovery Point Objective (RPO) — the maximum acceptable data loss measured in time. For example, an RPO of 1 hour means up to 1 hour of data could be lost. RPO drives backup frequency.

Both are determined during the Business Impact Analysis (BIA). Shorter RTO and RPO targets cost more (more infrastructure, more frequent backups), so the BIA is where cost is balanced against recovery requirements.

  • RTO = time to restore (how long can you be down?)
  • RPO = acceptable data loss (how much data can you lose?)
  • RTO drives infrastructure decisions (hot site, cold site)
  • RPO drives backup strategy (hourly, daily, real-time replication)
  • Shorter RTO/RPO = higher cost
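
The relationship between backup frequency, RPO, and RTO can be checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the worst case where a failure happens just before the next scheduled backup (it ignores how long the backup itself takes to run).

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumed worst case: the failure hits just before the next backup runs."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service fits within the RTO."""
    return estimated_recovery <= rto


rpo = timedelta(hours=1)
rto = timedelta(hours=4)

print(meets_rpo(timedelta(hours=24), rpo))    # False: nightly backups vs a 1-hour RPO
print(meets_rpo(timedelta(minutes=15), rpo))  # True: 15-minute backups fit a 1-hour RPO
print(meets_rto(timedelta(hours=6), rto))     # False: a 6-hour restore misses a 4-hour RTO
```

If either check fails, the options are the ones listed above: back up more often (RPO) or invest in faster recovery infrastructure (RTO).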

Backup Strategies

Full Backup — copies all data. Slowest to create, fastest to restore. Large storage requirement.

Incremental Backup — copies only data changed since the last backup (full or incremental). Fastest to create, but restore requires the last full backup + all incrementals in sequence.

Differential Backup — copies data changed since the last full backup. Larger than incremental but faster to restore (only need the full + latest differential).
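
The restore difference between incremental and differential backups is easiest to see by listing which backup sets a restore needs. The following is a minimal sketch, not a real backup tool: the `Backup` record and `restore_chain` function are hypothetical, and it assumes a schedule uses either incrementals or differentials after each full backup, never both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups needed, in restore order, to reach the latest state."""
    ordered = sorted(backups, key=lambda b: b.day)
    # Locate the most recent full backup (raises ValueError if there is none).
    last_full = max(i for i, b in enumerate(ordered) if b.kind == "full")
    full, later = ordered[last_full], ordered[last_full + 1:]
    incrementals = [b for b in later if b.kind == "incremental"]
    differentials = [b for b in later if b.kind == "differential"]
    if incrementals and differentials:
        raise ValueError("mixed schedules are outside the scope of this sketch")
    if differentials:
        # Each differential holds everything since the full: only the latest is needed.
        return [full, differentials[-1]]
    # Each incremental holds only changes since the previous backup: all are needed.
    return [full, *incrementals]


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print(len(restore_chain(incremental_week)))   # 4: the full plus three incrementals
print(len(restore_chain(differential_week)))  # 2: the full plus the latest differential
```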

The 3-2-1 Rule: keep at least 3 copies of data, on 2 different media types, with 1 copy offsite. Offsite backups protect against physical disasters like fire or flood.
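
The rule can be expressed as three simple checks. This is a sketch under assumptions: the `Copy` record and the example inventory are hypothetical, and the production data is counted as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Check for 3+ copies, 2+ media types, and at least 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    Copy("production", "disk", offsite=False),
    Copy("local NAS", "disk", offsite=False),
    Copy("cloud bucket", "cloud", offsite=True),
]

print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: two copies, one media type, none offsite
```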

Backup destinations: disk (local NAS), tape (long-term archival), cloud (S3, Azure Blob), or hybrid.

  • Full: complete copy — slow backup, fast restore
  • Incremental: changes since last backup — fast backup, slow restore
  • Differential: changes since last full — medium backup, medium restore
  • 3-2-1 Rule: 3 copies, 2 media, 1 offsite
  • Test restores regularly — backups are useless if you can't restore

Recovery Sites

Hot Site — a fully operational duplicate of the primary site. Has all equipment, data, and personnel ready to take over immediately. RTO of minutes to hours. Most expensive option.

Warm Site — partially configured with some equipment but not fully operational. Data may be slightly out of sync. RTO of hours to days. Balance of cost and speed.

Cold Site — just the facility (power, cooling, cabling) but no equipment. Everything must be procured and configured after a disaster. RTO of weeks. Cheapest option.

Other considerations: mobile sites (temporary facilities like a portable datacenter in a trailer) and cloud-based recovery (failover to cloud infrastructure).

  • Hot site: fully operational, take over immediately — highest cost
  • Warm site: partially configured, needs some setup — moderate cost
  • Cold site: empty facility, full setup required — lowest cost
  • Cloud DR: replicate to cloud provider, fail over when needed
  • Site selection: the recovery site must be far enough from the primary site that a single regional disaster cannot take out both
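
The cost and recovery-time trade-off amounts to picking the cheapest site type that still meets the required RTO. The sketch below illustrates that reasoning only. The recovery times in `SITES` are assumed round numbers chosen to match the rough ranges above (hours, days, weeks), not figures from any standard, and `cheapest_site_for` is a hypothetical helper.

```python
from datetime import timedelta
from typing import Optional

# Site types ordered cheapest first, each with an ASSUMED typical recovery time.
SITES = [
    ("cold", timedelta(weeks=3)),
    ("warm", timedelta(days=3)),
    ("hot", timedelta(hours=4)),
]


def cheapest_site_for(rto: timedelta) -> Optional[str]:
    """Return the cheapest site type whose assumed recovery time fits the RTO."""
    for name, recovery_time in SITES:
        if recovery_time <= rto:
            return name
    return None  # Not even a hot site is fast enough under these assumptions.


print(cheapest_site_for(timedelta(weeks=4)))    # cold
print(cheapest_site_for(timedelta(days=7)))     # warm
print(cheapest_site_for(timedelta(hours=4)))    # hot
print(cheapest_site_for(timedelta(minutes=5)))  # None
```

A result of `None` suggests the requirement calls for high availability with automatic failover rather than a recovery site alone.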

High Availability and Failover

High availability (HA) eliminates single points of failure through redundancy. Common HA techniques: load balancing (distributing traffic across multiple servers), clustering (multiple servers acting as one), redundant power supplies, RAID (disk redundancy), and redundant network paths.

Failover is the automatic switch to a redundant system when the primary fails. In an Active-Passive setup, a standby system takes over; in an Active-Active setup, all systems handle traffic, and if one fails the others absorb its load.

Failback — returning to the primary site after it's restored and verified.
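
The two failover models differ in where traffic goes before and after a failure. The following is a simplified sketch: the `Node` class and both routing functions are hypothetical, and health is a plain flag rather than the result of a real health check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def active_passive_target(primary: Node, standby: Node) -> Node:
    """All traffic goes to the primary; the standby is used only if it is down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy node available")


def active_active_targets(nodes: list[Node]) -> list[Node]:
    """Traffic is shared by every healthy node; survivors absorb a failed node's load."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node available")
    return healthy


primary, standby = Node("primary"), Node("standby")
print(active_passive_target(primary, standby).name)  # primary
primary.healthy = False                              # simulate a primary failure
print(active_passive_target(primary, standby).name)  # standby (failover)
primary.healthy = True                               # primary restored and verified
print(active_passive_target(primary, standby).name)  # primary (failback)

cluster = [Node("a"), Node("b"), Node("c")]
cluster[1].healthy = False
print([n.name for n in active_active_targets(cluster)])  # ['a', 'c']
```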

  • HA eliminates single points of failure through redundancy
  • Load balancers distribute traffic and fail over automatically by routing around servers that stop responding
  • RAID 1/5/6/10 provide disk redundancy
  • Active-Passive: standby takes over on failure
  • Active-Active: all systems active, failure means reduced capacity but no downtime
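
The RAID levels listed above trade usable capacity for fault tolerance in different ways. The sketch below summarizes the standard figures for each level. The `raid_summary` helper is hypothetical, and it assumes equal-size disks and, for RAID 10, two-way mirrors.

```python
def raid_summary(level: int, disks: int) -> tuple[int, int]:
    """Return (usable capacity in disks, disk failures guaranteed to be survivable)."""
    if level == 1:
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return 1, disks - 1        # every disk is a mirror of the same data
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return disks - 1, 1        # one disk's worth of parity
    if level == 6:
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return disks - 2, 2        # two disks' worth of parity
    if level == 10:
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Guaranteed to survive 1 failure; more only if they hit different mirror pairs.
        return disks // 2, 1
    raise ValueError(f"unsupported RAID level: {level}")


for level, disks in [(1, 2), (5, 4), (6, 4), (10, 4)]:
    usable, tolerated = raid_summary(level, disks)
    print(f"RAID {level} with {disks} disks: {usable} usable, survives {tolerated} failure(s)")
```

RAID protects against disk failure only; it does not replace the backups described earlier.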

Exam Tip

RTO vs RPO is one of the most commonly tested distinctions. RTO = time to restore (acceptable downtime), RPO = acceptable data loss (backup frequency). Hot/warm/cold sites: know the cost and recovery time trade-offs. The 3-2-1 backup rule and RAID levels are frequently tested.