# Disaster Recovery Concepts
[[Managing Service and Data Availability]]
## High Availability
- Availability and high availability
- Percentage uptime
- Maximum Tolerable Downtime (MTD) -> longest outage the business can absorb; the flip side of the uptime target (e.g., 99.99% availability)
- Recovery Metrics
- Recovery Time Objective (RTO) -> how long a system may stay offline after a disaster, i.e., how long it takes to bring a replacement system online
- Work Recovery Time (WRT) -> how long it takes after the system is back before its clients can use it for work again
- Recovery Point Objective (RPO) -> how much data I can afford to lose if I lose a machine, measured in time (sets how often backups run)
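
The uptime percentage and the downtime it permits are linked by simple arithmetic. A minimal sketch, assuming a 365-day year; the function names are illustrative, and the MTD check assumes the commonly taught rule that RTO plus WRT should not exceed MTD:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year (in minutes) permitted by an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def fits_within_mtd(rto_hours: float, wrt_hours: float, mtd_hours: float) -> bool:
    """Check that system recovery (RTO) plus work recovery (WRT) fits in the MTD."""
    return rto_hours + wrt_hours <= mtd_hours


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime -> {minutes:.2f} minutes of downtime per year")
    # Hypothetical figures: 4 h to restore the server, 2 h to resume work, 8 h MTD.
    print(fits_within_mtd(rto_hours=4, wrt_hours=2, mtd_hours=8))
```

For a 99.99% target this works out to a little under an hour of downtime per year.
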
## Fault Tolerance and Redundancy
- Reliability metrics
- Mean Time Between Failures (MTBF) -> expected operating time between failures of a repairable system
- `operational time / # of failures`
- Mean Time to Failure (MTTF) -> expected lifetime of a non-repairable component
- `operational time / # of devices`
- Mean Time to Repair (MTTR) -> average time to repair and recover from an incident
- `unplanned maintenance time / # of incidents`
- Redundant system types
- Hardware spares, network links, power systems, system and data backups, cluster services, etc.
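
The reliability formulas above can be checked with a small worked example. This is a sketch using hypothetical test figures (10 devices run for 50 hours each, 2 of them fail); the `availability` helper uses the standard steady-state relation MTBF / (MTBF + MTTR), which these notes do not state explicitly:

```python
def mtbf(total_operational_hours: float, failures: int) -> float:
    """Mean time between failures: operational time / number of failures."""
    return total_operational_hours / failures


def mttf(total_operational_hours: float, devices: int) -> float:
    """Mean time to failure, using the notes' formula: operational time / devices."""
    return total_operational_hours / devices


def mttr(unplanned_maintenance_hours: float, incidents: int) -> float:
    """Mean time to repair: unplanned maintenance time / number of incidents."""
    return unplanned_maintenance_hours / incidents


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    devices, hours_each, failures = 10, 50, 2  # hypothetical test run
    total_hours = devices * hours_each
    print("MTBF:", mtbf(total_hours, failures), "hours per failure")
    print("MTTF:", mttf(total_hours, devices), "hours per device")
    print("MTTR:", mttr(6, 3), "hours per incident")
    print(f"Availability: {availability(mtbf(total_hours, failures), mttr(6, 3)):.4%}")
```
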
## Recovery Sites
- Alternate processing sites that will not be affected by the same disaster event
- Hot site
- Failover in seconds or minutes
- Live site that mirrors production systems and data
- Warm site
- Failover in hours
- Equipment is in place, but current data (e.g., database updates) may need to be loaded first
- Cold site
- Failover in days
- Empty facility with only the basics (space, chairs, power); equipment must be brought in and set up
- Cloud site
- Transfer responsibilities to cloud provider
- Cannot transfer all the risk
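
The failover times above suggest a simple way to shortlist a site type from an RTO. A rough sketch only: the worst-case failover figures and the cheapest-first ordering (cold, then warm, then hot) are assumptions for illustration, not values from these notes:

```python
from datetime import timedelta

# Assumed worst-case failover time per site type, ordered cheapest first.
# The figures only mirror "days", "hours", and "seconds or minutes".
SITE_FAILOVER = [
    ("cold", timedelta(days=7)),
    ("warm", timedelta(hours=24)),
    ("hot", timedelta(minutes=15)),
]


def cheapest_site_for(rto: timedelta) -> str:
    """Return the least expensive site type whose failover time meets the RTO."""
    for name, failover in SITE_FAILOVER:
        if failover <= rto:
            return name
    raise ValueError("no site type in the table meets this RTO")


if __name__ == "__main__":
    for rto in (timedelta(hours=1), timedelta(days=2), timedelta(days=30)):
        print(rto, "->", cheapest_site_for(rto))
```
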
## Facilities and Infrastructure Support
- Heating, ventilation, air conditioning (HVAC)
- Temperature sensors and moisture detection sensors
- Office areas vs datacenter/equipment rooms
- Fire suppression
- Emergency procedures and alarms
- Portable extinguisher usage
- Sprinkler systems
## Power Management
- Spikes, surges, brownouts, and blackouts
- Power Distribution Unit (PDU)
- Filter and stabilize grid power and facilitate remote monitoring
- Battery backups and uninterruptible power supplies (UPSs)
- Battery-backed cache
- UPS runtime
- Generator
- Replacement for grid power
- Must be used with a UPS, which carries the load while the generator starts up
- Renewable power sources
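
UPS runtime matters mainly because the UPS has to carry the load until the generator is online. A back-of-the-envelope sketch, assuming a linear model (usable battery energy divided by load); real batteries discharge non-linearly, and the efficiency value, safety factor, and function names here are all assumptions:

```python
def ups_runtime_minutes(battery_wh: float, load_watts: float,
                        efficiency: float = 0.9) -> float:
    """Estimate runtime in minutes from battery capacity (Wh) and load (W)."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_watts * 60


def ups_bridges_generator(runtime_minutes: float, generator_start_minutes: float,
                          safety_factor: float = 2.0) -> bool:
    """Check the UPS can cover generator start-up with some margin to spare."""
    return runtime_minutes >= generator_start_minutes * safety_factor


if __name__ == "__main__":
    runtime = ups_runtime_minutes(battery_wh=1000, load_watts=600)
    print(f"Estimated runtime: {runtime:.1f} minutes")
    print("Covers a 5-minute generator start:", ups_bridges_generator(runtime, 5))
```
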
## Network Device Backup Management
- Network appliance configuration backup
- Startup vs running config
- Version history and rollback
- Backup modes
- State/bare metal
- Config file
- Backing up/logging other state information
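
The startup vs running config distinction is easy to check once both files are exported as text: any difference means unsaved changes that a reboot would lose. A minimal sketch using Python's standard `difflib`; the sample configuration lines are made up for illustration:

```python
import difflib


def config_drift(startup: str, running: str) -> list[str]:
    """Return a unified diff between the startup and running configurations."""
    return list(difflib.unified_diff(
        startup.splitlines(),
        running.splitlines(),
        fromfile="startup-config",
        tofile="running-config",
        lineterm="",
    ))


# Hypothetical device configuration used only for this example.
STARTUP = "hostname edge-sw1\ninterface Gi0/1\n description uplink\n"
RUNNING = STARTUP + "ntp server 192.0.2.10\n"

if __name__ == "__main__":
    drift = config_drift(STARTUP, RUNNING)
    if drift:
        print("Running config has unsaved changes:")
        print("\n".join(drift))
    else:
        print("Startup and running configs match")
```

Keeping each exported copy in version control gives the version history and rollback mentioned above.
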
# High Availability Concepts
## Multipathing
- Multiple physical links between nodes
- Routed internetwork
- SAN multipathing
- Multiple ISPs
- Diverse paths
- Ensure physical separation of first mile links to ISPs
- Ensure independence of ISP's networks
## Link Aggregation/NIC Teaming
- Bundle multiple physical links into a single channel
- Channel can use combined bandwidth of links
- Channel redundancy against link failure
- IEEE 802.3ad/802.1AX
- Link aggregation group (LAG)
- Link Aggregation Control Protocol (LACP)
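
A bundled channel typically spreads traffic by hashing each flow onto one member link, so packets of the same flow stay in order while different flows use the combined bandwidth. The sketch below illustrates the idea only: the CRC32 hash and the choice of key fields are assumptions, since real switches use vendor-specific hash policies:

```python
import zlib


def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              active_links: list[str]) -> str:
    """Map a flow onto one active member link of the aggregation group."""
    if not active_links:
        raise RuntimeError("no active links left in the group")
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return active_links[zlib.crc32(key) % len(active_links)]


if __name__ == "__main__":
    links = ["eth0", "eth1", "eth2"]  # hypothetical member interfaces
    flow = ("10.0.0.5", "10.0.1.9", 49152, 443)
    chosen = pick_link(*flow, links)
    print("Flow uses", chosen)

    # Simulate a link failure: the flow is rehashed onto a surviving link.
    survivors = [link for link in links if link != chosen]
    print("After failure, flow uses", pick_link(*flow, survivors))
```

One consequence of per-flow hashing is that a single flow is limited to the bandwidth of one member link.
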
## Load Balancers
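- Does not process requests itself; it only forwards them to the nodes behind it
- Layer 4 -> forwarding decisions based on IP address and TCP/UDP port
- Layer 7 -> forwarding decisions based on application data such as URLs and headers (the same layer a WAF inspects)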
- Distribute client requests
- place in front of a server farm or resource pool
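
Distributing client requests across a server farm can be sketched with the simplest scheduling method, round robin, plus a health flag per server. This is an illustrative toy, not how any particular product works; the class and method names are made up:

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backend servers in turn, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers in the pool")
        # At most one full lap is needed to reach a healthy server.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers in the pool")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    print([lb.next_server() for _ in range(4)])
    lb.mark_down("web2")  # e.g., a failed health check
    print([lb.next_server() for _ in range(3)])
```
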
## Redundant Hardware/Clusters
- Nodes that must share common data
- Virtual IP
- External address for service shared by processing nodes
- Common Address Redundancy Protocol (CARP)
- Active-passive clustering -> one node handles the load while the other waits to take over
- Active-active clustering -> both nodes share responsibility
## First Hop Redundancy
- Provision multiple default gateways without complex routing on hosts
- Hot Standby Router Protocol (HSRP)
- Routers share common virtual IP and MAC
- Standby group: the active router is chosen by priority, and a standby router is ready to take over
- Virtual Router Redundancy Protocol (VRRP)
- No specific standby: one master router, and every other router in the group acts as a backup
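
Both protocols come down to the same idea: the routers agree on who currently answers for the shared virtual IP, and another router takes over if that one fails. A simplified sketch of priority-based election, assuming the highest priority wins and ties go to the highest interface IP; timers, preemption, and hello messages are left out, and all names and addresses are hypothetical:

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass
class Router:
    name: str
    ip: IPv4Address
    priority: int = 100
    alive: bool = True


def elect_active(routers: list[Router]) -> Optional[Router]:
    """Pick the router that should own the virtual IP, or None if all are down."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    # Highest priority wins; the higher IP address breaks ties.
    return max(candidates, key=lambda r: (r.priority, r.ip))


if __name__ == "__main__":
    group = [
        Router("gw-a", IPv4Address("192.0.2.2"), priority=110),
        Router("gw-b", IPv4Address("192.0.2.3")),
    ]
    print("Active:", elect_active(group).name)
    group[0].alive = False  # simulate failure of the active router
    print("After failure:", elect_active(group).name)
```

Hosts keep pointing at the virtual IP throughout, which is why no routing changes are needed on them.
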