Disaster Recovery and High Availability Concepts

# Disaster Recovery Concepts

[[Managing Service and Data Availability]]

## High Availability

- Availability and high availability
  - Percentage uptime, for example available 99.99% of the time
  - Maximum Tolerable Downtime (MTD) -> the longest a business function can be unavailable; this sets the availability target
- Recovery metrics
  - Recovery Time Objective (RTO) -> how long it takes to bring a new system online
  - Work Recovery Time (WRT) -> how long it takes before the clients of the server can use it for work again
  - Recovery Point Objective (RPO) -> how much data is lost when a machine is lost, which depends on how often backups are taken

## Fault Tolerance and Redundancy

- Reliability metrics
  - Mean Time Between Failures (MTBF) -> expected operating time between failures of a repairable product
    - `operational time / # of failures`
  - Mean Time to Failure (MTTF) -> similar to MTBF, but used for non-repairable components
    - `operational time / # of devices`
  - Mean Time to Repair (MTTR) -> how long it takes to repair and recover from an incident
    - `unplanned maintenance time / # of incidents`
- Redundant system types
  - Hardware spares, network links, power systems, system and data backups, cluster services, etc.

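
The uptime percentage and the three reliability formulas above can be worked through with a short Python sketch. All input numbers are made-up example values, the function names are hypothetical, and the `availability_pct` relation (MTBF divided by MTBF plus MTTR) is a commonly used approximation that these notes do not state themselves.

```python
# Sketch only: every number used below is a made-up example value.

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(uptime_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in a period for a given percentage uptime."""
    return period_hours * (100 - uptime_pct) / 100


def mtbf(operational_hours: float, failures: int) -> float:
    """Mean Time Between Failures: operational time / number of failures."""
    return operational_hours / failures


def mttf(operational_hours: float, devices: int) -> float:
    """Mean Time to Failure: operational time / number of devices."""
    return operational_hours / devices


def mttr(unplanned_maintenance_hours: float, incidents: int) -> float:
    """Mean Time to Repair: unplanned maintenance time / number of incidents."""
    return unplanned_maintenance_hours / incidents


def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability from MTBF and MTTR (assumption, not from the notes)."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    budget = downtime_budget_hours(99.99)
    print(f"99.99% uptime allows {budget * 60:.1f} minutes of downtime per year")

    # Example: 500 total operating hours, 2 failures, 6 hours of repair over 3 incidents.
    between_failures = mtbf(500, 2)
    repair = mttr(6, 3)
    print(f"MTBF = {between_failures} h, MTTR = {repair} h")
    print(f"Approximate availability = {availability_pct(between_failures, repair):.2f}%")
```

With these example inputs, a 99.99% uptime target leaves a budget of roughly 52.6 minutes of downtime per year, which shows how quickly each extra "nine" tightens the RTO a design has to meet.
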
## Recovery Sites

- Alternate processing sites that will not be affected by the same disaster event
- Hot site
  - Failover in seconds or minutes
  - Live site
- Warm site
  - Failover in hours
  - May need to update databases and similar data first
- Cold site
  - Failover in days
  - Little more than chairs and power
- Cloud site
  - Transfer responsibilities to the cloud provider
  - Cannot transfer all the risk

## Facilities and Infrastructure Support

- Heating, ventilation, air conditioning (HVAC)
  - Temperature sensors and moisture detection sensors
  - Office areas vs datacenter/equipment rooms
- Fire suppression
  - Emergency procedures and alarms
  - Portable extinguisher usage
  - Sprinkler systems

## Power Management

- Spikes, surges, brownouts, and blackouts
- Power Distribution Unit (PDU)
  - Filter and stabilize grid power and facilitate remote monitoring
- Battery backups and uninterruptible power supplies (UPSs)
  - Battery-backed cache
  - UPS runtime
- Generator
  - Replacement for grid power
  - Must be used with a UPS
- Renewable power sources

## Network Device Backup Management

- Network appliance configuration backup
  - Startup vs running config
  - Version history and rollback
- Backup modes
  - State/bare metal
  - Config file
- Backing up/logging other state information

# High Availability Concepts

## Multipathing

- Multiple physical links between nodes
  - Routed internetwork
  - SAN multipathing
  - Multiple ISPs
- Diverse paths
  - Ensure physical separation of first-mile links to ISPs
  - Ensure independence of the ISPs' networks

## Link Aggregation/NIC Teaming

- Bundle multiple physical links into a single channel
  - Channel can use combined bandwidth of links
  - Channel redundancy against link failure
- IEEE 802.3ad/802.1AX
  - Link aggregation group (LAG)
  - Link Aggregation Control Protocol (LACP)

## Load Balancers

- Distribute client requests
  - Does no processing of the requests itself
  - Layer 4 -> decisions based on IP address and port (and packet priority)
  - Layer 7 -> decisions based on application-layer data; the layer where a WAF operates
- Place in front of a server farm or resource pool

## Redundant Hardware/Clusters

- Nodes that must share common data
- Virtual IP
  - External address for service shared by processing nodes
  - Common Address Redundancy Protocol (CARP)
- Active-passive clustering -> one node does the work while the other stands by
- Active-active clustering -> both nodes share responsibility

## First Hop Redundancy

- Provision multiple default gateways without complex routing on hosts
- Hot Standby Router Protocol (HSRP)
  - Routers share a common virtual IP and MAC
  - Standby group in which priority decides the active router and the standby router
- Virtual Router Redundancy Protocol (VRRP)
  - No specific standby router; all backup routers monitor the master
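
The priority-based behaviour of a first hop redundancy group can be illustrated with a simplified Python model. This is a sketch, not an implementation of HSRP or VRRP: the `Router` class, the `elect_active` function, the addresses, and the priorities are all hypothetical, and breaking ties by the higher interface IP address is an assumption of this sketch.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional

# Hypothetical virtual gateway address that hosts would use as their default gateway.
VIRTUAL_GATEWAY = IPv4Address("10.0.0.1")


@dataclass
class Router:
    name: str          # hypothetical label for the router
    ip: IPv4Address    # the router's own interface address
    priority: int      # higher priority is preferred
    alive: bool = True


def elect_active(group: List[Router]) -> Optional[Router]:
    """Pick the router that answers for the virtual gateway address.

    Simplified model: the highest-priority reachable router wins, and ties
    are broken by the higher interface IP (an assumption of this sketch).
    """
    candidates = [r for r in group if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.ip))


if __name__ == "__main__":
    group = [
        Router("r1", IPv4Address("10.0.0.2"), priority=110),
        Router("r2", IPv4Address("10.0.0.3"), priority=100),
    ]
    print(f"{VIRTUAL_GATEWAY} is served by {elect_active(group).name}")

    # Simulate a failure of r1: the remaining router takes over the virtual
    # address, so hosts keep the same default gateway setting.
    group[0].alive = False
    print(f"{VIRTUAL_GATEWAY} is served by {elect_active(group).name}")
```

The point of the model is that hosts only ever know the virtual address, so a router failure changes which device answers for it without any routing change on the hosts.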