Breaking Down Disaster (Part II): Architecture of Disaster Recovery Strategies on AWS

Disaster recovery (DR) strategies involve the architectural design of systems and processes to ensure that critical IT infrastructure and data can be recovered efficiently after a disruptive event. Here's an overview of common architecture elements in DR strategies on AWS:

Backup and restore
Pilot light
Warm standby
Multi-site active/active

Backup and restore

Backup and restore DR architecture

Implement regular backups of critical data, configurations, and system images. Utilise services like AWS Backup, Amazon RDS snapshots, or Amazon EBS snapshots for efficient backup and restore processes.

Backups are created in the same region as the source resources and are also copied to a second region, so that a regional disaster cannot destroy both copies. If the primary region fails, you restore the infrastructure and data in the recovery region. Infrastructure as Code (IaC) tools such as AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform let you deploy consistent infrastructure across regions.

Backup and restore has the longest recovery time objective (RTO) of the four strategies, but it is also the most cost-effective. You can shorten the RTO by automating the process, for example with a serverless workflow built on Amazon EventBridge and AWS Lambda.
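As a rough illustration of that automation, the sketch below shows the kind of logic a Lambda function could run when EventBridge reports a finished backup. The event shape, the `DR_REGION` value, and the `handle_backup_event` and `fake_copy` names are all assumptions made for this example, not the real AWS event schema or API. In a real function the injected `copy_snapshot` callable would wrap an AWS SDK call.

```python
from typing import Callable, Optional

# Hypothetical recovery region; replace with your own.
DR_REGION = "us-west-2"


def handle_backup_event(
    event: dict,
    copy_snapshot: Callable[[str, str], str],
) -> Optional[str]:
    """Copy a finished snapshot to the recovery region.

    `event` uses a simplified, hypothetical shape:
        {"detail": {"state": "COMPLETED", "snapshot_id": "snap-123"}}
    `copy_snapshot(snapshot_id, target_region)` is injected so the logic
    can be exercised without AWS credentials.
    Returns the ID of the copy, or None if the event is ignored.
    """
    detail = event.get("detail", {})
    if detail.get("state") != "COMPLETED":
        return None  # ignore running or failed backup jobs
    return copy_snapshot(detail["snapshot_id"], DR_REGION)


def fake_copy(snapshot_id: str, region: str) -> str:
    """Stand-in for a real SDK call; returns a made-up copy ID."""
    return f"{snapshot_id}-copy-{region}"
```

Injecting the copy function keeps the decision logic separate from the AWS call, which makes the handler easy to test locally before wiring it to a real event rule.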

Pilot light

Pilot light DR architecture

In the AWS context, the pilot light strategy for Disaster Recovery (DR) maintains a minimal but critical core of IT infrastructure that can be quickly scaled up in case of a disaster. The data is kept live (up to date or nearly up to date), but the services are idle.

By building on basic infrastructure elements such as Elastic Load Balancing and Amazon EC2 Auto Scaling, you can quickly scale up your workload and distribute traffic accordingly if a disaster happens. Again, you can use IaC to deploy the infrastructure required to get your workload up and running. Like a pilot light in a furnace that cannot heat your house until triggered, a pilot light strategy cannot process requests until it is triggered to deploy the remaining infrastructure.

Here's how it applies specifically to AWS:

Minimal Environment:
- The pilot light environment in AWS consists of essential components required for the core functionality of critical applications. This might include a minimal set of EC2 instances, databases, and other key services.
Data Replication:
- AWS provides services for efficient data replication. For example, Amazon RDS (Relational Database Service) can be configured for automated backups and snapshots to ensure data integrity. Amazon S3 can also be used for object storage with versioning.
Automation with IaC:
- AWS CloudFormation allows you to define and provision AWS infrastructure as code. Automation scripts can be created to quickly launch additional resources, replicate configurations, and scale up the environment during a DR event.
Scalability with Auto Scaling:
- AWS Auto Scaling can be utilized to automatically adjust the capacity of the pilot light environment based on demand. This ensures that the infrastructure scales up rapidly during a disaster to support the full application workload.
Cost Efficiency:
- The pilot light approach is cost-efficient in AWS since you are only paying for the minimal resources that are actively running during normal operations. Additional costs are incurred only when the environment is scaled up during a DR event.
Quick Recovery with AWS Lambda:
- AWS Lambda can be part of the automation process, allowing for serverless execution of code in response to events. This can be used to trigger actions during the failover process, contributing to quick recovery.
AWS Elastic Load Balancing (ELB):
- ELB can be employed to distribute incoming traffic across multiple instances, ensuring efficient load balancing. In the event of a disaster, ELB helps redirect traffic to the scaled-up resources.
Amazon CloudWatch for Monitoring:
- Amazon CloudWatch provides monitoring and alerting services. Regularly monitoring the pilot light environment ensures that any issues are detected promptly. Alarms can be set up to trigger automated responses.
Amazon Route 53 for DNS Management:
- Amazon Route 53 can be used for DNS management, allowing traffic to be redirected during a failover. This is crucial for maintaining continuity and ensuring a smooth transition.
AWS Security Measures:
- Security best practices are integral to the pilot light strategy in AWS. Utilize AWS Identity and Access Management (IAM) for access controls, encrypt data in transit and at rest using AWS Key Management Service (KMS), and follow AWS security guidelines.
Regular Testing with AWS CloudFormation StackSets:
- AWS CloudFormation StackSets can be used for testing and deploying infrastructure changes across multiple accounts and regions. Regularly testing the failover process ensures that the environment is ready for a real disaster.
Documentation on AWS Well-Architected Framework:
- Align the implementation with the AWS Well-Architected Framework, which provides best practices across multiple pillars; disaster recovery is covered under the Reliability pillar. Documentation should cover runbooks, procedures, and recovery plans.
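The elements above come together during failover as an ordered sequence: deploy the remaining infrastructure, scale it up, then redirect traffic. A minimal sketch of such a runbook runner follows. The `Step` and `run_failover` names and the step names in the usage example are hypothetical, and each action stands in for a real call (a stack deployment, a capacity change, a DNS update) that is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step; `action` returns True on success."""
    name: str
    action: Callable[[], bool]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed, so an operator
    can see exactly where the failover halted.
    """
    completed: List[str] = []
    for step in steps:
        if not step.action():
            break  # later steps depend on earlier ones, so do not continue
        completed.append(step.name)
    return completed


# Example usage with placeholder actions (hypothetical step names).
example_steps = [
    Step("deploy-remaining-stack", lambda: True),
    Step("scale-up-capacity", lambda: True),
    Step("switch-dns", lambda: True),
]
```

Stopping at the first failed step is a deliberate choice here: redirecting DNS to an environment that did not scale up would send production traffic to capacity that cannot serve it.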

The pilot light strategy in the AWS context allows organisations to achieve a balance between cost efficiency and rapid recovery, taking full advantage of the cloud's scalability and automation features.

Warm standby

Warm standby DR architecture

The warm standby strategy in the context of Disaster Recovery (DR) involves maintaining a secondary environment that is partially active and ready to take over operations quickly in the event of a disaster. This approach falls between the "hot standby" (fully active) and "cold standby" (completely inactive) strategies. In a warm standby setup, some components or services are running actively, reducing the time needed for a transition to a fully operational state.

This means that the warm standby deployment can handle requests, but at a reduced capacity. Before failover, the infrastructure must scale up to meet production traffic.
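The scale-up step can be made concrete with a little arithmetic. The sketch below assumes capacity is measured in identical instances and that the standby runs at a fixed fraction of production. The function name and both parameters are illustrative, not taken from any AWS API.

```python
import math


def instances_to_add(production_capacity: int, standby_fraction: float) -> int:
    """Return how many instances a warm standby must add before failover.

    production_capacity: instances needed to serve full production traffic.
    standby_fraction: share of that capacity already running (0.0 to 1.0).
    """
    if production_capacity < 0:
        raise ValueError("production_capacity must be non-negative")
    if not 0.0 <= standby_fraction <= 1.0:
        raise ValueError("standby_fraction must be between 0.0 and 1.0")
    # Round the running count up, because a fraction of an instance cannot run.
    running = math.ceil(production_capacity * standby_fraction)
    return production_capacity - running
```

For example, a workload that needs 20 instances in production and keeps a quarter of them running in the standby region has to add 15 instances before it can take full traffic.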

The warm standby strategy is suitable for organizations that require a balance between cost-effectiveness and quick recovery. It provides a middle ground where critical services are running actively but at a reduced capacity, enabling a rapid transition to full operation when needed.

Multi-site active/active

Multi-site active/active DR architecture

A multi-site active/active architecture refers to a setup where applications or services are deployed across multiple AWS regions or data centres, and all these regions are actively serving traffic simultaneously. This architectural approach is designed to enhance availability, reduce latency, and distribute the workload across different geographic locations.
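To show how the workload is distributed, the sketch below models weighted traffic distribution across regions, with unhealthy regions dropped and their share spread over the remaining ones. This is a simplified model written for this article, not the algorithm of any specific AWS service. The function name and the region weights are assumptions.

```python
from typing import Dict, Set


def traffic_shares(weights: Dict[str, int], healthy: Set[str]) -> Dict[str, float]:
    """Return the fraction of traffic each healthy region receives.

    weights: relative weight per region (higher means more traffic).
    healthy: regions currently passing health checks.
    Regions that are unhealthy or have zero weight receive no traffic.
    """
    active = {region: w for region, w in weights.items()
              if region in healthy and w > 0}
    total = sum(active.values())
    if total == 0:
        return {}  # no region can serve traffic
    return {region: w / total for region, w in active.items()}
```

With two equally weighted regions, each serves half of the traffic; if one fails its health check, the other receives all of it, so each region must be sized to absorb that extra load.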

In summary, a multi-site active/active architecture in AWS is a robust solution for organizations seeking high availability and fault tolerance. It maximizes the benefits of AWS's global infrastructure, allowing applications to remain accessible and responsive even during regional outages or disruptions. The architecture requires careful planning, security considerations, and ongoing management to ensure optimal performance.
