High Availability Approaches for Databases (RDS, Aurora)

Overview

The AWS Aurora architecture clearly separates storage from compute. Aurora's high availability features apply to the data in the DB cluster, so the data remains safe even if some or all of the DB instances in the cluster become unavailable. Other high availability features apply to the DB instances themselves and help ensure that at least one DB instance is always ready to handle database requests from your application.

What is High Availability?

What does it mean to say that a service is highly available? A service is highly available when it stays available to end users even when an unpredictable catastrophic event occurs, such as an outage that takes down an entire geography. Availability has two aspects: the proportion of time during which the service is accessible, and the time the system takes to respond to user requests. High availability therefore describes systems that deliver a high level of operational performance and quality over a significant period.

Whether a system counts as highly available depends on the percentage of uptime it offers. The most widely cited standard, and a difficult one to achieve, is 'five-nines availability', which means the system or product is available to users 99.999% of the time. The number of nines in that percentage expresses the level of availability: each additional nine cuts the permitted downtime by a factor of ten. Highly available systems are considered reliable because they continue operating even when critical components fail or malicious attacks occur, so users continue to get a seamless experience.

The table below shows how the number of nines affects the permitted downtime:

Availability	99%	99.9%	99.99%
Daily	14m 24s	1m 26.4s	8.6s
Weekly	1h 40m 48s	10m 4.8s	1m 0.5s
Monthly	7h 18m 17.5s	43m 49.7s	4m 23s
Yearly	3d 15h 39m 29.5s	8h 45m 57s	52m 35.7s

Consider a service with four nines of availability: it is available 99.99% of the time and is down for only about 52.6 minutes per year. Moving from 99.99% to 99.9% availability adds almost 8 hours (about 7 hours 53 minutes) of acceptable downtime per year, for a total of roughly 8 hours 46 minutes. At 99.9%, that corresponds to about 1 minute 26 seconds of downtime every day, or about 10 minutes 5 seconds every week. The difference between four nines and three nines may look like a matter of a single digit, but in practice it is a tenfold increase in acceptable downtime.
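
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function name `allowed_downtime` and the average year length of 365.2425 days are assumptions, chosen because they match the values shown in the table.

```python
from datetime import timedelta

# Average period lengths in days. The 365.2425-day year is an assumption
# chosen because it reproduces the figures in the table.
PERIOD_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": 365.2425 / 12,
    "yearly": 365.2425,
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the downtime budget for a given availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(days=PERIOD_DAYS[period] * unavailable_fraction)


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        print(f"{percent}% per year:", allowed_downtime(percent, "yearly"))
```

Calling `allowed_downtime(99.99, "yearly")` gives a budget of roughly 52.6 minutes, which matches the four-nines example in the text.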

Resilience is an important measure of high availability for a service. A resilient service is one that can handle a failure without service disruption or data loss, and can recover seamlessly from unpredictable failures.

Some of the elements that together make a system highly available are listed below:

Failover: The ability of a service to switch from an active system component to a redundant component when a failure, an imminent failure, performance degradation, or functionality degradation occurs.

Redundancy: Ensuring that each critical system component has an identical counterpart holding the same data, which can take over in case of downtime or failure.

Failback: The ability to switch back from the redundant component to the primary active component once the primary has recovered from the failure.

Monitoring: Continuous, regular checking of production systems to identify problems which, if left unchecked, may lead to disruption or degraded service.
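
How the four elements interact can be sketched with a toy model. This is not how Aurora or RDS implements them; the class names `Component` and `RedundantPair` and the single `monitor` pass are hypothetical, chosen only to show failover, redundancy, failback, and monitoring in one place.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy: bool = True


class RedundantPair:
    """A primary component with one identical standby (redundancy)."""

    def __init__(self, primary: Component, standby: Component) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary

    def monitor(self) -> Component:
        """One monitoring pass: fail over or fail back as health dictates."""
        if self.active is self.primary and not self.primary.healthy:
            if self.standby.healthy:
                self.active = self.standby  # failover to the redundant copy
        elif self.active is self.standby and self.primary.healthy:
            self.active = self.primary  # failback once the primary recovers
        return self.active


if __name__ == "__main__":
    pair = RedundantPair(Component("primary"), Component("standby"))
    pair.primary.healthy = False
    print(pair.monitor().name)  # the standby is now serving
    pair.primary.healthy = True
    print(pair.monitor().name)  # traffic returns to the primary
```

In a real system the monitoring pass would run continuously and use real health checks; here it is a single method call so that the state changes are easy to follow.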

High Availability for Aurora Data

AWS Aurora stores copies of the data in a DB cluster across multiple Availability Zones in a single AWS Region. Aurora stores these copies regardless of whether the DB instances in the cluster span multiple Availability Zones.

When data is written to the primary DB instance, Aurora synchronously replicates it across Availability Zones to six storage nodes associated with the DB cluster volume. This provides data redundancy and minimizes latency spikes during system backups. Running a DB instance with high availability can also improve availability during planned system maintenance, and it helps protect your databases against failure and Availability Zone disruption.

High Availability for Aurora DB Instances

For a cluster that uses single-master replication, once you have created the primary instance you can create up to 15 read-only Aurora Replicas. These Aurora Replicas are also known as reader instances.

In day-to-day operation, you can offload part of the work of read-intensive applications by creating reader instances and running your SELECT queries against them. If a problem affects the primary DB instance, one of the reader instances takes over as the primary instance; this mechanism is called failover. Many Aurora features apply to the failover mechanism. For example, if you connect through the DB cluster endpoint, the connection string stays the same when a failover promotes a reader to be the new primary instance, because the cluster endpoint always represents the current primary instance in the DB cluster.

For example, Aurora detects database problems and activates the failover mechanism automatically when necessary. Aurora also has features that reduce the time a failover takes to complete, which minimizes the time during which the database is unavailable for writing.

Note: Availability Zones (AZs) are distinct locations designed to be isolated from outages in other AZs. It is recommended to distribute the primary instance and the reader instances of a DB cluster across multiple Availability Zones. This improves the availability of the DB cluster, because an issue that affects one Availability Zone then does not cause an outage of the whole cluster.

You can set up a Multi-AZ cluster when you create the cluster, using the AWS Management Console, the AWS CLI, or the Amazon RDS API. To convert an existing AWS Aurora cluster into a Multi-AZ cluster, add a new reader instance and specify a different Availability Zone for it.
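
As a rough sketch of the second approach, the snippet below builds the parameters for adding a reader instance in a different Availability Zone through the Amazon RDS API. The parameter names are those of the RDS `CreateDBInstance` operation as exposed by boto3's `create_db_instance`; the helper names, identifiers, instance class, and engine value are illustrative assumptions and should be adapted to your cluster. The client is passed in as an argument so the sketch can be read and dry-run without AWS credentials.

```python
def reader_instance_params(cluster_id: str, instance_id: str,
                           availability_zone: str,
                           instance_class: str = "db.r6g.large",
                           engine: str = "aurora-mysql") -> dict:
    """Build keyword arguments for the RDS CreateDBInstance operation.

    Default instance class and engine are placeholders, not recommendations.
    """
    return {
        "DBClusterIdentifier": cluster_id,      # existing Aurora cluster
        "DBInstanceIdentifier": instance_id,    # name of the new reader
        "DBInstanceClass": instance_class,
        "Engine": engine,
        "AvailabilityZone": availability_zone,  # an AZ other than the writer's
    }


def add_reader(rds_client, **kwargs) -> dict:
    """Create the reader by passing the parameters to an RDS client.

    rds_client is expected to behave like boto3.client("rds").
    """
    return rds_client.create_db_instance(**reader_instance_params(**kwargs))
```

With boto3 installed and credentials configured, this could be used as `add_reader(boto3.client("rds"), cluster_id="my-cluster", instance_id="my-reader-2", availability_zone="us-east-1b")`, where all three values are placeholders.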

High Availability Across AWS Regions with Aurora Global Databases

If you need high availability across multiple AWS Regions, you can set up AWS Aurora global databases. Each Aurora global database spans multiple AWS Regions. Aurora automatically handles replicating all data and updates from the primary AWS Region to each of the secondary Regions. This enables low-latency global reads and disaster recovery from outages that affect an entire AWS Region.

Fault Tolerance for an Aurora DB Cluster

An AWS Aurora DB cluster is fault tolerant by design. The cluster volume spans multiple Availability Zones (AZs) in a single AWS Region, and each Availability Zone contains a copy of the cluster volume data. As a result, the DB cluster can tolerate the failure of an Availability Zone without any loss of data, with only a brief interruption of service.

If the primary instance fails and the DB cluster has one or more Aurora Replicas, an Aurora Replica is promoted to be the primary instance. The failure event causes a brief interruption during which read and write operations fail with an exception. Service is typically restored within 120 seconds, and often within 60 seconds. To increase the availability of the DB cluster, it is recommended to create at least one Aurora Replica in two or more distinct Availability Zones. Promoting an existing Aurora Replica is preferable to creating a new primary instance, because promotion is much faster.
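
Because reads and writes can fail with an exception during the brief interruption, an application that talks to the cluster directly usually retries for a bounded time. The following is a minimal sketch of such a retry loop, not a recommended production policy. The helper name `run_with_retry`, the use of Python's built-in `ConnectionError` as a stand-in for whatever exception your database driver raises, and the backoff values are all assumptions; the 120-second ceiling simply mirrors the recovery time mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(operation: Callable[[], T],
                   max_wait_seconds: float = 120.0,
                   initial_delay: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` with exponential backoff.

    Gives up (re-raising the last error) once the next pause would push the
    total time slept past `max_wait_seconds`.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        try:
            return operation()
        except ConnectionError:  # stand-in for a driver-specific exception
            if waited + delay > max_wait_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 30.0)  # cap individual pauses at 30 s
```

The `operation` callable would open a connection through the cluster endpoint and run the query, so that each retry reaches whichever instance is currently the primary.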

When the primary instance of a DB cluster that uses single-master replication fails, AWS Aurora automatically fails over in one of two ways:

Creating a new primary instance.
Promoting an existing Aurora Replica to be the new primary instance.

You can customize the order in which Aurora Replicas are promoted to the primary instance after a failure by assigning each replica a priority. Priorities range from '0' (the first priority) to '15' (the last priority). If the primary DB instance fails, AWS RDS promotes the Aurora Replica with the best priority (the lowest number) to be the new primary instance. You can modify the priority of an Aurora Replica at any time, and doing so does not trigger a failover. More than one Aurora Replica can share the same priority, which results in promotion tiers. If two or more replicas share the same priority, AWS RDS promotes the replica that is largest in size. If two or more replicas share the same priority and size, AWS RDS promotes an arbitrary replica in the same promotion tier.

Note: For AWS Aurora MySQL 2.10 and higher, availability during a failover is improved when the DB cluster has more than one reader DB instance. Aurora restarts only the writer DB instance and the failover target, while the other reader DB instances remain available and continue processing queries through the reader endpoint.
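
The promotion rules described above amount to a small selection procedure: lowest priority number first, then largest size, then an arbitrary pick. The sketch below illustrates that ordering only. The `Replica` class, the `size_rank` field (a hypothetical numeric stand-in for instance size), and the function name are assumptions, not part of any AWS API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    promotion_tier: int  # 0 (first priority) to 15 (last priority)
    size_rank: int       # hypothetical scale: larger number = larger instance


def choose_failover_target(replicas: list[Replica]) -> Replica:
    """Pick the replica to promote, following the rules described in the text."""
    if not replicas:
        raise ValueError("no Aurora Replica available to promote")
    # 1. Keep only the replicas in the best (lowest-numbered) promotion tier.
    best_tier = min(r.promotion_tier for r in replicas)
    in_tier = [r for r in replicas if r.promotion_tier == best_tier]
    # 2. Within that tier, keep only the largest replicas.
    largest = max(r.size_rank for r in in_tier)
    candidates = [r for r in in_tier if r.size_rank == largest]
    # 3. If several remain, the choice is arbitrary.
    return random.choice(candidates)
```

For instance, given a tier-1 replica and two tier-0 replicas of different sizes, the function returns the larger of the two tier-0 replicas.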

High Availability with Amazon RDS Proxy

If you want your application to tolerate database failures transparently, AWS RDS Proxy is a good fit. With it, you do not need to write complex failure-handling code in the application. RDS Proxy bypasses Domain Name System (DNS) caches, which reduces failover times by up to 66% for Aurora Multi-AZ databases. Application connections are preserved because RDS Proxy automatically routes traffic to the new database instance.

Aurora High Availability Strategies

Within the AWS RDS suite, AWS Aurora is a fully managed relational database management system (RDBMS), offered as a PaaS service, that is compatible with two engines: MySQL and PostgreSQL. How does Aurora affect high availability strategies? The answer starts with the fact that Aurora keeps the storage engine and the database engine separate, following the Separation of Concerns principle.

Separation of Concerns Principle: An RDBMS that runs on general-purpose hardware must implement its features within the limitations imposed by the operating system and the hardware. These features include ACID-compliant transactions, replication, DML and DDL processing, high availability (HA), and fault tolerance. When the implementation runs only in an environment designed specifically for a database, the responsibilities can be separated into layers, which lets the database engine and the storage engine each stay focused and specialized. The result is high availability along with improved performance: up to 5x the performance of standard MySQL and up to 3x that of standard PostgreSQL.

How, then, do a specialized database engine and storage engine deliver high availability? There are several ways, discussed in this section. First, consider how the AWS Aurora storage engine derives high availability from this separation.

Once data is written to the AWS Aurora storage engine, the storage engine ensures that it is written consistently, correctly, and durably. The data is written to two locations in each of three Availability Zones, for a total of six copies. The storage engine takes care of all of this complexity. The database engine can therefore take a fire-and-forget approach and does not have to worry about whether the data was written, about transaction logs, or about a possible recovery.

The following sections look at how separating storage from the database engine enables each of the high availability strategies:
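
The effect of the six-copy layout can be illustrated with a small model. The article itself does not state how many copies must acknowledge an operation; the quorum sizes used below (4 of 6 for writes, 3 of 6 for reads) follow AWS's published description of Aurora storage and should be read as an assumption of this sketch, as should the Availability Zone names and function names.

```python
AZS = ("az-a", "az-b", "az-c")  # hypothetical Availability Zone names
COPIES_PER_AZ = 2               # two copies in each of three AZs = six copies
WRITE_QUORUM = 4                # assumption: 4 of 6 copies acknowledge a write
READ_QUORUM = 3                 # assumption: 3 of 6 copies are needed to read


def surviving_copies(failed_azs: set[str], extra_failed_copies: int = 0) -> int:
    """Count the copies still reachable after AZ and single-copy failures."""
    healthy_azs = [az for az in AZS if az not in failed_azs]
    return max(0, len(healthy_azs) * COPIES_PER_AZ - extra_failed_copies)


def can_write(failed_azs: set[str], extra_failed_copies: int = 0) -> bool:
    return surviving_copies(failed_azs, extra_failed_copies) >= WRITE_QUORUM


def can_read(failed_azs: set[str], extra_failed_copies: int = 0) -> bool:
    return surviving_copies(failed_azs, extra_failed_copies) >= READ_QUORUM


if __name__ == "__main__":
    print(can_write({"az-a"}))     # a whole AZ lost: four copies remain
    print(can_write({"az-a"}, 1))  # an AZ plus one more copy lost
    print(can_read({"az-a"}, 1))   # reads still have three copies
```

Under these assumptions, losing an entire Availability Zone leaves four copies, which is why the cluster can keep accepting writes without data loss.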

Read Replicas

Read Replicas in AWS Aurora have read-only access to the same storage that the master uses. Because of this, replication latency is very low: little time passes between data being written and it becoming visible to an Aurora Read Replica. When multiple Read Replicas are needed, they all point to the same data, which removes the overhead and complexity that replication would otherwise place on the master. An Aurora Read Replica can also take over quickly when the master fails, and AWS then replaces the failed instance to restore the provisioned capacity. This works well in the common scenario where the read-to-write ratio is consistently high. To improve write availability, it is recommended to provision a Read Replica with capacity equivalent to the master's and give it the highest failover priority, so that it becomes the primary instance on failure and both read and write capacity are maintained.

Auto Scaling

Another effective high availability strategy is Auto Scaling with AWS Aurora. Auto scaling is the ability to add instances as traffic grows (horizontal scale-out) and to remove instances from the set of available servers when traffic falls (scale-in). The concept is well known from AWS EC2 instances, but the same elasticity is available for AWS Aurora Read Replicas. You can configure a minimum number of Read Replicas to handle your base load, and define auto scaling policies that add or remove instances as traffic rises and falls.

Note: This elasticity is possible because the storage engine is kept separate from the database engine.

Consider a B2B e-commerce site as an example. Traffic increases during business hours on weekdays and drops off overnight and on weekends. With a high read-to-write ratio, images and content descriptions can be distributed locally and benefit from both Auto Scaling and Read Replicas. The Aurora Read Replicas scale out when traffic increases to meet demand, and scale in during off hours to save on costs.
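
The scaling decision in such a policy can be sketched as a simple calculation that keeps a metric near a target while respecting the configured minimum and maximum. This is a simplified illustration of the idea, not the algorithm AWS uses. The function name, the choice of average CPU as the metric, and the 60% target are assumptions; the maximum of 15 reflects the replica limit mentioned earlier in the article.

```python
import math


def desired_replica_count(current_replicas: int,
                          average_cpu_percent: float,
                          target_cpu_percent: float = 60.0,
                          min_replicas: int = 1,
                          max_replicas: int = 15) -> int:
    """Estimate how many Read Replicas keep average CPU near the target."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    # Scale the replica count in proportion to how far the metric is
    # from its target, rounding up so capacity is never underestimated.
    raw = math.ceil(current_replicas * average_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds: the minimum covers the base load.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replica_count(2, 90.0))  # weekday peak: scale out
    print(desired_replica_count(4, 15.0))  # overnight lull: scale in
```

In the e-commerce example, the weekday peak would push the count up, and the overnight lull would bring it back down to the configured minimum.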

Cross-Region Replication

When a scenario demands an even higher level of availability, consider cross-region replication, which replicates the AWS Aurora cluster to another Region. With cross-region replication, the data can be read locally in the secondary Region and can be used if the primary Region fails. The instance in the secondary Region is treated as a Read Replica. It is not a multi-master replica, but it can have Read Replicas of its own, and it can be promoted to master when the primary Region fails.

Cross-region replication suits cases where the data is too important to the organization to be stored in a single geographic region. The motivation may be business continuity, regulatory requirements, or financial considerations. Another reason for cross-region replication is reduced latency between the database backend and end users. In the e-commerce example, changes to images and descriptions are made on the master in a single Region, while the service is distributed globally so that reads are served locally.

External Replication

Some scenarios require the data outside of the AWS Aurora cluster, for example when reports must be run and the data processed elsewhere. This case can be handled with external replication.

With external replication, AWS Aurora replicates the data to an AWS EC2 instance running MySQL, a static MySQL instance on AWS EC2, or a MySQL instance running in the corporate data center. This one-way replication makes the data available outside AWS Aurora and keeps it in sync with the Aurora cluster.

Serverless

One of the significant developments for both high availability and scalability is the serverless option of AWS RDS. Serverless does not mean there are no servers; it means that you, as the user, do not have to deal with provisioning, configuring, scaling, or maintaining them. Aurora Serverless adds a proxy layer on top of the existing database engine and storage engine layers.

When the first query arrives, the proxy fleet (proxy servers maintained by AWS) receives the request and requests an instance from the warm pool (DB capacity kept ready to serve requests). The instance is allocated and the request is forwarded to it. The instances are elastic and scale in and out with demand and traffic spikes, within limits that you specify in the Aurora Serverless configuration.

RDS High Availability Strategy

Let us briefly discuss the high availability strategies that AWS RDS offers through Multi-AZ deployments.

Multi-AZ deployments have either one or two standby DB instances. A deployment with one standby DB instance is called a Multi-AZ DB instance deployment; it offers failover support, but the standby does not serve read traffic. A deployment with two standby DB instances is called a Multi-AZ DB cluster deployment; it offers failover support, and the standbys can also serve read traffic. To see which kind of deployment you have, open the AWS Management Console, choose Databases, and then choose a DB identifier.

In the console, a Multi-AZ DB instance deployment has the following characteristics:

There is only one row for the DB instance.
The Role value is Instance or Primary.
The Multi-AZ value is Yes.

In the console, a Multi-AZ DB cluster deployment has the following characteristics:

There is a cluster-level row with three DB instance rows under it.
For the cluster-level row, the Role value is Multi-AZ DB cluster.
For each instance-level row, the Role value is Writer instance or Reader instance.
For each instance-level row, the Multi-AZ value is 3 Zones.
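
The two sets of characteristics can be turned into a simple check. The sketch below works on rows transcribed by hand from the Databases list; the dictionary keys `role` and `multi_az` and the function name are hypothetical and do not correspond to an AWS API response.

```python
def classify_deployment(rows: list[dict]) -> str:
    """Classify a deployment from its rows in the Databases list."""
    roles = [row["role"] for row in rows]

    # One row, Role is Instance or Primary, Multi-AZ is Yes.
    if (len(rows) == 1
            and roles[0] in ("Instance", "Primary")
            and rows[0]["multi_az"] == "Yes"):
        return "Multi-AZ DB instance deployment"

    # A cluster-level row with three instance-level rows under it.
    if "Multi-AZ DB cluster" in roles:
        instance_rows = [
            row for row in rows
            if row["role"] in ("Writer instance", "Reader instance")
        ]
        if (len(instance_rows) == 3
                and all(row["multi_az"] == "3 Zones" for row in instance_rows)):
            return "Multi-AZ DB cluster deployment"

    return "Not recognized as a Multi-AZ deployment"
```

A single row with Role `Primary` and Multi-AZ `Yes` is classified as a Multi-AZ DB instance deployment, while a cluster row followed by one writer and two reader rows is classified as a Multi-AZ DB cluster deployment.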

Conclusion

You can offload part of the work of read-intensive applications by creating AWS Aurora reader instances and running your SELECT queries against them to retrieve the data you need.
If you want your application to tolerate database failures transparently, AWS RDS Proxy lets you do so without writing complex failure-handling code.
When data is written to the primary DB instance, Aurora synchronously replicates it across Availability Zones to six storage nodes associated with the cluster volume. This provides data redundancy and minimizes latency spikes during system backups.
You can set up a Multi-AZ cluster when you create the cluster, using the AWS Management Console, the AWS CLI, or the Amazon RDS API. To convert an existing AWS Aurora cluster into a Multi-AZ cluster, add a new reader instance and specify a different Availability Zone for it.
