Aurora Fault Tolerance and Self-Healing

Overview

AWS Aurora storage is both fault-tolerant and self-healing. Aurora automatically divides the database volume into segments spread across many disks, and each segment is replicated six ways across three Availability Zones. Aurora transparently handles the loss of up to two copies of data without affecting write availability, and the loss of up to three copies without affecting read availability. The storage is also self-healing: data blocks and replicated disks are continuously and automatically scanned for errors that could affect the system later, and any error that is found is repaired automatically.

What Do Fault Tolerance and Self-Healing Mean?

Before we understand how AWS Aurora offers fault tolerance as well as a self-healing mechanism, let us briefly understand what fault tolerance and self-healing mean.

Fault Tolerance: Fault tolerance can be seen as a stronger form of high availability. A fault-tolerant system stays operational, with zero downtime and no data loss, when a failure occurs. There are several ways to achieve this; the most common on AWS is to host or replicate instances across two or more independent sets of servers. The goal is a seamless experience for users, with hosted applications that never go down. A fault-tolerant architecture is more complex to design because it needs higher redundancy to survive an unexpected fault in any one of its components.

A fault-tolerant architecture can tolerate the failure of a component with no impact on performance, no data loss, and no system crash. The caveat is cost: an organization that implements a fault-tolerant system pays the operating expenses of keeping it running on redundant resources.

Self-Healing: An architecture or service is self-healing when it keeps its operations running with minimal downtime and no human intervention. Auto-healing is a temporary measure: it keeps the service operational until resources are available to address the root cause of the failure.

AWS Aurora storage is both fault-tolerant and self-healing. Each 10 GB segment of the database volume is automatically replicated six ways across three Availability Zones. Aurora is fault-tolerant because it transparently handles the loss of up to two copies of data without affecting write availability, and the loss of up to three copies without affecting read availability. It is self-healing because data blocks and disks are continuously scanned for errors, which are then corrected automatically.
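The availability rule above can be expressed as a small quorum check. The following is a minimal sketch, not Aurora's implementation: the quorum sizes are derived from the numbers in the text (six copies, writes tolerate two losses, reads tolerate three), and the names `availability`, `WRITE_QUORUM`, and `READ_QUORUM` are hypothetical.

```python
from dataclasses import dataclass

TOTAL_COPIES = 6   # each segment is replicated six ways
WRITE_QUORUM = 4   # assumption: 6 - 2, since writes tolerate two lost copies
READ_QUORUM = 3    # assumption: 6 - 3, since reads tolerate three lost copies


@dataclass(frozen=True)
class Availability:
    writable: bool
    readable: bool


def availability(failed_copies: int) -> Availability:
    """Return read/write availability of one segment for a number of lost copies."""
    if not 0 <= failed_copies <= TOTAL_COPIES:
        raise ValueError("failed_copies must be between 0 and 6")
    healthy = TOTAL_COPIES - failed_copies
    return Availability(writable=healthy >= WRITE_QUORUM,
                        readable=healthy >= READ_QUORUM)


if __name__ == "__main__":
    for failed in range(TOTAL_COPIES + 1):
        print(failed, availability(failed))
```

With two copies lost the segment stays readable and writable, with three lost it is read-only, and with four or more lost it is unavailable in this model.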

Aurora Fault Tolerance

When we say AWS Aurora fault tolerance, we are basically talking about how resilient AWS Aurora is.

The AWS global infrastructure is built around AWS Regions and Availability Zones. Each AWS Region provides multiple physically separated and isolated Availability Zones, which are connected by low-latency, high-throughput, and highly redundant networking. With multiple Availability Zones, you can design and operate databases and applications that fail over automatically between zones without interruption.

Availability Zones are more fault-tolerant, highly available, and scalable than traditional single or multiple data center infrastructures. On top of this, AWS Aurora provides features that support your data resiliency and backup requirements and so enhance fault tolerance.

Backup and Restore

Backing up important data keeps it protected when an unexpected failure occurs. AWS Aurora backs up the DB cluster volume automatically and retains restore data for the length of the backup retention period. You can set the backup retention period, from 1 to 35 days, when you create or modify a DB cluster. Aurora retains incremental restore data for the entire backup retention period. If you need to keep a backup beyond the retention period, you can take a DB snapshot of the data in the DB cluster volume. A snapshot lets you keep the data for as long as you need it, and you can create a new DB cluster from it at any time.

Because Aurora backups are continuous and incremental, you can quickly restore to any point within the backup retention period. You might wonder what happens while a backup is in progress. There is no performance impact and no interruption of database service while backup data is being written.

To recover your data, you can create a new AWS Aurora DB cluster from the backup data that Aurora retains, or from a DB cluster snapshot that you have saved. You can create a new copy of a DB cluster from backup data at any point in time within the backup retention period. Because backups are continuous and incremental, you do not need to take frequent snapshots of your data to improve restore times.
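The retention rules above can be sketched as two small checks. This is an illustrative model only: the function names `validate_retention` and `in_restore_window` are hypothetical, and the model assumes the restorable window runs from "now minus the retention period" up to "now", which ignores any short lag before the latest restorable time.

```python
from datetime import datetime, timedelta, timezone

MIN_RETENTION_DAYS = 1
MAX_RETENTION_DAYS = 35


def validate_retention(days: int) -> int:
    """Accept only retention periods in the documented 1-35 day range."""
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError("backup retention period must be 1 to 35 days")
    return days


def in_restore_window(target: datetime, now: datetime, retention_days: int) -> bool:
    """Return True if `target` can be reached by a point-in-time restore."""
    validate_retention(retention_days)
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= now


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    print(in_restore_window(now - timedelta(days=6), now, 7))  # inside the window
    print(in_restore_window(now - timedelta(days=8), now, 7))  # needs a snapshot instead
```

A target that falls outside the window is the case where a previously saved DB snapshot is needed.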

Replication

Now let us take a deep dive into how replication helps fault tolerance in AWS Aurora. As we studied, Aurora Replicas are a good choice for scaling read operations and for increasing availability. Aurora Replicas are independent endpoints in an Aurora DB cluster. The DB cluster volume is made up of multiple copies of the data for the DB cluster. A DB cluster can have up to 15 Aurora Replicas, which can be distributed across the Availability Zones that the cluster spans within an AWS Region.

However, the primary DB instance and the Aurora Replicas in the DB cluster see the data in the cluster volume as a single, logical volume. If the primary DB instance fails, an Aurora Replica is promoted to be the new primary DB instance. AWS Aurora also supports replication options that are specific to Aurora MySQL and Aurora PostgreSQL.
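The replica limit and the promotion step can be illustrated with a toy cluster model. This is a sketch under stated assumptions, not how Aurora selects a replica: the `Cluster` class is hypothetical, and promoting the first replica in the list is a simplification chosen only to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import List

MAX_REPLICAS = 15  # a DB cluster can have up to 15 Aurora Replicas


@dataclass
class Cluster:
    writer: str
    replicas: List[str] = field(default_factory=list)

    def add_replica(self, name: str) -> None:
        """Add a reader, enforcing the 15-replica limit."""
        if len(self.replicas) >= MAX_REPLICAS:
            raise ValueError("a cluster supports at most 15 Aurora Replicas")
        self.replicas.append(name)

    def fail_over(self) -> str:
        """Promote a replica to writer after the current writer fails.

        Assumption: the first replica is chosen; the failed writer
        simply leaves the cluster in this model.
        """
        if not self.replicas:
            raise RuntimeError("no Aurora Replica available to promote")
        self.writer = self.replicas.pop(0)
        return self.writer


if __name__ == "__main__":
    cluster = Cluster(writer="instance-1")
    cluster.add_replica("instance-2")
    cluster.add_replica("instance-3")
    print(cluster.fail_over(), cluster.replicas)
```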

Failover

We define failover as the process of redirecting traffic from the primary server to a secondary server when the primary fails, so that users continue to have a seamless experience.

AWS Aurora stores multiple copies of the data in a DB cluster across Availability Zones within a single AWS Region. It stores these copies regardless of whether the DB instances in the DB cluster span multiple Availability Zones. Aurora provisions these copies automatically and maintains them synchronously: when data is written to the primary DB instance, it is synchronously replicated across the Availability Zones to the storage copies behind the cluster volume. This provides data redundancy, prevents I/O freezes, and minimizes latency spikes during system backups. Running a DB cluster with high availability also improves availability during planned system maintenance, and it helps protect your databases against failures and Availability Zone disruptions.

Database Cloning

A clone in AWS Aurora is created with the copy-on-write protocol, so creating the initial clone needs very little additional space. When the clone is first created, Aurora keeps a single copy of the data, which is used by both the source AWS Aurora DB cluster and the new (cloned) AWS Aurora DB cluster. Additional storage is allocated only when changes are made to the data, on either the source Aurora DB cluster or the clone. You can create more than one clone from the same AWS Aurora DB cluster, and you can also create clones from other clones. Cloning is useful when you want to set up test environments that use your production data without any risk of data corruption.

Some common uses of database cloning are listed below:

Experimenting with potential changes, such as schema or parameter group changes, to assess their impact.
Creating a copy of a production DB cluster for testing or other purposes.
Running workload-intensive operations on the clone, such as exporting critical data or running analytical queries in parallel.

Once the clone is created, you can configure its Aurora DB instances differently from those of the source Aurora DB cluster to suit your requirements. When you have finished using the cloned cluster, you can delete it at any time.

For example, a clone used only for testing might not need the same configuration as the source production Aurora DB cluster. In that case, you can configure the clone with a single Aurora DB instance rather than the multiple DB instances used by the source Aurora DB cluster.
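The copy-on-write behavior described in this section can be sketched with a shared page pool. This is a simplified model, not Aurora's storage layer: the `PageStore` and `Volume` classes are hypothetical, pages are plain byte strings, and old pages are never reclaimed.

```python
class PageStore:
    """Shared pool of immutable physical pages, keyed by an integer id."""

    def __init__(self):
        self._pages = {}
        self._next_id = 0

    def put(self, data: bytes) -> int:
        page_id = self._next_id
        self._pages[page_id] = data
        self._next_id += 1
        return page_id

    def get(self, page_id: int) -> bytes:
        return self._pages[page_id]

    def __len__(self) -> int:
        return len(self._pages)


class Volume:
    """A cluster volume: a table mapping logical page numbers to physical pages."""

    def __init__(self, store: PageStore, page_table=None):
        self.store = store
        self.page_table = dict(page_table or {})

    def write(self, page_no: int, data: bytes) -> None:
        # A write allocates a new physical page, so any other volume that
        # still points at the old page is unaffected.
        self.page_table[page_no] = self.store.put(data)

    def read(self, page_no: int) -> bytes:
        return self.store.get(self.page_table[page_no])

    def clone(self) -> "Volume":
        # Only the page table (metadata) is copied; the pages are shared.
        return Volume(self.store, self.page_table)


if __name__ == "__main__":
    store = PageStore()
    source = Volume(store)
    source.write(0, b"orders")
    source.write(1, b"customers")
    clone = source.clone()
    print(len(store))            # cloning allocated no new pages
    clone.write(1, b"customers-test")
    print(len(store), source.read(1), clone.read(1))
```

Creating the clone adds no pages to the store; storage grows only when the source or the clone changes a page, and a clone of a clone works the same way.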

Self-Healing (Aurora)

As described earlier, an architecture or service is self-healing when it keeps its operations running with minimal downtime and no human intervention. Amazon Aurora storage is self-healing in that the data blocks and disks replicated across Availability Zones are continuously and automatically scanned for errors that might affect the system in the future. When such an error is found, it is repaired automatically, which makes Aurora a reliable AWS service.

Because Aurora is durable, reliable (self-healing), and fault-tolerant, you can further improve the availability of your AWS Aurora DB cluster by adding Aurora Replicas, placing them in different Availability Zones, and using Aurora's automatic features. Together these make AWS Aurora a one-stop reliable database solution.

We shall now take a deep dive into the three topics listed below to understand the self-healing concept of AWS Aurora in more detail.

Storage auto-repair
Survivable page cache
Crash recovery

Storage Auto-Repair

Because AWS Aurora maintains multiple copies of the data in the DB cluster across three Availability Zones, the chance of losing data to an unexpected disk failure is greatly reduced. Aurora automatically detects failures in the disk volumes that make up the DB cluster volume. When a segment of a disk volume fails, Aurora repairs it automatically, using the data in the other volumes that make up the cluster volume to make sure that the data in the repaired segment is current. As a result, Aurora helps prevent data loss and reduces the need to perform a point-in-time restore to recover from a disk failure.
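The detect-and-repair loop can be sketched with checksummed copies of a segment. This is an illustration under assumptions, not Aurora's actual mechanism: the `SegmentCopy` class and `scan_and_repair` function are hypothetical, and a CRC-32 recorded at write time stands in for whatever error detection the storage layer really uses.

```python
import zlib
from typing import List


class SegmentCopy:
    """One replicated copy of a segment, with a checksum recorded at write time."""

    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

    def is_healthy(self) -> bool:
        # The copy is healthy if its data still matches the recorded checksum.
        return zlib.crc32(self.data) == self.crc


def scan_and_repair(copies: List[SegmentCopy]) -> int:
    """Overwrite corrupted copies from a healthy peer; return how many were repaired."""
    healthy = next((c for c in copies if c.is_healthy()), None)
    if healthy is None:
        raise RuntimeError("no healthy copy left to repair from")
    repaired = 0
    for copy in copies:
        if not copy.is_healthy():
            copy.data = healthy.data
            copy.crc = healthy.crc
            repaired += 1
    return repaired


if __name__ == "__main__":
    copies = [SegmentCopy(b"page contents") for _ in range(6)]
    copies[2].data = b"page c0ntents"   # simulate silent corruption on one disk
    print(scan_and_repair(copies), all(c.is_healthy() for c in copies))
```

Because the repair reads from a peer copy, no point-in-time restore is needed in this model as long as at least one copy is intact.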

Survivable Page Cache

In AWS Aurora, each DB instance's page cache (known as the InnoDB buffer pool in AWS Aurora MySQL and as the buffer cache in AWS Aurora PostgreSQL) is managed in a process separate from the database, which allows the page cache to survive independently of the database.

If the database fails, the page cache remains in memory, which keeps current data pages "warm" in the page cache when the database restarts. This gives a performance gain, because the initial queries do not need to perform read I/O operations to warm up the page cache.

The following page cache behavior can be observed for Aurora MySQL:

No page caches are lost when the writer DB instance reboots but the reader DB instances do not.
The reader DB instances lose their page caches if they reboot when the writer DB instance reboots.
When only a reader DB instance reboots, the page caches on both the writer and reader DB instances survive.
A DB cluster failover has a similar effect to a writer instance reboot. On the new writer DB instance (previously a reader DB instance), the page cache survives. On the reader DB instance that was previously the writer, the page cache does not survive.

For versions earlier than 2.10: When the writer DB instance reboots, the page cache on the writer instance survives, whereas the reader DB instances lose their page caches.

For versions 2.10 and higher: The writer DB instance can be rebooted without rebooting the reader DB instances.
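The Aurora MySQL reboot rules for reader instances can be condensed into a small decision function. This is only a restatement of the rules in this section as code, not an AWS API: the function name and its parameters are hypothetical, and failover is not modeled.

```python
def reader_cache_survives(writer_reboots: bool,
                          reader_reboots: bool,
                          version_2_10_or_higher: bool = True) -> bool:
    """Return True if a reader DB instance keeps its page cache."""
    if not writer_reboots:
        # Only the reader reboots (or nothing does): the cache survives.
        return True
    if not version_2_10_or_higher:
        # Earlier versions: a writer reboot costs the readers their caches.
        return False
    # 2.10 and higher: the reader keeps its cache unless it reboots too.
    return not reader_reboots


if __name__ == "__main__":
    for writer in (False, True):
        for reader in (False, True):
            print(writer, reader, reader_cache_survives(writer, reader))
```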

Now consider AWS Aurora PostgreSQL. Here you can use DB cluster cache management to preserve the page cache of a designated reader DB instance, so that the cache is already warm when that reader is promoted to writer DB instance after a failover.

Crash Recovery

As studied so far, AWS Aurora is designed to recover from a crash almost instantaneously and continue to serve application data, and it does not need binary logging to do so. Crash recovery runs asynchronously on parallel threads, so your database is open and available immediately after a crash.

Listed below are the key considerations for binary logging as well as crash recovery when working with AWS Aurora MySQL:

Enabling binary logging directly affects the recovery time after a crash, because it forces the DB instance to perform binary log recovery.
The type of binary logging used affects the size and efficiency of logging. For the same amount of database activity, some formats write more information to the binary logs than others.
The more data is written to the binary logs, the more data the DB instance must process during recovery, which increases the recovery time.
Aurora does not need the binary logs to replicate data within a DB cluster or to perform a point-in-time restore (PITR), so binary logging can be skipped for those purposes.

Conclusion

AWS Aurora provides features that support your data resiliency and backup requirements and so enhance fault tolerance.
AWS Aurora automatically provisions copies of the cluster data and maintains them synchronously across the Availability Zones.
You can take a DB snapshot of the data in a DB cluster volume. Aurora also retains incremental restore data for the entire backup retention period.
AWS Aurora storage is self-healing: data blocks and replicated disks are continuously and automatically scanned for errors that could affect the system later, and any error that is found is repaired automatically.
AWS Aurora is designed to recover from a crash almost instantaneously and continue to serve application data. Crash recovery runs asynchronously on parallel threads, so your database is open and available immediately after a crash.