Highly Available Databases

Availability in a database system refers to the availability of services to database clients. High availability places more stringent demands on the system -- requiring minimum levels of service and robustness in the face of failures. A wide range of applications need higher levels of availability from their DBMSs, including real-time and embedded systems, web applications and other types of online systems.

Non-database servers often treat availability as up-time, but database servers have a deeper concern: the data itself. A database server has to ensure that its data is up-to-date and available for all clients. The integrity of the database must be intact with all committed changes applied, and there can be no loss of data security.

A robust DBMS can handle normal failures (such as power loss) and preserve data integrity, at the cost of database availability during recovery. However, high availability demands may necessitate recovery from more severe failures, such as media (disk) failure and network failure. Some situations need reduced or no downtime for recovery from failures.

High availability requirements include,

Fail Points

A fail point is a component in the system that can fail independently of other components. The major fail points in a database server are the server itself, the physical database and the communications link.

Each component has its own type of failure or service reduction and thus specific remedies. Some solutions only support recovery and service maintenance for certain problems in a specific component; others deal with scenarios involving multiple components.

Solutions

There are external solutions for high availability that don't require additional features in the server itself. These include better performing and more reliable software and hardware (such as RAID arrays). Some areas are essential, such as a robust operating system. Of course, the database server must be able to take advantage of the improved sub-systems. A pure Java system is most likely to accomplish this.

In the general case, external solutions for high availability only solve part of the problem. Complete or robust solutions necessitate built-in capabilities for the database server itself. This article discusses server-implemented technologies to sustain service levels and to recover from one or more fail points. We will look at three: Online Backup, Replication and Fault Recovery.

Online Backup

Online Backup is a server implemented technology that protects against media (disk) failure by maintaining a change log on a separate device. Online Backup provides continuous backup that is always active. The purpose of this facility is to protect the physical database against loss of information due to permanent or transient errors in the storage media. This would include head crashes, miswrites and unreadable or unreliable media, as well as more catastrophic events.

Online Backup provides a solution for a single fail-point -- the physical database. A number of failure events can render the live database unusable through loss of data availability. Online Backup provides a way to restore a physical database to its current state.

Offline backup makes a snapshot of the current physical database, backing it up to an independent device. If a major failure occurs in the storage media of the physical database, the snapshot backup is used to restore the physical database. This restores the database to its previous state (its state when the backup was made), losing any changes made since the snapshot.

Online Backup solves the problem of lost changes in offline backup by providing a continuous backup that is always active. Like offline backup, Online Backup begins by taking a snapshot backup of the current database. During subsequent runs, the DBMS logs all committed changes to a roll forward journal on an independent device. Each run appends its changes to the journal.

To ensure full synchronization between the journal and the database, the DBMS flushes all changes to the journal as part of transaction commit. The data in the roll forward journal reflects the current (committed) state of the live database, during runs and between runs. A recovery process can return the database to its current state by restoring the backup snapshot and applying the changes in the roll forward journal.

When the database is rendered unusable (loss of data availability), the online backup facility can restore the database to its current state.

Like offline backup, restore for Online Backup begins by restoring the database from the backup copy. This returns the active database to its initial state. Restore for Online Backup then processes the roll forward journal and applies all committed changes made since the initial backup. This returns the active database to its latest state.
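The snapshot-plus-journal cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name JournaledDatabase and its methods are hypothetical, the "database" is an in-memory dictionary, and the snapshot and journal are plain Python objects standing in for files on an independent device. It is not the interface of any particular DBMS.

```python
import copy


class JournaledDatabase:
    """Toy key-value store with a snapshot backup and a roll-forward journal.

    Hypothetical illustration: a real server would keep the snapshot and
    the journal on a device separate from the live database.
    """

    def __init__(self):
        self.data = {}      # the live "physical database"
        self.snapshot = {}  # backup copy taken at the start of a cycle
        self.journal = []   # committed changes made since the snapshot

    def take_snapshot(self):
        # Begin a backup cycle: copy the database and start an empty journal.
        self.snapshot = copy.deepcopy(self.data)
        self.journal = []

    def commit(self, changes):
        # Log the changes as part of commit, then apply them, so the journal
        # always reflects the committed state of the live database.
        self.journal.append(dict(changes))
        self.data.update(changes)

    def restore(self):
        # Step 1: return the database to its state at snapshot time.
        self.data = copy.deepcopy(self.snapshot)
        # Step 2: roll forward every committed change in commit order.
        for changes in self.journal:
            self.data.update(changes)


db = JournaledDatabase()
db.commit({"a": 1})
db.take_snapshot()
db.commit({"b": 2})
db.commit({"a": 3})

db.data = {}   # simulate a media failure that destroys the live database
db.restore()   # snapshot plus journal brings back the latest committed state
```

After restore, the live data again holds both changes committed after the snapshot, which is exactly what an offline backup alone would have lost.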

Replication

Replication is a server-implemented technology that protects against database failure by mirroring changes on a secondary server. It also provides a second access point for data that can protect against link failures and can share the load in high traffic situations. Like Online Backup, Replication is always active during a server run.

Replication uses two database servers -- a primary (active) server and a secondary (mirror) server, normally running on different machines. Each server has its own physical database. Before startup, the two databases are copies of each other. They are identical representations of the current state of the data.

Both servers start up at the same time and establish a communications link between them. During a run, the primary server sends committed changes to the secondary server, which mirrors the changes in its database. At any point, the two servers contain the same data in the same state. In addition, the secondary server is available for client access, in read-only mode.

Replication servers provide a fail-point solution for the physical database. The physical database for the secondary server is an up-to-date copy of the primary database. In the event that the primary database is rendered unusable, recovery can restore it from the secondary physical database or can simply switch processing to the secondary database.

Replication servers also provide an alternate connection point for clients, with restricted access (read-only). This is useful for relieving client traffic on the primary server and as a fall-back mode of operation should the primary server, primary database and/or communications link to the primary server fail. Thus, replication serves as a fail-point solution for the primary server and its communications link, although services are limited.

Using more than one replication server will enhance availability. They provide multiple points of service and recovery, distributing the load and the recovery responsibilities.

A server run with replication involves a primary database server and one or more secondary servers that mirror changes from the primary. The secondary servers normally reside on separate machines with independent storage media (disk sub-systems) for their physical database. The secondary servers can utilize separate (sub-) networks for better redundancy.

The primary and secondary start operation together. At startup, the primary server establishes a communications link to the secondaries. It uses this link to transmit database changes to the mirroring server. The secondary servers apply the changes to their local database, so that it always matches the state of the primary database. During the run, the secondary servers are available for client access (read-only) for load balancing.

Similar to the situation with Online Backup, the primary server coordinates its transaction commit with the secondary servers. This ensures the secondary databases are always up-to-date for client access and in case of failure of the primary.
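A minimal sketch of this arrangement follows, assuming in-memory dictionaries in place of physical databases and direct method calls in place of a communications link. The class names PrimaryServer and SecondaryServer are hypothetical, and forwarding changes to the mirrors before applying them locally is one possible way to coordinate the commit, not necessarily how a given server does it.

```python
class SecondaryServer:
    """Mirror server: applies changes from the primary, serves reads only."""

    def __init__(self, initial):
        self.data = dict(initial)  # starts as a copy of the primary database

    def apply(self, changes):
        # Called by the primary over the (simulated) communications link.
        self.data.update(changes)

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        # Clients connected to a secondary get read-only access.
        raise PermissionError("secondary server is read-only")


class PrimaryServer:
    """Active server: commits changes and mirrors them to every secondary."""

    def __init__(self, initial, secondaries):
        self.data = dict(initial)
        self.secondaries = list(secondaries)

    def commit(self, changes):
        # Assumption: the commit is coordinated by sending the changes to
        # each secondary first, then applying them to the local database.
        for secondary in self.secondaries:
            secondary.apply(changes)
        self.data.update(changes)


initial = {"balance": 100}
mirror = SecondaryServer(initial)
primary = PrimaryServer(initial, [mirror])

primary.commit({"balance": 80})
mirrored_balance = mirror.read("balance")  # matches the primary after commit

try:
    mirror.write("balance", 0)
    write_rejected = False
except PermissionError:
    write_rejected = True
```

Because the mirror is updated inside commit, a client reading from the secondary never sees a state older than the last committed transaction on the primary.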

If any component of the primary fails (server, database, communications link), replication servers provide several recovery options. The system can continue to operate in fall-back mode by switching client access to the secondary servers. This can be an automatic switch by the client.
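The automatic switch by the client might look like the following sketch. The names Server, ServerDown and FailoverClient are hypothetical, and a boolean flag stands in for a real connection failure; the point is only the order in which the client tries its connection points.

```python
class ServerDown(Exception):
    """Raised when a (simulated) server or its link is unavailable."""


class Server:
    def __init__(self, name, data, up=True):
        self.name = name
        self.data = dict(data)
        self.up = up  # False simulates a failed server, database or link

    def read(self, key):
        if not self.up:
            raise ServerDown(self.name)
        return self.data[key]


class FailoverClient:
    """Reads from the primary and falls back to the secondaries in order."""

    def __init__(self, primary, secondaries):
        self.servers = [primary, *secondaries]

    def read(self, key):
        for server in self.servers:
            try:
                # Return the value together with the server that answered.
                return server.read(key), server.name
            except ServerDown:
                continue  # try the next connection point
        raise ServerDown("no server available")


data = {"status": "shipped"}
primary = Server("primary", data, up=False)   # primary has failed
secondary = Server("secondary-1", data)
client = FailoverClient(primary, [secondary])

value, served_by = client.read("status")
```

In this fall-back mode only reads succeed, which matches the restricted (read-only) access that replication servers offer.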

Replication servers also provide the capability to restore the system to primary operation. Recovery can restore the primary physical database from the secondary database, after resolution of any failures in the primary system. Alternatively, a secondary server can become the primary server, perhaps switching roles with the primary server.

Fault Recovery

Fault Recovery is a server implemented technology that allows a hot switch from a primary server to a secondary server. Fault Recovery is an enhancement of a Replication system. In basic replication, failure of the primary system will cause some loss of service. Either the system will switch to secondary mode utilizing the mirror server, or recovery will be initiated to bring the system back to full operation, resulting in service downtime.

Fault Recovery or fail-over mode avoids reduction of service levels by switching the active system from the primary server to a replication server, known as a standby server. On failure of the primary system, the standby server switches to active mode, allowing update operations. The standby becomes the primary server, taking over full responsibilities. This is a hot switch to a running process; no new processes are started.

Fault Recovery is a high availability technology. It provides a solution for the major fail-points in a database server -- the server itself, the physical database and the communications link. If any of these components fail in the primary server, the standby server automatically becomes the active (primary) server with little or no effect on service levels.

While the primary server is still up and running (active), the standby server functions in the same manner as a replication server. It receives all updates from the primary. The standby server supports read-only connections by clients for access to the current state of the database.

For enhanced reliability, multiple standby servers can be configured. When the primary fails, one of the standby servers becomes the primary, and the other standbys receive their updates from the new primary.

A Fault Recovery system utilizing standby servers operates similarly to a Replication system. The primary server and secondary (standby) servers start up together, and the secondary servers receive transaction updates from the primary server. The standby servers are available for read-only access by clients.

The difference from a Replication system is when the primary fails. In a Fault Recovery system, primary server failure causes a switch-over to a standby server. High availability is maintained without reduction in the level of services.
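The switch-over can be sketched as follows, again under loud assumptions: Node and Cluster are hypothetical names, failure detection is reduced to an explicit fail_primary call, and the first live standby is promoted. How a real server detects failure and chooses a standby is not described here.

```python
class Node:
    """One database server process; standbys accept reads only."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.role = "standby"
        self.alive = True

    def write(self, key, value):
        if self.role != "primary":
            raise PermissionError(f"{self.name} is read-only")
        self.data[key] = value


class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.primary = self.nodes[0]
        self.primary.role = "primary"

    def commit(self, key, value):
        # The primary applies the update and sends it to every live standby.
        self.primary.write(key, value)
        for node in self.nodes:
            if node is not self.primary and node.alive:
                node.data[key] = value

    def fail_primary(self):
        # Hot switch: an already running standby is promoted in place;
        # no new process is started.
        self.primary.alive = False
        self.primary.role = "failed"
        self.primary = next(node for node in self.nodes if node.alive)
        self.primary.role = "primary"


a, b, c = Node("a"), Node("b"), Node("c")
cluster = Cluster([a, b, c])

cluster.commit("x", 1)   # applied on a, mirrored to b and c
cluster.fail_primary()   # a fails, b takes over as primary
cluster.commit("x", 2)   # updates continue; c now follows b
```

After the switch, updates are still accepted and the remaining standby receives them from the new primary, so clients see no reduction in service.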

Conclusion

External hardware and software solutions (such as RAID arrays, high performance computers and operating systems) can contribute to improved availability for a database system, but a complete system requires extended capabilities in the DBMS itself. DBMS technologies can ensure acceptable levels of service and recovery from multiple fail-points.

Choosing the right technology (such as the three discussed here) for a highly available system depends on a number of factors -- available resources, minimum level of service desired, the business cost of reduced services and failure recovery, as well as security and physical considerations. Here are some guidelines:

Online Backup

Online Backup protects against failure in a single component -- the physical database -- but it is the most important component. With Online Backup, failure of any component will result in loss of availability (downtime). If the physical database becomes unusable, Online Backup provides full recovery by restoring the snapshot backup and rolling forward changes from the journal.

Advantages:

Replication Servers

Replication servers are used when a higher level of availability is desired. Replication servers provide full protection against physical database failure. In addition, they provide an alternate client connection point and a fall-back mode of operation should the primary server fail. Replication servers make recovery easier, though some downtime may result.

Advantages:

Fault Recovery

When the highest level of availability is desired, standby servers are used to provide Fault Recovery (fail-over) processing. Fault Recovery supports protection for all components of the primary system -- server, database and communications link. Like replication servers, standby servers provide alternate client connection points. Unlike replication servers, Fault Recovery automatically switches the standby server to fully active on primary failure, with little or no downtime.

Advantages:

An additional configuration choice is to use standby servers together with replication servers. The replication servers add an extra level of protection. While this approach seems like adding a belt to suspenders (to hold up your pants), it is appropriate when there are special constraints -- physical, security, resources, ...


Copyright © 2004 FFE Software, Inc. All Rights Reserved Worldwide