High Availability for Database Systems in Geographically Distributed Cloud Computing Environments

Meng, Huangdong

High Availability for Database Systems in Geographically Distributed Cloud Computing Environments

Files

Meng_Huangdong.pdf (1.24 MB)

Date

2014-01-21

Authors

Meng, Huangdong

Publisher

University of Waterloo

Abstract

In recent years, cloud storage systems have become very popular due to their good scal- ability and high availability. However, these storage systems provide limited transactional capabilities, which makes developing applications that use these systems substantially more difficult than developing applications that use a traditional SQL-based relational database management systems (DBMS). There have been solutions that provide transactional SQL-based DBMS services on the cloud, including solutions that use cloud shared storage systems to store the data. However, none of these solutions take advantage of the shared cloud storage architecture to provide DBMS high availability. These solutions typically deal with the failure of a DBMS server by restarting this server and going through crash recovery based on the transaction log, which can lead to long DBMS service downtimes that are not acceptable to users. It is possible to run traditional DBMS high availability solutions in cloud environments. These solutions are typically based on shipping the transaction log from a primary server to a backup server, and replaying the log at the backup server to keep it up to date with the primary. However, these solutions do not work well if the primary and backup are in different, geographically distributed data centers due to the high latency of log shipping. Furthermore, these solutions do not take advantage of the capabilities of the underlying shared storage system. We present a new transparent high availability system for transactional SQL-based DBMS on a shared storage architecture, which we call CAC-DB (Continuous Access Cloud DataBase). Our system is especially designed for eventually consistent cloud storage systems that run efficiently in multiple geographically distributed data centers. The database and transaction logs are stored in such a storage system, and therefore remain available after a failure up to the failure of an entire data center (e.g., in a natural disaster). CAC-DB takes advantage of this shared storage to ensure that the DBMS service remains available and transactionally consistent in the face of failures up to the loss of one or more data centers. By taking advantage of shared storage, CAC-DB can run in a geographically distributed environment with minimal overhead as compared to traditional log shipping solutions. In CAC-DB, an active (primary) and a standby (backup) DBMS run on different servers in different data centers. The standby catches up with the active's memory state by replaying the shared log. When the active crashes, the standby can finish the failover process and reach peak throughput very quickly. The DBMS service only experiences several seconds of downtime. While the basic idea of replaying the log is simple and not new, the shared storage environment poses many new challenges including the need for synchronization protocols, new buffer pool management mechanisms, approaches for guaranteeing strong consistency without sacrifi cing performance and new shared storage based failure detection mechanism. This thesis solves these challenges and presents a system that achieves the following goal: if a data center fails, not only does the persistent image of the database on the storage tier survive, but also the DBMS service can resume almost uninterrupted and reach peak throughput in a very short time. At the same time, the throughput of the DBMS service in normal processing is not negatively affected. Our experiments with CAC-DB running on EC2 con rm that it can achieve the above goals.