Enhancing Data Processing on Clouds with Hadoop/HBase

Zhang, Chen

dc.contributor.author	Zhang, Chen
dc.date.accessioned	2011-10-12 14:34:35 (GMT)
dc.date.available	2011-10-12 14:34:35 (GMT)
dc.date.issued	2011-10-12T14:34:35Z
dc.date.submitted	2011
dc.identifier.uri	http://hdl.handle.net/10012/6361
dc.description.abstract	In the current information age, large amounts of data are being generated and accumulated rapidly in various industrial and scientific domains. This places heavy demands on data processing capabilities that can extract meaningful and valuable information from large amounts of data in a timely manner. Hadoop, the open-source implementation of Google's data processing framework (MapReduce, Google File System and BigTable), is becoming increasingly popular and is being used to solve data processing problems in various application scenarios. However, because it was originally designed for handling very large data sets that can be divided easily into parts and processed independently with limited inter-task communication, Hadoop is less applicable to a wider range of use cases. As a result, many projects are under way to enhance Hadoop for different application needs, such as data warehousing, machine learning, and data mining. This thesis is one such research effort. The goal of the thesis research is to design novel tools and techniques to extend and enhance the large-scale data processing capability of Hadoop/HBase on clouds, and to evaluate their effectiveness in performance tests on prototype implementations. Two main research contributions are described. The first contribution is a light-weight computational workflow system called "CloudWF" for Hadoop. The second contribution is a client library called "HBaseSI" supporting transactional snapshot isolation (SI) in HBase, Hadoop's database component. CloudWF addresses the problem of automating the execution of scientific workflows composed of both MapReduce and legacy applications on clouds with Hadoop/HBase. CloudWF is the first computational workflow system built directly using Hadoop/HBase.
It uses novel methods for workflow directed acyclic graph decomposition, storing and querying dependencies in HBase sparse tables, transparent file staging, and decentralized workflow execution management that relies on the MapReduce framework for task scheduling and fault tolerance. HBaseSI addresses the problem of maintaining strong transactional data consistency in HBase tables. This is the first SI mechanism developed for HBase. HBaseSI uses novel methods that let individual clients handle distributed transaction management autonomously. These methods greatly simplify the design of HBaseSI and can be generalized to other column-oriented stores with an architecture similar to that of HBase. Because of its simple design, HBaseSI adds low overhead to HBase performance and directly inherits many desirable properties of HBase. HBaseSI is non-intrusive to existing HBase installations and user data, and is designed to work with large clouds in terms of both data size and number of nodes.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	Cloud	en
dc.subject	Hadoop	en
dc.subject	HBase	en
dc.subject	Snapshot Isolation	en
dc.subject	Distributed Transaction	en
dc.subject	Workflow	en
dc.subject	Data Processing	en
dc.title	Enhancing Data Processing on Clouds with Hadoop/HBase	en
dc.type	Doctoral Thesis	en
dc.pending	false	en
dc.subject.program	Computer Science	en
uws-etd.degree.department	School of Computer Science	en
uws-etd.degree	Doctor of Philosophy	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: Zhang_Chen.pdf
Size: 2.179Mb
Format: PDF
