Show simple item record

dc.contributor.authorZhang, Chen
dc.date.accessioned2011-10-12 14:34:35 (GMT)
dc.date.available2011-10-12 14:34:35 (GMT)
dc.date.issued2011-10-12T14:34:35Z
dc.date.submitted2011
dc.identifier.urihttp://hdl.handle.net/10012/6361
dc.description.abstractIn the current information age, large amounts of data are being generated and accumulated rapidly in various industrial and scientific domains. This imposes important demands on data processing capabilities that can extract sensible and valuable information from the large amount of data in a timely manner. Hadoop, the open source implementation of Google's data processing framework (MapReduce, Google File System and BigTable), is becoming increasingly popular and being used to solve data processing problems in various application scenarios. However, being originally designed for handling very large data sets that can be divided easily in parts to be processed independently with limited inter-task communication, Hadoop lacks applicability to a wider usage case. As a result, many projects are under way to enhance Hadoop for different application needs, such as data warehouse applications, machine learning and data mining applications, etc. This thesis is one such research effort in this direction. The goal of the thesis research is to design novel tools and techniques to extend and enhance the large-scale data processing capability of Hadoop/HBase on clouds, and to evaluate their effectiveness in performance tests on prototype implementations. Two main research contributions are described. The first contribution is a light-weight computational workflow system called "CloudWF" for Hadoop. The second contribution is a client library called "HBaseSI" supporting transactional snapshot isolation (SI) in HBase, Hadoop's database component. CloudWF addresses the problem of automating the execution of scientific workflows composed of both MapReduce and legacy applications on clouds with Hadoop/HBase. CloudWF is the first computational workflow system built directly using Hadoop/HBase. It uses novel methods in handling workflow directed acyclic graph decomposition, storing and querying dependencies in HBase sparse tables, transparent file staging, and decentralized workflow execution management relying on the MapReduce framework for task scheduling and fault tolerance. HBaseSI addresses the problem of maintaining strong transactional data consistency in HBase tables. This is the first SI mechanism developed for HBase. HBaseSI uses novel methods in handling distributed transactional management autonomously by individual clients. These methods greatly simplify the design of HBaseSI and can be generalized to other column-oriented stores with similar architecture as HBase. As a result of the simplicity in design, HBaseSI adds low overhead to HBase performance and directly inherits many desirable properties of HBase. HBaseSI is non-intrusive to existing HBase installations and user data, and is designed to work with a large cloud in terms of data size and the number of nodes in the cloud.en
dc.language.isoenen
dc.publisherUniversity of Waterlooen
dc.subjectClouden
dc.subjectHadoopen
dc.subjectHBaseen
dc.subjectSnapshot Isolationen
dc.subjectDistributed Transactionen
dc.subjectWorkflowen
dc.subjectData Processingen
dc.titleEnhancing Data Processing on Clouds with Hadoop/HBaseen
dc.typeDoctoral Thesisen
dc.pendingfalseen
dc.subject.programComputer Scienceen
uws-etd.degree.departmentSchool of Computer Scienceen
uws-etd.degreeDoctor of Philosophyen
uws.typeOfResourceTexten
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.

DSpace software

Service outages