Implementations of iterative algorithms in Hadoop and Spark

Lai, Junyu

Files

Lai_Junyu.pdf (2.84 MB)

Date

2014-07-29

Authors

Lai, Junyu

Publisher

University of Waterloo

Abstract

Companies such as Facebook, Amazon, and Twitter generate large amounts of data, and cloud computing frameworks such as Hadoop are used to store and process this Big Data. Hadoop, an open-source cloud computing framework, is popular because of its scalability and fault tolerance. However, because it frequently writes data to and reads data from the Hadoop Distributed File System (HDFS), Hadoop is slow in many applications. Apache Spark, a newer cloud computing framework developed at the AMPLab at UC Berkeley, addresses this problem by caching data in memory. Spark introduces an abstraction called the resilient distributed dataset (RDD), which is both scalable and fault-tolerant. In this thesis, we describe the architectures of Hadoop and Spark and discuss their differences. We discuss the properties of RDDs and how they work in Spark in detail, as a guide to using them efficiently. The main contribution of the thesis is an implementation of the PageRank algorithm and the Conjugate Gradient (CG) method in both Hadoop and Spark, showing how Spark outperforms Hadoop by taking advantage of memory caching.
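
To illustrate why an iterative method such as CG can benefit from in-memory caching, the following is a minimal single-machine sketch of the CG method in plain Python. It is not the thesis's Hadoop or Spark code; the function names (`mat_vec`, `dot`, `conjugate_gradient`), the tolerance, and the iteration cap are assumptions made for illustration. Note that the matrix `A` is read once per iteration, which is the kind of reused data a framework could keep in memory instead of re-reading from disk.

```python
def mat_vec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (illustrative sketch)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ap = mat_vec(A, p)  # the only step that touches A in each iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small symmetric positive definite example (values chosen for illustration).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

In a distributed setting, the matrix-vector product is the step that would be spread across workers, while the vectors `x`, `r`, and `p` are small enough to update after each pass.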

Keywords

Hadoop, Spark, Resilient Distributed Datasets, Conjugate Gradient method

URI

http://hdl.handle.net/10012/8586

Collections

Theses
Applied Mathematics
