 

Data-intensive Scheduling

dc.contributor.author: MENG, XIAO
dc.date.accessioned: 2019-09-19T19:39:23Z
dc.date.available: 2019-09-19T19:39:23Z
dc.date.issued: 2019-09-19
dc.date.submitted: 2019-09-16
dc.description.abstract: In many modern data management scenarios, we encounter tasks, operations, or computational phases that are data-intensive: the sheer volume of data becomes the performance bottleneck. For data-intensive tasks, the bottleneck is data loading, where the cost of bringing data into memory outweighs the cost of the actual computation. For data-intensive shuffling, the bottleneck is data transfer, where intermediate data are scattered across nodes and must be shuffled for further processing. This thesis addresses two data-intensive scheduling problems: (1) multi-processor scheduling of data-intensive tasks to reduce redundant data loading, and (2) reducer scheduling for data-intensive shuffling to reduce redundant data communication. For data-intensive tasks, we focus on workloads with precedence constraints arising from data dependencies, which are common in applications such as data analytics and ETL processing. These workloads are often known in advance, are presented as directed acyclic graphs (DAGs), and are data-intensive and sensitive to cache misses. We solve the problem of scheduling DAGs of data-intensive tasks on multiple processors or machines so as to minimize execution time, and we propose scheduling algorithms that take cache misses into account. Simulations and an experimental evaluation on a Spark cluster demonstrate the advantages of our solutions in terms of workload completion time. For data-intensive shuffling, we focus on MapReduce-style processing, where communication overhead is incurred in the Shuffle stage that sends intermediate results from mappers to reducers. We solve the following problem: given a collection of mapper outputs (intermediate key-value pairs) and a partitioning of this collection among the reducers, on which node should each reducer run to minimize data transfer? We reduce two natural formulations of this problem to optimization problems for which polynomial-time solutions exist. We show that our techniques can cut communication costs by 50 percent or more compared to Hadoop's default reducer placement, leading to lower network utilization and faster MapReduce job runtimes.
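
An illustrative aside on the reducer-placement question in the abstract: one natural formulation is a minimum-cost assignment problem, which is solvable in polynomial time. The sketch below is not the thesis's actual algorithm; it assumes each node hosts at most one reducer and that the map stage has recorded how many bytes of each reducer's input already reside on each node. The function place_reducers and the demo matrix are hypothetical.

# A minimal sketch of reducer placement as a minimum-cost assignment problem.
# Assumptions (not taken from the thesis): each node hosts at most one reducer,
# and bytes_on_node[r][n] is the volume of reducer r's intermediate key-value
# pairs already resident on node n after the map stage.
import numpy as np
from scipy.optimize import linear_sum_assignment

def place_reducers(bytes_on_node):
    """bytes_on_node: R x N matrix; entry [r, n] = bytes of partition r on node n.

    Running reducer r on node n forces it to fetch everything *not* already on n,
    so its transfer cost is (total bytes of partition r) - bytes_on_node[r, n].
    Minimizing total transfer over all one-reducer-per-node placements is a
    classic assignment problem, solvable in polynomial time.
    """
    bytes_on_node = np.asarray(bytes_on_node, dtype=float)
    partition_totals = bytes_on_node.sum(axis=1, keepdims=True)
    transfer_cost = partition_totals - bytes_on_node        # R x N cost matrix
    reducers, nodes = linear_sum_assignment(transfer_cost)  # polynomial-time assignment solver
    placement = dict(zip(reducers.tolist(), nodes.tolist()))
    total_cost = transfer_cost[reducers, nodes].sum()
    return placement, total_cost

# Example: 3 reducer partitions, 4 candidate nodes (bytes of map output per node).
demo = [[40, 10,  0,  5],
        [ 0, 30, 25,  0],
        [15,  0, 50, 10]]
placement, cost = place_reducers(demo)
print(placement, cost)  # {0: 0, 1: 1, 2: 2}, only the remaining bytes are shuffled

In this toy run, each partition is assigned to the node that already holds most of its map output, and only the bytes stored elsewhere count toward the shuffle cost.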
dc.identifier.uri: http://hdl.handle.net/10012/15084
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: scheduling
dc.subject: database
dc.subject: data analytics
dc.title: Data-intensive Scheduling
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws.contributor.advisor: Golab, Lukasz
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle
Name: MENG_XIAO.pdf
Size: 2.76 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.08 KB
Format: Item-specific license agreed upon to submission