Multi-agent Learning for Cooperative Scheduling of Microsecond-scale Services at Rack Scale

Hossein Abbasi Abyaneh, Ali

Files

HosseinAbbasiAbyaneh_Ali.pdf (818.89 KB)

Date

2022-01-25

Authors

Hossein Abbasi Abyaneh, Ali

Advisor

Zahedi, Seyed Majid

Publisher

University of Waterloo

Abstract

This work considers the load-balancing problem in dense racks running microsecond-scale services. In such a system, balancing the load among hundreds to thousands of cores requires making millions of scheduling decisions per second. Achieving this throughput while providing microsecond-scale tail latency and high availability is extremely challenging. To address this challenge, we design a fully distributed load-balancing framework in which servers cooperatively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. In this game, servers make scheduling decisions upon receiving and completing tasks. When a server receives a task, it decides whether to keep the task or migrate it to another server. When a server completes a task, it decides whether to steal a task from another server. We propose a distributed multi-agent learning algorithm to find the game's parametric Nash equilibrium. Our proposed algorithm enables servers to make scheduling decisions in tens of nanoseconds based on (possibly outdated) estimates of the load on other servers. We implement and deploy our distributed load-balancing algorithm on a rack-scale computer with 264 physical cores. We compare our load-balancing algorithm with state-of-the-art load-balancing disciplines. Our proposed solution provides up to 20% more throughput at low tail latency than widely used load-balancing policies.
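The two decision points described in the abstract (keep or migrate on arrival, steal or not on completion) can be illustrated with a minimal sketch. This is not the thesis's learned policy: the threshold rule, the `ThresholdPolicy` class, and the names `migrate_gap`, `steal_gap`, and `est` are hypothetical, chosen only to show how a server might act on possibly outdated load estimates using a small set of parameters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdPolicy:
    """Hypothetical parametric policy (an assumption, not from the thesis)."""
    migrate_gap: int  # migrate if own load exceeds the least-loaded estimate by more than this
    steal_gap: int    # steal if the most-loaded estimate exceeds own load by more than this


def on_task_arrival(me: int, est: List[int], policy: ThresholdPolicy) -> int:
    """Return the index of the server that should run a newly received task.

    `est` holds this server's (possibly outdated) queue-length estimates
    for every server, including its own entry at index `me`.
    """
    others = [s for s in range(len(est)) if s != me]
    if not others:
        return me
    target = min(others, key=lambda s: est[s])
    # Keep the task unless another server looks sufficiently less loaded.
    return target if est[me] - est[target] > policy.migrate_gap else me


def on_task_completion(me: int, est: List[int], policy: ThresholdPolicy) -> Optional[int]:
    """Return the index of a server to steal from, or None to steal nothing."""
    others = [s for s in range(len(est)) if s != me]
    if not others:
        return None
    victim = max(others, key=lambda s: est[s])
    # Steal only if another server looks sufficiently more loaded.
    return victim if est[victim] - est[me] > policy.steal_gap else None
```

For example, with estimates `[5, 1, 3]` and both gaps set to 2, server 0 would migrate an arriving task to server 1, while server 1 would keep its own arrivals and, on completing a task, steal from server 0. In a learning setting, the two gaps would be the parameters each server tunes; how the thesis parameterizes and learns its policy is not specified in the abstract.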

URI

http://hdl.handle.net/10012/17968

Collections

Theses
Electrical and Computer Engineering
