CLPush: Proactive Cache Transfers in NUMA Applications


Date

2023-09-26

Authors

Pathak, Gautam

Advisor

Brown, Trevor

Publisher

University of Waterloo

Abstract

Modern Non-Uniform Memory Access (NUMA) systems support 128 or more hardware threads to serve high-performance applications. These systems typically employ a scalable directory-based cache-coherence mechanism, usually with an invalidation-based protocol, to ensure that every core observes the most up-to-date data. NUMA applications incur significant overhead when data is absent from a socket's cache and must be fetched from a cache on another socket. For example, in producer-consumer workloads where the producer and consumer threads reside on different sockets, consuming data on a socket other than the one where it was produced can be extremely expensive, because coherence messages must cross the socket interconnect whenever the consumer threads request the shared data. In this thesis, I present a cache manipulation instruction, coined CLPush, which proactively transfers cache lines to a predetermined destination so as to reduce demand misses and improve performance. The optimization takes the form of an instruction hint that directs a cache to send its data to another, predetermined destination. I present several variants of CLPush, which transfer data to one or more destinations. I also discuss potential use cases for this instruction in different applications, such as the producer-consumer problem and Futures and Promises, and I analyse the performance of CLPush in two variants of the producer-consumer problem.

Keywords

cache, coherence, multicore, NUMA, non uniform memory access
