CLPush: Proactive Cache Transfers in NUMA Applications
Date
2023-09-26
Authors
Pathak, Gautam
Advisor
Brown, Trevor
Publisher
University of Waterloo
Abstract
Modern Non-Uniform Memory Access (NUMA) systems support as many as 128 hardware threads to serve high-performance applications. These systems typically employ a scalable, directory-based cache-coherence mechanism to ensure that the most up-to-date data is shared among all the cores, and commonly use invalidation-based protocols. NUMA applications incur significant overhead when data is not present in a particular socket's cache and must be fetched from a cache in another socket. For example, in applications such as the producer-consumer problem, when threads reside in two different sockets, consuming data from a socket other than the one where it was produced can be extremely expensive.
This cost arises because coherence messages must cross sockets whenever the consumer threads request the shared data. In this thesis, I present a cache manipulation instruction, coined CLPush, which proactively transfers data to a predetermined destination, reducing cache demand misses and improving performance.
The optimization is presented as an instruction hint to the processor that directs a cache to forward data to one or more predetermined destinations. I present several variants of CLPush, which differ in the number of destinations the data is transferred to. I also discuss potential use cases of this instruction in different applications, such as the producer-consumer problem and Futures and Promises, and I analyse the performance of CLPush in two variants of the producer-consumer problem.
Keywords
cache, coherence, multicore, NUMA, non uniform memory access