Understanding NUMA Effects on Memory Allocation and Reclamation

Kim, Daewoo | University of Waterloo | 2023 | Brown, Trevor | 2023-04-21 | 2023-04-11 | en | Master Thesis | http://hdl.handle.net/10012/19301

Memory management in multicore systems is a well-studied area. Many approaches to memory management have been developed and tuned with specific hardware architectures in mind, capitalizing on hardware characteristics to improve performance. In this thesis, the focus is on memory allocation and reclamation in multicore systems. I first identify and diagnose a performance anomaly in epoch-based memory reclamation (EBR), one of the most popular approaches to reclaiming memory in multicore systems. EBR experiences significant performance degradation when running on multiple processor sockets. This degradation is related to the fact that EBR is vulnerable to thread delays: even minor delays can trigger a chain reaction that induces longer delays and more substantial performance problems. Moreover, I discover a negative interaction between EBR and popular memory allocators, caused by the fact that EBR frees batches of objects while these allocators attempt to cache batches of objects for reallocation. The batches freed by EBR frequently overflow the allocator buffers, defeating their purpose and causing substantial performance overhead.

To solve these issues, an improvement to EBR, called amortized batch free, is introduced to limit the amplification of delays and the performance degradation when freeing. Amortized batch free reclaims objects gradually, and can drastically reduce the average time spent freeing an object. This technique is applied to a state-of-the-art EBR algorithm, and significant performance improvements are shown experimentally. The amortized batch freeing technique appears broadly applicable to other memory reclamation algorithms. As a first step in demonstrating this, I also apply it to a simple token-based variant of EBR.
Token EBR is conceptually simpler and easier to implement than the state-of-the-art EBR algorithm, but has been shown in other work to perform poorly. When the amortized batch free technique is used, Token EBR performs similarly to (and sometimes slightly better than) the state-of-the-art EBR algorithm.

Finally, I present a new design for an architecture-aware memory allocator for multi-socket systems, using a state-of-the-art allocator called Supermalloc as the starting point. Several key bottlenecks in the original Supermalloc design are improved or eliminated in the new design. In particular, the new design dramatically improves performance when the address space is actively growing, reduces contention on shared resources, and optimizes memory accesses to reduce communication across processor sockets. Taking into account the lessons learned in the study of EBR, the new design also attempts to minimize the overhead of freeing objects. Experiments on a prototype of this new allocator show some performance improvement compared to the original Supermalloc allocator.
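
The idea of reclaiming objects gradually instead of in one large batch can be illustrated with a minimal C++ sketch. It is not the thesis's implementation: the class name `AmortizedFreeList`, the per-operation `quota`, and the single-threaded structure are assumptions made purely for illustration. The sketch assumes some reclamation scheme (such as EBR) has already determined that a batch of objects is safe to free.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical node type standing in for a data structure's objects.
struct Node { int key; };

// Illustrative sketch only: all names and the quota policy are assumptions,
// not taken from the thesis. One instance would be owned by a single thread.
class AmortizedFreeList {
public:
    explicit AmortizedFreeList(std::size_t quota) : quota_(quota) {}
    AmortizedFreeList(const AmortizedFreeList&) = delete;
    AmortizedFreeList& operator=(const AmortizedFreeList&) = delete;

    // Called when the reclamation scheme decides a whole batch is safe.
    // The batch is queued rather than freed on the spot.
    void enqueueSafeBatch(std::vector<Node*>& batch) {
        pending_.insert(pending_.end(), batch.begin(), batch.end());
        batch.clear();
    }

    // Called once per data structure operation: frees at most quota_
    // objects, spreading the cost of the batch over many operations.
    std::size_t freeSome() {
        std::size_t freed = 0;
        while (freed < quota_ && !pending_.empty()) {
            delete pending_.front();
            pending_.pop_front();
            ++freed;
        }
        return freed;
    }

    std::size_t pending() const { return pending_.size(); }

    // Release anything still queued when the owner goes away.
    ~AmortizedFreeList() {
        for (Node* n : pending_) delete n;
    }

private:
    std::size_t quota_;          // max objects freed per call (assumed policy)
    std::deque<Node*> pending_;  // objects known to be safe but not yet freed
};
```

In this sketch a thread would call `freeSome()` at the start or end of each operation, so the allocator receives a small, steady trickle of frees instead of a burst that could overflow its per-thread buffers.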