High Performance and Predictable Memory Request Handling for Safety-Critical Multicores

Advisor

Patel, Hiren

Publisher

University of Waterloo

Abstract

Multicore platforms are becoming the de facto compute platforms for safety-critical systems such as those used in automotive and avionics applications. Recent works improve the timing predictability and performance of safety-critical systems by designing and implementing predictable cache coherence protocols for these multicores. However, their applicability is limited by two restrictive assumptions about the system architecture. First, prior works assume a single-level cache hierarchy in which each core has a private cache and all cores share the main memory. In practice, however, multicores use a multi-level cache hierarchy in which each core has a private cache and all cores share a last-level cache, which improves performance by reducing the average memory access latency. Second, prior works on predictable cache coherence protocols assume that each core can issue at most one outstanding memory request. With the recent interest in adopting out-of-order cores in safety-critical systems, this restriction can limit overall system performance.

This thesis addresses these restrictions and improves the performance of safety-critical systems by designing and implementing predictable cache coherence protocols for multicore systems with a multi-level cache hierarchy and multiple outstanding memory requests. The thesis makes the following main contributions.

First, we show that sharing an inclusive last-level cache among the cores of a multicore system can lead to unbounded worst-case latency for memory requests. We further show that, under a 1S-TDM arbitration schedule, the worst-case latency of memory requests can be bounded; however, the bound remains prohibitively large because it contains a multiplicative factor of the number of cache lines in the private caches. This large worst-case latency results from the interaction between the back-invalidations and write-backs of private cache blocks when multiple cores access the same cache set in the last-level cache. To address this issue, we propose a novel technique that provides predictable performance for memory requests when cores share a last-level cache partition. Our approach eliminates the multiplicative factor of the number of private cache lines from the worst-case latency bound.

Second, we observe that the large worst-case memory request latency under a shared last-level cache calls into question the applicability of inclusive last-level caches in safety-critical multicores. To address this, we propose a novel cache architecture, the ZeroCost LLC, that provides predictable performance for memory requests when cores share a last-level cache partition. The ZeroCost LLC completely eliminates back-invalidations in the inclusive last-level cache by relocating cache lines during a conflicting set access. It thereby enables the use of an inclusive last-level cache in safety-critical multicores where cores share data, without incurring worst-case latency penalties.
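To make the back-invalidation problem concrete, the following minimal Python sketch models an inclusive last-level cache shared by private caches. The direct-mapped organization and all names are illustrative assumptions, not the hardware design evaluated in the thesis; the sketch only shows why an LLC eviction forces invalidations in the private caches.

```python
# Minimal model of an inclusive last-level cache (LLC) shared by private
# caches. Direct-mapped organization and names are illustrative assumptions.

class InclusiveLLC:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.lines = {}             # set index -> resident line address
        self.private_copies = {}    # line address -> set of core ids with a copy

    def access(self, core, addr):
        idx = addr % self.num_sets
        victim = self.lines.get(idx)
        if victim is not None and victim != addr:
            # Inclusion forces a back-invalidation: every private copy of the
            # evicted line must be invalidated (dirty copies written back)
            # before the new line is installed. When several cores repeatedly
            # conflict on the same set, these invalidations and write-backs
            # serialize, which is the mechanism behind the large latency bound.
            for c in self.private_copies.pop(victim, set()):
                print(f"back-invalidate line {victim} in core {c}'s private cache")
        self.lines[idx] = addr
        self.private_copies.setdefault(addr, set()).add(core)

llc = InclusiveLLC(num_sets=4)
llc.access(core=0, addr=0)   # core 0 installs line 0
llc.access(core=1, addr=4)   # line 4 maps to the same set: line 0 is back-invalidated
```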
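The sketch below illustrates the relocation idea behind the ZeroCost LLC, under the assumption that a conflicting resident line may be moved to a free location in the same LLC partition instead of being evicted. The free-slot search and the handling of relocated lines are placeholders; the thesis defines the actual relocation policy.

```python
# Sketch of conflict-time relocation, assuming a victim line can be moved to a
# free slot in the same partition instead of being evicted. Placeholder policy;
# a real design must also track relocated lines for later lookups.

class RelocatingLLC:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # slot index -> line address or None

    def access(self, addr):
        idx = addr % len(self.slots)      # the line's home slot
        victim = self.slots[idx]
        if victim is not None and victim != addr:
            free = next((i for i, line in enumerate(self.slots) if line is None), None)
            if free is not None:
                # Relocate instead of evict: the victim stays in the LLC, so no
                # back-invalidation is sent to the private caches.
                self.slots[free] = victim
                print(f"relocated line {victim} from slot {idx} to slot {free}")
            # If no slot is free, a real design must still resolve the
            # conflict; that policy is beyond this sketch.
        self.slots[idx] = addr

llc = RelocatingLLC(num_slots=4)
llc.access(0)   # line 0 installed in its home slot
llc.access(4)   # conflicts with line 0: line 0 is relocated, not back-invalidated
```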
Third, to further improve performance, we lift the restriction in prior works on coherent data sharing that each core may have at most one outstanding memory request. Cores that can issue multiple memory requests, such as out-of-order cores, improve system performance by allowing more memory requests to be serviced in parallel. We observe, however, that when multiple outstanding memory requests are enabled, the mechanism that enforces the memory consistency model in the cores can increase the worst-case latency of memory requests. To address this issue, we propose a novel approach in the cache coherence protocol that provides a predictable worst-case latency for memory requests when multiple outstanding memory requests are enabled.

Finally, we investigate scaling up the number of cores in a multicore system with a predictable directory-based cache coherence protocol, Ditty. Ditty leverages the observation that the arbitration of coherence messages in the directory is the main source of unpredictability in the cache coherence protocol, and it deploys a predictable arbitration mechanism to provide predictable performance for memory requests in a multicore system.
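As an illustration of slot-based predictable arbitration, the style of mechanism on which such worst-case bounds rest, the sketch below computes grant times and the worst-case wait under a simple per-core TDM schedule. The schedule and parameters are generic assumptions, not Ditty's actual arbitration scheme.

```python
# Generic time-division-multiplexed (TDM) arbiter sketch showing why slot-based
# arbitration yields a traffic-independent worst-case bound. Illustrative only.

class TDMArbiter:
    def __init__(self, num_cores, slot_cycles):
        self.num_cores = num_cores
        self.slot_cycles = slot_cycles

    def grant_cycle(self, core, request_cycle):
        """Earliest cycle at which `core`'s slot starts at or after request_cycle."""
        period = self.num_cores * self.slot_cycles
        slot_start = core * self.slot_cycles
        periods_to_wait = max(0, -(-(request_cycle - slot_start) // period))
        return slot_start + periods_to_wait * period

    def worst_case_wait(self):
        # A request that just misses its slot waits at most one full TDM
        # period, regardless of what the other cores are doing.
        return self.num_cores * self.slot_cycles

arb = TDMArbiter(num_cores=4, slot_cycles=10)
print(arb.grant_cycle(core=2, request_cycle=35))  # 60: core 2's next slot start
print(arb.worst_case_wait())                      # 40: bound independent of traffic
```

The point of the sketch is that each core's wait depends only on the schedule length, which is what makes the per-request latency analyzable in isolation from the other cores' traffic.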
