A Scalable Recoverable Skip List for Persistent Memory on NUMA Machines

Chowdhury, Sakib

A Scalable Recoverable Skip List for Persistent Memory on NUMA Machines

Files

Chowdhury_Sakib.pdf (3.39 MB)

Date

2021-10-20

Authors

Chowdhury, Sakib

Advisor

Golab, Wojciech

Publisher

University of Waterloo

Abstract

Interest in recoverable, persistent-memory-resident (PMEM-resident) data structures is growing as availability of Intel Optane Data Center Persistent Memory increases. An interesting use case for in-memory, recoverable data structures is for database indexes, which need high availability and reliability. Skip lists are a data structure particularly well-suited for usage as a fully PMEM-resident index, due to their reduced amount of writes from their probabilistic balancing in comparison to other index data structures like B-trees. The Untitled Persistent Skip List (UPSkipList) is a PMEM-resident recoverable skip list derived from Herlihy et al.'s lock-free skip list algorithm. It is developed using a new conversion technique that extends the RECIPE algorithm by Lee et al. to work on lock-free algorithms with non-blocking writes and no inherent recovery mechanism. It does this by tracking the current time period between two failures, or failure-free epoch, and recording the current epoch in nodes when they are being modified. This way, an observing thread can determine if an inconsistent node is being modified in this epoch or was being modified in a previous epoch and now is in need of recovery. The algorithm is also extended to support concurrent data node splitting to improve performance, which is easily made recoverable using the extension to RECIPE allowing detection of incomplete node splits. UPSkipList also supports cache-efficient NUMA awareness of dynamically allocated objects using an extension to the Region-ID in Value (RIV) method by Chen et al. By using additional bits after the most significant bits in an RIV pointer to indicate the object in which the remaining bits are referenced relative to, chunks of memory can by dynamically allocated to UPSkipList from multiple shared pools without the need for fat pointers, which reduce cache efficiency by halving the number of pointers that can fit in a cache line. This combines the benefits of both the RIV method and the dynamic memory allocation method built into the Persistent Memory Development Kit (PMDK), improving both performance and practicality. Additionally, memory manually managed within a chunk using the RIV method can have its recovery after a crash deferred to the next attempted allocation by a thread sharing the ID with the thread responsible for the allocation of the memory being recovered, reducing recovery time for large pools with many threads active during the time of a crash. Comparison was done against the BzTree of Arulraj et al., as implemented by Lersch et al., which has non-blocking, non-repairing writes implemented using the persistent multi-word CAS (PMwCAS) primitive by Wang et al., and a transactional recoverable skip list implemented using the PMDK. Tested with the Yahoo Cloud Serving Benchmark (YCSB), UPSkipList achieves better performance in write-heavy workloads at high levels of concurrency than BzTree, and outperforms the PMDK-based skip list, due to the PMDK-based skip list's higher average latency. Using the extended RIV pointers to dynamically allocate memory resulted in a 40% performance increase over using the PMDK's fat pointers. The impact of NUMA awareness using multiple pools of memory compared with striping a single pool across multiple nodes was found to only be a 5.6% decrease in performance. Finally, recovery time of UPSkipList was found to be comparable to the PMDK-based skip list, and 9 times faster than BzTree with 500K descriptors in its PMwCAS pool. Correctness of UPSkipList and its conversion and recovery techniques were tested using black-box recoverable linearizability analysis, which found UPSkipList to be free of strict linearizability errors across 30 trials.