Redesigning Datacenter Systems to Leverage Hardware-Acceleration

Udayashankar, Sreeharsha

Files

Udayashankar_Sreeharsha.pdf (18.4 MB)

Date

2026-05-27

Authors

Udayashankar, Sreeharsha

Advisor

Al-Kiswany, Samer

Publisher

University of Waterloo

Abstract

The exponential growth of digital data generation imposes severe performance and efficiency demands on modern datacenter infrastructure, creating unique, interrelated challenges. Datacenter infrastructure must offer high storage capacity, achieve high throughput, and support modern workloads that require low-latency data processing. While hardware accelerators, such as CPUs supporting SIMD vector instruction sets and network switches supporting P4-based programmability, have the potential to help meet these requirements, their adoption in large-scale systems is hindered by restrictive programming models and resource constraints. This thesis addresses these challenges by redesigning deduplicated storage systems and cluster schedulers to leverage hardware acceleration effectively. It enables high-throughput data reduction in deduplicated storage systems (Chapters 3, 4, and 5) through two approaches: redesigning them to use the SIMD capabilities of modern CPUs, and reducing the computation needed to achieve data reduction. It enables low-latency data processing by leveraging in-network acceleration for cluster scheduling (Chapter 6). The thesis first presents VectorCDC (Chapter 3), a method for accelerating data deduplication by restructuring hashless content-defined chunking (CDC) algorithms to exploit vector instructions. By identifying and optimizing the common processing patterns these algorithms use, Extreme Byte Searches and Range Scans, VectorCDC significantly improves their chunking throughput. VRAM, the fastest VectorCDC-accelerated algorithm, achieves throughput improvements of 8.35×–26.2× over existing vector-accelerated techniques and up to 207.2× over unaccelerated baselines. Importantly, VectorCDC maintains its throughput advantages across x86, ARM, and IBM CPU architectures.
While generally competitive with their hash-based counterparts, these hashless CDC algorithms achieve lower deduplication efficiency on datasets with specific pathological patterns. To address this, this thesis presents WideCDC (Chapter 4). WideCDC improves the deduplication efficiency of hashless CDC algorithms by basing chunk boundary decisions on wide regions of multiple bytes instead of single byte values. To achieve high throughput, WideCDC uses vector-compatible Accumulated Extreme Byte Searches and Gated Range Scans. WideCDC improves deduplication efficiency on pathological datasets by 2.95× and further improves throughput by 2.04× over VectorCDC. Additionally, to address the throughput degradation of CDC algorithms at the large chunk sizes favored by production systems, this thesis presents SeqCDC (Chapter 5). SeqCDC is a novel chunking algorithm that combines a lightweight boundary detection mechanism, content-defined data skipping, and a vector instruction-focused design. SeqCDC improves chunking throughput by 10× over unaccelerated algorithms and by 25–30% over the fastest vector-accelerated alternatives, while minimally affecting deduplication efficiency. Finally, this thesis proposes Draconis (Chapter 6), a network-accelerated scheduler built using P4 programmable switches and designed to support microsecond-scale workloads. Draconis forgoes the inefficient design adopted by prior switch-based schedulers by implementing a switch-compatible task queue with delayed pointer correction, eliminating the latency penalties caused by node-level blocking. Evaluation results demonstrate that Draconis reduces the 99th percentile scheduling delay by 3×–200× over state-of-the-art network-accelerated solutions, and increases scheduling throughput by 52×–116× over state-of-the-art server-based solutions.
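
For readers unfamiliar with hashless content-defined chunking, the following is a minimal scalar sketch in the style of Asymmetric Extremum (AE) chunking, a published hashless CDC approach. It is not VectorCDC, WideCDC, or SeqCDC, and it uses no vector instructions. The function names and the `window` parameter are illustrative assumptions, not taken from the thesis. The sketch only shows how a chunk boundary can be chosen from byte comparisons alone, without a rolling hash.

```python
from typing import List


def ae_style_boundaries(data: bytes, window: int = 16) -> List[int]:
    """Return exclusive end offsets of chunks (illustrative AE-style rule).

    Within each chunk, track the position of the running maximum byte.
    A boundary is declared once `window` bytes have passed after that
    maximum without any strictly larger byte appearing.
    """
    boundaries: List[int] = []
    start = 0
    n = len(data)
    while start < n:
        max_val = data[start]
        max_pos = start
        cut = n  # default: the final chunk runs to the end of the data
        i = start + 1
        while i < n:
            if data[i] > max_val:
                # New extreme byte: restart the window from here.
                max_val = data[i]
                max_pos = i
            elif i == max_pos + window:
                # No larger byte seen for `window` positions: cut after i.
                cut = i + 1
                break
            i += 1
        boundaries.append(cut)
        start = cut
    return boundaries


def split_chunks(data: bytes, window: int = 16) -> List[bytes]:
    """Split `data` into chunks using ae_style_boundaries."""
    chunks: List[bytes] = []
    prev = 0
    for end in ae_style_boundaries(data, window):
        chunks.append(data[prev:end])
        prev = end
    return chunks
```

The inner loop is a running-maximum scan over raw bytes. It is a search of this general kind that the abstract describes restructuring for vector instructions; that restructuring is not shown here.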