Struck-at Fault Tolerance with Emerging Technology RAM in the NeuroSim MLP Neural Network System

Zhang, An Qi

Files

Zhang_An-Qi.pdf (2.1 MB)

Date

2021-09-15

Authors

Zhang, An Qi

Advisor

Wei, Lan

Publisher

University of Waterloo

Abstract

After decades of technology advancements, the benefits from conventional dimensional scaling and from effective scaling techniques such as strain and high-k gate dielectrics are diminishing. In the post-Moore era, interest is gathering around emerging technologies with greater performance than silicon. One such emerging technology is Carbon Nanotube Field-Effect Transistors (CNFETs). CNFETs have the potential to offer a lower-power, higher-performance semiconductor technology compared to their silicon counterpart. However, CNFET technology, being an emerging technology that has not yet reached maturity, is still subject to high fault levels and high levels of process variation. These high fault levels make CNFET processes unsuitable for general-purpose semiconductor fabrication, since general-purpose designs require very low fault rates. Attempts to use CNFET processes have required increasing CNFET transistor sizes, defeating the purpose of finding a technology to replace silicon and continue Dennard scaling. Some success has been achieved by using these emerging technologies with fault-tolerant applications, such as neural networks in machine learning. In order to understand the resilience of neural networks to process faults, this work analyses the effect of stuck-at faults in neural networks. The NeuroSim system is used, which implements a two-layer Multi-Layer Perceptron (MLP) neural network. These two layers contain weight values which are stored in two Static Random Access Memory (SRAM) units. Stuck-at faults are applied to the two SRAM units in various patterns. The networks are then re-trained to account for the faults, and the resulting accuracy indicates the resilience of the neural network system to stuck-at faults. With the effects of the stuck-at faults understood, fault recovery techniques to mitigate those effects are proposed and evaluated.
In the worst case without any recovery technique, the network's accuracy drops from 93.77% to 23.37% at a high fault rate of 40%. The fault rate indicates the percentage of SRAM bits affected by stuck-at faults. A stuck-at fault causes an SRAM bit cell to only ever read out one value, either 0 or 1. With a recovery technique, the accuracy is improved to 88.08% at a fault rate of 40%.
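The fault model described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis or from NeuroSim: the function names `make_fault_masks` and `read_with_faults`, the 8-bit word width, the uniformly random fault placement, and the even split between stuck-at-0 and stuck-at-1 cells are all assumptions made for illustration. The fault rate is treated, as in the abstract, as the fraction of SRAM bits that are stuck.

```python
import random


def make_fault_masks(num_words, bit_width, fault_rate, seed=0):
    """Choose stuck-at cells at random (an assumed pattern, not the thesis's).

    Returns two lists of per-word bit masks: cells stuck at 0 and cells
    stuck at 1. A cell is never in both masks.
    """
    rng = random.Random(seed)
    total_bits = num_words * bit_width
    num_faulty = round(fault_rate * total_bits)
    stuck0 = [0] * num_words
    stuck1 = [0] * num_words
    # Sample distinct bit positions across the whole memory array.
    for bit in rng.sample(range(total_bits), num_faulty):
        word, pos = divmod(bit, bit_width)
        # Assumption: stuck-at-0 and stuck-at-1 are equally likely.
        if rng.random() < 0.5:
            stuck0[word] |= 1 << pos
        else:
            stuck1[word] |= 1 << pos
    return stuck0, stuck1


def read_with_faults(words, stuck0, stuck1):
    """Return what each word reads back as, given the stuck cells.

    Stuck-at-0 cells always read 0 and stuck-at-1 cells always read 1,
    regardless of the value that was written.
    """
    return [(w & ~s0) | s1 for w, s0, s1 in zip(words, stuck0, stuck1)]


if __name__ == "__main__":
    # Hypothetical example: 100 eight-bit weight words at a 40% fault rate.
    weights = [0b10101010] * 100
    s0, s1 = make_fault_masks(len(weights), bit_width=8, fault_rate=0.4, seed=1)
    faulty = read_with_faults(weights, s0, s1)
    changed = sum(1 for a, b in zip(weights, faulty) if a != b)
    print(f"{changed} of {len(weights)} words read back incorrectly")
```

In a retraining experiment of the kind the abstract describes, a read function like this would sit between the stored weights and the forward pass, so that every weight update is filtered through the same fixed fault masks.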