PVF:Understanding AI Vulnerability Against SDCs
read the original abstract
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDC), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of the escalating threat, it is crucial to address key questions: How vulnerable are AI models to parameter corruptions, and how do different components (such as modules, layers) of the models exhibit varying vulnerabilities to parameter corruptions? To systematically address this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT), while presenting an in-depth vulnerability analysis on DLRM. PVF has been a critical metric used for making key error management design decisions in productionizing Meta's in-house AI chip - MTIA.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors
FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.