PVF: Understanding AI Vulnerability Against SDCs

Xun Jiao, Fred Lin, Harish D. Dixit, Joel Coburn, Sajin Nair, Abhinav Pandey, Han Wang, Venkat Ramesh, Jianyu Huang, Daniel Moore, Sriram Sankar

arxiv: 2405.01741 · v4 · pith:26SVSETWnew · submitted 2024-05-02 · cs.CR · cs.AI · cs.AR · cs.LG


keywords: parameter, model, vulnerability, corruptions, models, address, classification, dlrm

Abstract

Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDCs), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of this escalating threat, it is crucial to address two key questions: How vulnerable are AI models to parameter corruptions, and how do different components of the models (such as modules and layers) exhibit varying vulnerabilities to parameter corruptions? To systematically address these questions, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT) -- while presenting an in-depth vulnerability analysis on DLRM. PVF has been a critical metric used for making key error management design decisions in productionizing Meta's in-house AI chip, MTIA.
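The PVF definition above (the probability that a corruption in a given parameter produces an incorrect output) can be approximated empirically as the fraction of fault-injection trials whose output is wrong. The following is a minimal sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the toy linear classifier, the single-bit-flip fault model on float32 weights, the choice to count an output as "incorrect" when the predicted class differs from the fault-free prediction, and the hypothetical helper names `flip_bit`, `predict`, and `estimate_pvf`.

```python
import numpy as np


def flip_bit(value, bit):
    """Return `value` as float32 with one bit flipped (0 = LSB, 31 = sign)."""
    raw = np.array([value], dtype=np.float32).view(np.uint32)
    raw ^= np.uint32(1 << bit)
    return raw.view(np.float32)[0]


def predict(weights, inputs):
    """Toy linear classifier (assumed model): argmax over class scores."""
    return np.argmax(inputs @ weights, axis=1)


def estimate_pvf(weights, inputs, index, trials, rng):
    """Monte Carlo estimate of PVF for the single parameter weights[index].

    Assumption: an output is 'incorrect' if the predicted class differs
    from the prediction of the uncorrupted model on the same input.
    """
    clean = predict(weights, inputs)
    mismatches = 0
    for _ in range(trials):
        corrupted = weights.copy()
        bit = int(rng.integers(0, 32))  # uniformly random bit position
        corrupted[index] = flip_bit(corrupted[index], bit)
        # Bit flips can yield inf/NaN; silence the resulting warnings.
        with np.errstate(all="ignore"):
            faulty = predict(corrupted, inputs)
        mismatches += int(np.sum(faulty != clean))
    return mismatches / (trials * len(inputs))


# Small demo on random data (illustrative only).
rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 3)).astype(np.float32)
inputs = rng.standard_normal((64, 8)).astype(np.float32)
pvf = estimate_pvf(weights, inputs, (0, 0), trials=200, rng=rng)
print(f"estimated PVF of weights[0, 0]: {pvf:.3f}")
```

Repeating the estimate over every parameter in a layer or module, then averaging, would give a component-level view of vulnerability under the same assumed fault model.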


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors
cs.AR · 2026-05 · unverdicted · novelty 7.0

FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.