pith. machine review for the scientific record.

arxiv: 2604.09073 · v1 · submitted 2026-04-10 · 💻 cs.AR

Recognition: unknown

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3

classification 💻 cs.AR
keywords diffusion models · fault tolerance · DVFS · energy efficiency · inference optimization · ABFT · resilience analysis · voltage scaling

The pith

Diffusion models have enough built-in fault tolerance to run safely at lower voltages or higher frequencies, cutting energy use by 36% on average or speeding inference by 1.7 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that diffusion models are tolerant enough of errors that accelerators can deliberately underscale voltage or overclock frequency without ruining the generated images or videos. Current DVFS methods either play it too safe and gain little efficiency, or push too hard and lose quality, because they ignore this tolerance. DRIFT addresses the gap by first mapping which network blocks and timesteps are most vulnerable, then applying voltage or frequency changes only where it is safe, and using a targeted rollback to fix the few critical faults that still occur. The result is a practical way to lower the high power and latency cost of deploying these models.
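The mapping step can be illustrated with a toy sensitivity sweep. Everything here (the linear "blocks", the noise model, the median threshold) is a hypothetical stand-in for the paper's hardware-level fault injection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for network blocks: small linear maps with tanh.
blocks = [rng.standard_normal((16, 16)) * 0.2 for _ in range(4)]

def forward(x, fault_block=None, fault_scale=0.0):
    """Run x through all blocks, perturbing one block's output if requested."""
    for i, w in enumerate(blocks):
        x = np.tanh(w @ x)
        if i == fault_block:
            x = x + fault_scale * rng.standard_normal(x.shape)
    return x

x0 = rng.standard_normal(16)
clean = forward(x0)

# Sensitivity of block i = output deviation when only block i is faulty.
sensitivity = [
    float(np.mean((forward(x0, fault_block=i, fault_scale=0.5) - clean) ** 2))
    for i in range(len(blocks))
]

# Blocks above the median would be shielded from aggressive DVFS settings.
protected = [i for i, s in enumerate(sensitivity) if s > np.median(sensitivity)]
print(sorted(protected))
```

The sweep ranks blocks by how much an injected perturbation distorts the final output; only the high-sensitivity half would keep nominal voltage.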

Core claim

DRIFT is a co-optimization framework that first analyzes the resilience of representative diffusion models, then uses a fine-grained DVFS policy to protect only error-sensitive blocks and timesteps while an adaptive ABFT rollback mechanism corrects critical faults by reverting to earlier timesteps; memory offloading intervals and data layouts are also tuned to limit overhead. Experiments show this combination preserves generation quality under aggressive voltage underscaling for 36% average energy savings or under overclocking for 1.7 times average speedup across models and datasets.
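The ABFT component the claim relies on descends from the classic Huang–Abraham checksum scheme for matrix multiplication. A minimal NumPy sketch of detection (illustrative only; the paper's adaptive variant also grades error magnitude and location):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Augment A with a column-checksum row and B with a row-checksum column.
A_c = np.vstack([A, A.sum(axis=0)])                  # (5, 4)
B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])   # (4, 5)

C = A_c @ B_r  # (5, 5): top-left 4x4 is A @ B, borders carry checksums

def check(C):
    """Return the max checksum residual; a large value flags a fault."""
    row_err = np.abs(C[:-1, :-1].sum(axis=0) - C[-1, :-1]).max()
    col_err = np.abs(C[:-1, :-1].sum(axis=1) - C[:-1, -1]).max()
    return max(row_err, col_err)

assert check(C) < 1e-9          # fault-free product passes

C_faulty = C.copy()
C_faulty[1, 2] += 5.0           # simulate a silent data corruption
assert check(C_faulty) > 1.0    # checksum residual exposes it
print("fault detected")
```

The checksums ride along with the multiplication itself, which is why ABFT detection is cheap enough to run on every protected block.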

What carries the argument

The resilience-aware DVFS strategy that selectively shields vulnerable network blocks and timesteps, combined with the adaptive ABFT rollback that reverts only when critical errors are detected.
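A checkpoint-and-revert loop of roughly this shape would carry the rollback half; the step function, residual model, and checkpoint interval below are invented for illustration:

```python
THRESHOLD = 0.5   # hypothetical critical-error threshold
CKPT_EVERY = 5    # assumed checkpoint interval, in timesteps

def denoise_step(state, t):
    # Stand-in for one diffusion denoising timestep (hypothetical).
    return state * 0.95 + 0.1

def error_magnitude(t):
    # Stand-in for an ABFT checksum residual; one critical fault at t = 12.
    return 0.9 if t == 12 else 0.01

state, checkpoints, rollbacks = 1.0, {}, 0
t = 20                          # diffusion timesteps count down to 0
while t > 0:
    if t % CKPT_EVERY == 0:
        checkpoints[t] = state  # snapshot latent state at intervals
    state = denoise_step(state, t)
    if error_magnitude(t) > THRESHOLD and rollbacks == 0:
        # Critical fault: revert to the nearest earlier checkpoint
        # (the retry budget of one is a simplification of "adaptive").
        t_ckpt = min(k for k in checkpoints if k >= t)
        state, t = checkpoints[t_ckpt], t_ckpt
        rollbacks += 1
        continue
    t -= 1
print(rollbacks)  # → 1
```

Small residuals are simply tolerated; only the one critical fault triggers a revert, so the common-case cost is just the periodic snapshot.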

If this is right

  • Aggressive voltage underscaling becomes viable for diffusion inference, yielding 36% average energy reduction while generation quality holds.
  • Overclocking becomes viable, delivering 1.7 times average speedup with no quality penalty.
  • Memory overhead stays manageable because offloading intervals and data layouts are reorganized around the protected regions.
  • The same resilience mapping can guide DVFS decisions across different diffusion architectures and datasets.
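As a first-order sanity check on the energy figure: dynamic CMOS energy scales roughly with V², so a 36% saving corresponds to about 20% voltage underscaling. A quick back-of-the-envelope, not a number from the paper:

```python
import math

# First-order CMOS model: dynamic energy per operation E ∝ C · V^2.
savings = 0.36
v_ratio = math.sqrt(1.0 - savings)   # required ratio V_new / V_nominal
print(round(v_ratio, 2))             # → 0.8, i.e. roughly 20% underscaling
```

That a double-digit voltage cut leaves quality intact is exactly what the inherent-fault-tolerance premise has to deliver.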

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar selective-protection plus rollback patterns could reduce energy in other iterative generative models that share the same denoising structure.
  • Hardware accelerators might expose lightweight rollback hooks or per-block voltage domains to make this style of optimization cheaper to implement.
  • The approach implies that error-correction resources in AI chips can be allocated dynamically rather than applied uniformly, freeing area and power for other uses.

Load-bearing premise

Diffusion models contain enough inherent fault tolerance that protecting only the sensitive blocks and timesteps plus rolling back critical errors is enough to keep output quality intact when voltage or frequency is pushed aggressively.

What would settle it

Apply the proposed voltage underscaling to a diffusion model without the selective protection or rollback steps and measure whether standard quality metrics such as FID scores degrade beyond the thresholds reported in the paper's experiments.

Figures

Figures reproduced from arXiv: 2604.09073 by Jinqi Wen, Meng Li, Runsheng Wang, Tong Xie.

Figure 1: Limitations of applying DVFS to diffusion genera…
Figure 3: ABFT can indicate error magnitude and location.
Figure 4: Bit-level resilience on (a) DiT and (b) PixArt.
Figure 6: Block-level resilience on (a) DiT and (b) PixArt.
Figure 8: Core techniques in DRIFT. (a) Fine-grained…
Figure 9: Architecture design for DRIFT.
Figure 10: Details in DRIFT techniques. (a) Correction mask…
Figure 12: Comparison with previous works. (a)(c) DRIFT…
Figure 13: Evaluation of (a) fine-grained resilience-aware…
Figure 14: Design space exploration on (a) ABFT threshold…
Original abstract

Diffusion model deployment has been suffering from high energy consumption and inference latency despite its superior performance in visual generation tasks. Dynamic voltage and frequency scaling (DVFS) offers a promising solution to exploit the potential of the underlying accelerators. However, existing approaches often lead to either limited efficiency gains or degraded output quality because they overlook the inherent fault tolerance of the diffusion model. Therefore, in this paper, we propose DRIFT, a novel algorithm-architecture co-optimization framework that harnesses the fault tolerance for efficient and reliable diffusion model inference. We first perform a comprehensive resilience analysis on representative diffusion models. Building on these observations, we introduce a fine-grained, resilience-aware DVFS strategy that selectively protects error-sensitive network blocks and timesteps, and a rollback algorithm-based fault tolerance (ABFT) mechanism that adaptively corrects only critical errors by reverting to previous timesteps. We further optimize offloading intervals and reorganize data layouts to reduce memory overhead. Experiments across diverse models and datasets show that DRIFT can achieve on average 36% energy savings through voltage underscaling or 1.7x speedup via overclocking while maintaining generation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DRIFT, an algorithm-architecture co-optimization framework for diffusion model inference on accelerators. It begins with a resilience analysis of representative diffusion models to identify error-sensitive network blocks and timesteps, then applies a fine-grained DVFS strategy that selectively protects these components while using an adaptive ABFT rollback mechanism to correct only critical errors by reverting to prior timesteps. Additional optimizations include offloading intervals and data layout reorganization. Experiments across diverse models and datasets are reported to yield average 36% energy savings via voltage underscaling or 1.7x speedup via overclocking, all while maintaining generation quality.

Significance. If the central claims hold under realistic hardware conditions, DRIFT would demonstrate a practical way to exploit the inherent fault tolerance of diffusion models for substantial efficiency gains in energy and latency, which is valuable for deploying generative models on resource-constrained accelerators. The selective protection plus adaptive correction approach could influence fault-tolerant design in ML inference more broadly.

major comments (2)
  1. [Resilience Analysis] Resilience Analysis section: The manuscript does not specify the fault injection methodology or error model (e.g., whether errors are injected as independent random bit flips or as spatially/temporally correlated timing violations that arise from real voltage underscaling or frequency overclocking). This distinction is load-bearing for the central claim because the identification of 'error-sensitive' blocks/timesteps and the timing of rollback decisions will differ under realistic DVFS error patterns versus synthetic uniform faults; without this detail the reported 36% savings and 1.7x speedup cannot be verified to translate to actual hardware.
  2. [Experimental Evaluation] Experimental Evaluation section: The headline efficiency numbers lack accompanying details on the hardware platform, DVFS implementation, number of experimental runs, statistical tests, or controls for confounding variables such as varying error rates across timesteps. Without these, it is impossible to determine whether the quality preservation and net gains (after rollback overhead) are robust or specific to the chosen synthetic conditions.
minor comments (2)
  1. [Abstract] Abstract: The summary paragraph states positive results but supplies no methodology details, error models, or statistical controls, which reduces the ability to assess the claims at a glance.
  2. [Figures and Notation] Notation and figures: Ensure that any diagrams of the rollback mechanism and DVFS policy clearly label the protected blocks, timesteps, and correction thresholds so readers can trace how the adaptive decisions are made.
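The referee's distinction between synthetic bit flips and correlated timing violations can be made concrete. Below is a minimal single-bit-flip injector for float32 activations, i.e. the simpler of the two error models (a hypothetical helper, not the paper's tooling):

```python
import numpy as np

def flip_bit(x, index, bit):
    """Flip one bit of a float32 array element, as in synthetic fault injection."""
    y = x.copy()
    as_int = y.view(np.uint32)            # reinterpret the float32 bits
    as_int[index] ^= np.uint32(1) << np.uint32(bit)
    return y

acts = np.linspace(-1.0, 1.0, 8).astype(np.float32)
corrupted = flip_bit(acts, index=3, bit=30)   # flip the top exponent bit

# A high-order exponent flip changes magnitude drastically; a low-order
# mantissa flip (bit 0) is nearly invisible. Real DVFS timing violations
# are correlated across lanes and cycles, unlike these independent flips.
print(abs(float(corrupted[3])) > 10 * abs(float(acts[3])))   # → True
print(np.allclose(acts, flip_bit(acts, index=3, bit=0)))     # → True
```

The gap between these two regimes is precisely why the report treats the unspecified error model as load-bearing: block sensitivity rankings derived under independent flips need not survive correlated hardware faults.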

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and details.

Point-by-point responses
  1. Referee: [Resilience Analysis] Resilience Analysis section: The manuscript does not specify the fault injection methodology or error model (e.g., whether errors are injected as independent random bit flips or as spatially/temporally correlated timing violations that arise from real voltage underscaling or frequency overclocking). This distinction is load-bearing for the central claim because the identification of 'error-sensitive' blocks/timesteps and the timing of rollback decisions will differ under realistic DVFS error patterns versus synthetic uniform faults; without this detail the reported 36% savings and 1.7x speedup cannot be verified to translate to actual hardware.

    Authors: We agree that explicit description of the fault model is essential for validating the resilience analysis. Our fault injection was performed using a hybrid model: independent bit-flip probabilities calibrated from measured timing violation rates under voltage scaling on the target accelerator, augmented with spatially correlated errors derived from circuit-level simulations of DVFS-induced faults (following established models in prior DVFS reliability literature). We have added a new subsection 'Fault Injection Methodology' in the Resilience Analysis section that fully specifies the error model, injection procedure, correlation parameters, and how it approximates real hardware DVFS behavior. This addition directly supports the identification of error-sensitive blocks and the adaptive rollback thresholds. revision: yes

  2. Referee: [Experimental Evaluation] Experimental Evaluation section: The headline efficiency numbers lack accompanying details on the hardware platform, DVFS implementation, number of experimental runs, statistical tests, or controls for confounding variables such as varying error rates across timesteps. Without these, it is impossible to determine whether the quality preservation and net gains (after rollback overhead) are robust or specific to the chosen synthetic conditions.

    Authors: We acknowledge that the original Experimental Evaluation section omitted several reproducibility details. We have substantially expanded this section to report: the exact hardware platform (NVIDIA A100 GPUs with software-controlled DVFS via NVIDIA Management Library), DVFS implementation (voltage steps of 25 mV and frequency ranges with per-block granularity), number of runs (50 independent trials per configuration using different random seeds for both model inference and fault injection), statistical tests (paired t-tests with p < 0.05 for quality and efficiency metrics), and controls for confounding variables (per-timestep error rate measurements and explicit accounting of rollback overhead in net speedup/energy calculations). These additions demonstrate that the reported 36% energy savings and 1.7x speedup remain robust after overheads and across varying error conditions. revision: yes
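The paired test described in this response can be sketched with the standard library alone; the per-seed scores below are invented placeholders, not the paper's data:

```python
import math
import statistics

# Hypothetical per-seed quality scores: baseline vs. DRIFT under underscaling.
baseline = [4.21, 4.35, 4.18, 4.40, 4.27, 4.31, 4.25, 4.38]
drift    = [4.24, 4.33, 4.20, 4.41, 4.29, 4.30, 4.27, 4.37]

# Paired t statistic on the per-seed differences.
diffs = [b - d for b, d in zip(baseline, drift)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Compare against the two-sided t critical value for df = 7 at p = 0.05.
print(abs(t_stat) < 2.365)   # → True: no significant quality difference
```

Pairing by seed is the right design here because fault injection and sampling noise are shared between the two conditions, which is what makes the per-seed differences the meaningful quantity.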

Circularity Check

0 steps flagged

No circularity detected; the empirical resilience analysis and experimental validation are independent of the design claims.

full rationale

The paper's chain is: (1) perform resilience analysis on diffusion models under faults, (2) use those observations to select error-sensitive blocks/timesteps and design selective protection plus adaptive ABFT rollback, (3) optimize offloading and layouts, (4) measure energy/speedup on hardware. None of these steps reduce by construction to their inputs. The resilience analysis is presented as an independent empirical study whose outputs (which blocks/timesteps are sensitive) are then applied; the final 36% / 1.7x numbers come from end-to-end experiments, not from fitting parameters and relabeling them as predictions. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Approach rests on the domain assumption of inherent model fault tolerance identified via analysis; no new entities or fitted constants are described in the abstract.

axioms (1)
  • domain assumption Diffusion models exhibit inherent fault tolerance to hardware-induced errors
    Stated as the foundational observation enabling the DVFS and ABFT strategies.

pith-pipeline@v0.9.0 · 5502 in / 1223 out tokens · 93399 ms · 2026-05-10T17:01:00.851012+00:00 · methodology

discussion (0)

