A rate-distortion framework for lossy compression of transformer representations yields substantial bitrate savings on language tasks while preserving accuracy, with observed rates aligning to derived information-theoretic bounds.
Xu, et al., A survey of resource-efficient LLM and multimodal foun- dation models, arXiv preprint arXiv:2401.08092 (2024)
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.
A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.
citing papers explorer
-
Rate-Distortion Optimization for Transformer Inference
A rate-distortion framework for lossy compression of transformer representations yields substantial bitrate savings on language tasks while preserving accuracy, with observed rates aligning to derived information-theoretic bounds.
-
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
-
Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code
Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.
-
Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.