Xu, et al., A survey of resource-efficient LLM and multimodal foundation models, arXiv preprint arXiv:2401.08092 (2024)

Xu, M · 2024 · arXiv 2401.08092

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Rate-Distortion Optimization for Transformer Inference

cs.LG · 2026-01-29 · unverdicted · novelty 5.0

A rate-distortion framework for lossy compression of transformer representations yields substantial bitrate savings on language tasks while preserving accuracy, with observed rates aligning to derived information-theoretic bounds.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

cs.PF · 2025-08-22 · unverdicted · novelty 5.0

ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.

Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code

cs.SE · 2025-08-05 · unverdicted · novelty 5.0

Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

cs.AI · 2026-04-20 · unverdicted · novelty 4.0

A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

cs.LG · 2026-04-23 · unverdicted · novelty 2.0

The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.

citing papers explorer

Rate-Distortion Optimization for Transformer Inference
cs.LG · 2026-01-29 · unverdicted · none · ref 12

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
cs.PF · 2025-08-22 · unverdicted · none · ref 69

Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code
cs.SE · 2025-08-05 · unverdicted · none · ref 37

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
cs.AI · 2026-04-20 · unverdicted · none · ref 14

A Survey on Efficient Inference for Large Language Models
cs.CL · 2024-04-22 · accept · none · ref 24

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
cs.LG · 2026-04-23 · unverdicted · none · ref 2
