DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

Bo Bao; Eric Sather; Sai Qian Zhang; Tianhua Xia; Vithursan Thangarasa; Yunhai Hu; Zining Liu

arxiv: 2606.00535 · v1 · pith:Q2Q7VXUWnew · submitted 2026-05-30 · 💻 cs.LG

Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

Pith reviewed 2026-06-28 18:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords speculative decoding, vision-language models, neural architecture search, multimodal generation, autoregressive decoding, draft model, model acceleration, attention entropy

The pith

DREAM-S uses neural architecture search to automatically optimize draft models and their interactions with target models for faster speculative decoding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DREAM-S as a speculative decoding framework built specifically for vision-language models. It relies on a neural architecture search process with target-aware supernet training to discover both the best interaction pattern between a smaller draft model and the main target model and the draft architecture that fits the target hardware. An attention-entropy-guided distillation step then trains the chosen draft efficiently. If the approach works as described, generation speed increases without separate hand-tuning for each model or platform. Experiments report speedups reaching 3.85 times over ordinary decoding and better results than prior speculative decoding methods across several established VLMs.

Core claim

DREAM-S is a speculative decoding framework for VLMs. It uses NAS with target-aware supernet training to automatically determine both the interaction strategy between draft and target models and a suitable draft architecture, and it trains the draft with adaptive intermediate feature distillation guided by attention entropy. The reported result is up to 3.85× speedup over standard decoding and better performance than existing SD baselines.
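
For readers new to speculative decoding, the generic draft-then-verify loop that frameworks like this build on can be sketched as follows. This is a minimal greedy sketch, not the DREAM-S implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, the models are toy functions over integer tokens, and the per-token target calls stand in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-then-verify loop (assumes k >= 1)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The cheap draft proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target verifies the proposal position by position.
        #    We always append the target's own token, so the output is
        #    identical to target-only greedy decoding.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every drafted token was accepted: the verification pass
            # yields one extra token from the target for free.
            if len(out) < goal:
                out.append(target(out))
    return out


# Toy models (hypothetical): the target counts upward mod 10; the draft
# agrees except when the context length is a multiple of 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10
```

The point of the sketch is the invariant: the draft only affects how many tokens are committed per verification round, never which tokens are produced. The speedup then depends on how often the draft agrees with the target and how cheap the draft is, which is what the searched architecture and interaction strategy are meant to improve.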

What carries the argument

The neural architecture search framework with target-aware supernet training that identifies optimal draft-target interaction strategies and draft architectures for given hardware.

If this is right

Speculative decoding becomes practical for VLMs without manual design of draft models or interaction rules.
Draft models can be trained more efficiently through attention-entropy-guided distillation from the target.
The same search process adapts the method to different hardware platforms automatically.
Generation latency drops while output quality remains comparable to the original target model.
Existing VLM inference pipelines can incorporate the framework with measurable speed gains over prior SD techniques.
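
The attention-entropy signal mentioned above can be illustrated with a small sketch. It computes the Shannon entropy of attention rows and averages it per layer; how DREAM-S turns this quantity into a distillation schedule is not stated in the abstract, so no selection rule is shown. The function names and the nested-list layout (`attn[row][key]`) are assumptions made for illustration.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one attention row.

    Assumes `weights` is a probability distribution over keys
    (non-negative, summing to 1), e.g. one softmax output row.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_layer_entropy(attn: List[List[float]]) -> float:
    """Average entropy over all query rows of one layer's attention map."""
    return sum(attention_entropy(row) for row in attn) / len(attn)


# A sharply peaked row has low entropy; a uniform row has the maximum, log(n).
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
layer = [peaked, uniform]
print(attention_entropy(peaked) < attention_entropy(uniform))
print(mean_layer_entropy(layer))
```

Low entropy means a query attends to a few keys; high entropy means attention is spread out. A per-layer profile of this kind is what Figure 4 of the paper appears to compare across layers.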

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The search-based approach could be applied to other autoregressive multimodal tasks such as video or audio generation.
Hardware-specific drafts found this way might also reduce memory footprint during inference.
Combining the method with quantization or other compression steps could produce further cumulative speedups.
If the search cost stays low, the framework might support on-device adaptation when new VLMs are deployed.

Load-bearing premise

The neural architecture search process can reliably locate the single best interaction strategy and draft architecture for any target model and hardware platform.

What would settle it

Run the full search, then measure generation time on a VLM and hardware pair not included in the original search. If the resulting draft produces no speedup, or lower output quality than standard decoding, the premise fails.
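
A back-of-the-envelope model helps interpret such a measurement. The sketch below uses the standard analysis from the general speculative decoding literature, not anything specific to DREAM-S: it assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs `cost_ratio` times one target step. All three parameter names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target verification pass.

    Under the i.i.d. acceptance assumption this is the geometric sum
    1 + alpha + ... + alpha**gamma = (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-time speedup over plain autoregressive decoding.

    One round costs gamma draft steps plus one target pass, measured in
    units of a single target step.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# A draft that is never accepted makes decoding slower, not faster.
print(estimated_speedup(alpha=0.0, gamma=4, cost_ratio=0.25))
# A well-aligned, cheap draft gives a speedup above 1.
print(estimated_speedup(alpha=0.8, gamma=4, cost_ratio=0.05))
```

Under this model, a searched draft yields no speedup on an unseen VLM and hardware pair when its acceptance rate drops or its relative cost rises enough to push the ratio to 1 or below.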

Figures

Figures reproduced from arXiv: 2606.00535 by Bo Bao, Eric Sather, Sai Qian Zhang, Tianhua Xia, Vithursan Thangarasa, Yunhai Hu, Zining Liu.

**Figure 2.** DREAM-S framework overview. (a) Two-Phase Training: supernet training followed by subnetwork …

**Figure 3.** (a) FLOPs of the selected draft models. (b) DREAM-S performance in various NAS settings. (c) …

**Figure 4.** Layer-wise comparison of entropy, ∆entropy, and their sum

Original abstract

Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored. We propose DREAM-S, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models, and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a 3.85× speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM-S.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DREAM-S adds NAS for draft selection and target-aware training to speculative decoding for VLMs, but the abstract gives no ablations to show the search actually drives the gains over simpler distillation.

The core idea is applying speculative decoding to vision-language models with a NAS loop that searches both draft architectures and how they interact with the target, plus attention-entropy distillation to train the drafts. That combination is new relative to the SD papers cited.

What stands out is the focus on VLMs rather than text-only LLMs, the public code release, and the reported 3.85× speedup over standard decoding. Those numbers, if they hold across the tested models, would be useful for anyone trying to run these models faster on edge or varied hardware.

The soft spot is exactly the one the stress-test flags: nothing in the abstract isolates whether the NAS component finds meaningfully better drafts than fixed or random choices, or whether the gains mostly come from the distillation step. Without ablations that hold training budget fixed and compare searched versus hand-designed drafts, it is hard to know if the searchable part is load-bearing. The target-aware supernet training sounds sensible on paper, but the abstract does not show how well the supernet proxies the full VLM across vision and language tokens.

This is for researchers working on inference acceleration for multimodal models who already know the SD literature. A reader who wants to try the method on their own hardware would get value from the code and the high-level recipe. The work is coherent enough on its own terms to deserve a serious referee who can check the experimental controls and the actual contribution of the search.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DREAM-S, a speculative decoding framework for vision-language models (VLMs) that uses a neural architecture search (NAS) approach with target-aware supernet training to automatically discover optimal draft-model architectures and draft-target interaction strategies. It further incorporates adaptive intermediate feature distillation guided by attention entropy. Experiments on established VLMs report up to 3.85× speedup versus standard autoregressive decoding and consistent outperformance over prior speculative-decoding baselines, with code released publicly.

Significance. If the NAS component is shown to reliably identify superior drafts and interaction patterns beyond what fixed or hand-crafted designs achieve under comparable training budgets, the work would provide a practical, hardware-aware method for accelerating multimodal generation. The public code release is a clear strength that supports reproducibility.

major comments (2)

[Experiments section (results and ablations)] The headline claims of speedup and outperformance rest on the NAS framework with target-aware supernet training discovering better draft architectures and interaction strategies than existing SD baselines. No ablation is described that isolates the searchable-drafting component (e.g., searched vs. hand-crafted or random drafts) while holding the attention-entropy distillation and training budget fixed. Without such a comparison, gains cannot be confidently attributed to the NAS rather than the distillation alone.
[Method section (target-aware supernet training)] The target-aware supernet is asserted to serve as a faithful proxy for the full target VLM across both vision and language tokens. The manuscript should report quantitative validation of this proxy (e.g., correlation between supernet-predicted and target-model acceptance rates or token-level fidelity metrics) to substantiate that the search space exploration is meaningful for multimodal inputs.

minor comments (2)

[Abstract / Experiments] The abstract states results on “a range of well-established VLMs” but does not name the specific models, datasets, or hardware platforms used; these details should appear in the first paragraph of the experimental section for immediate clarity.
[Method section] Notation for draft-target interaction strategies (e.g., how many tokens are drafted per step or how verification is performed) should be introduced with a compact table or diagram early in the method section to aid readers unfamiliar with VLM-specific SD variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions of the NAS component and the validity of the supernet proxy. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental evidence and validation.

Referee: [Experiments section (results and ablations)] The headline claims of speedup and outperformance rest on the NAS framework with target-aware supernet training discovering better draft architectures and interaction strategies than existing SD baselines. No ablation is described that isolates the searchable-drafting component (e.g., searched vs. hand-crafted or random drafts) while holding the attention-entropy distillation and training budget fixed. Without such a comparison, gains cannot be confidently attributed to the NAS rather than the distillation alone.

Authors: We agree that an explicit ablation isolating the searchable-drafting component, while holding the attention-entropy-guided distillation and training budget fixed, would strengthen attribution of gains to the NAS framework. In the revised manuscript we will add experiments comparing searched draft architectures and interaction strategies against both hand-crafted baselines and randomly sampled drafts under identical distillation settings and compute budgets. These results will be reported alongside the existing comparisons to prior SD methods.
Revision: yes

Referee: [Method section (target-aware supernet training)] The target-aware supernet is asserted to serve as a faithful proxy for the full target VLM across both vision and language tokens. The manuscript should report quantitative validation of this proxy (e.g., correlation between supernet-predicted and target-model acceptance rates or token-level fidelity metrics) to substantiate that the search space exploration is meaningful for multimodal inputs.

Authors: We concur that quantitative validation of the supernet as a proxy would better substantiate the search process for multimodal inputs. In the revision we will include additional metrics, such as Pearson correlation between supernet-predicted and target-model acceptance rates as well as token-level fidelity measures (e.g., feature similarity on vision and language tokens), to demonstrate the proxy's reliability.
Revision: yes
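
The proxy check proposed in this exchange amounts to a correlation between two lists of acceptance rates, one per candidate draft. A minimal sketch follows, assuming each candidate has a supernet-predicted rate and a rate measured against the full target; the function name and the example numbers are hypothetical and are not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up acceptance rates for five candidate drafts (illustration only).
supernet_predicted = [0.52, 0.61, 0.58, 0.70, 0.66]
target_measured = [0.50, 0.63, 0.55, 0.72, 0.64]
print(pearson(supernet_predicted, target_measured))
```

A coefficient near 1 would suggest the supernet ranks candidates in roughly the same way the full target does. Because the search only needs the ranking to be right, a rank correlation such as Spearman's would be a reasonable complement.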

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experiments, not self-referential definitions or fits

The paper presents DREAM-S as a new NAS-based speculative decoding framework for VLMs, with claims of speedup supported by experiments on established models. No equations, fitted parameters, or self-citations appear in the abstract or described text that would reduce the performance claims to inputs by construction. The NAS component and distillation are presented as methodological contributions whose value is evaluated externally via benchmarks, keeping the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; all ledger entries are therefore empty.

pith-pipeline@v0.9.1-grok · 5724 in / 1057 out tokens · 25004 ms · 2026-06-28T18:54:18.844506+00:00 · methodology

