pith. sign in

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it
abstract

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.

citation-role summary

background 1

citation-polarity summary

fields

cs.CV 4

years

2026 4

verdicts

UNVERDICTED 4

roles

background 1

polarities

background 1

representative citing papers

Neutral-Reference Prompting for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.

Towards Joint Quantization and Token Pruning of Vision-Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens versus 94.3% for stage-wise baselines.

citing papers explorer

Showing 4 of 4 citing papers.