Llava-mini: Efficient image and video large multimodal models with one vision token

URL https://arxiv · 2025 · arXiv 2501.03895

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

cs.CV · 2026-05-07 · conditional · novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

cs.CV · 2026-05-20 · conditional · novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

cs.CV · 2026-04-20 · conditional · novelty 6.0

Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and supports streaming via a detachable KV-cache.

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 6.0 · 2 refs

SVD-Prune selects vision tokens via SVD leverage scores to outperform attention-based pruning at extreme budgets of 32 or 16 tokens.

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

cs.CV · 2025-08-08 · unverdicted · novelty 6.0

Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

cs.AI · 2025-09-11 · unverdicted · novelty 5.0

A modular multimodal generative AI framework produces synthetic residential building data from public sources, with reported overlaps exceeding 65% against a national reference dataset.

citing papers explorer

Showing 9 of 9 citing papers.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
cs.CV · 2026-05-07 · conditional · none · ref 55

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV · 2026-05-20 · conditional · none · ref 25

OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL · 2026-05-17 · unverdicted · none · ref 65

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
cs.LG · 2026-05-07 · unverdicted · none · ref 55

Geometry-Guided 3D Visual Token Pruning for Video-Language Models
cs.CV · 2026-04-20 · conditional · none · ref 45

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV · 2026-04-13 · unverdicted · none · ref 115

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models
cs.CV · 2026-04-13 · unverdicted · none · ref 8 · 2 links

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
cs.CV · 2025-08-08 · unverdicted · none · ref 21

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity
cs.AI · 2025-09-11 · unverdicted · none · ref 36
