hub Canonical reference

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee · 2024

Canonical reference. 100% of citing Pith papers cite this work as background.

13 Pith papers citing it

Background 100% of classified citations

browse 13 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

cs.AI · 2024-07-06 · conditional · novelty 6.0

LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.

Long Context Transfer from Language to Vision

cs.CV · 2024-06-24 · unverdicted · novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

SGLang: Efficient Execution of Structured Language Model Programs

cs.AI · 2023-12-12 · conditional · novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

cs.CV · 2023-05-13 · accept · novelty 6.0

OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

cs.CV · 2025-10-24 · unverdicted · novelty 5.0

NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

cs.CV · 2024-04-25 · conditional · novelty 5.0

A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.

ZAYA1-VL-8B Technical Report

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

citing papers explorer

Showing 13 of 13 citing papers.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? cs.AI · 2024-07-01 · accept · none · ref 39
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 35
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 27
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models cs.CV · 2026-05-10 · unverdicted · none · ref 11
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? cs.CV · 2026-05-09 · unverdicted · none · ref 24
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts cs.AI · 2024-07-06 · conditional · none · ref 58
LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.
Long Context Transfer from Language to Vision cs.CV · 2024-06-24 · unverdicted · none · ref 48
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 28
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models cs.CV · 2023-05-13 · accept · none · ref 71
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 16
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation cs.CV · 2025-10-24 · unverdicted · none · ref 24
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning cs.CV · 2024-04-25 · conditional · none · ref 28
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 24
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer