hub Baseline reference

Ovis2.5 Technical Report

Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma · 2025 · cs.CV · arXiv 2508.11737

Baseline reference. 62% of citing Pith papers use this work as a benchmark or comparison.

25 Pith papers citing it

Baseline 62% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 8 background 4 other 1

citation-polarity summary

baseline 8 background 4 unclear 1

representative citing papers

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

cs.CL · 2026-04-30 · unverdicted · novelty 7.0

TopBench is a new benchmark exposing that LLMs default to table lookups instead of intent-aware predictive reasoning on tabular data, with intent disambiguation as a key prerequisite.

FCMBench-Video: Benchmarking Document Video Intelligence

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

cs.RO · 2026-02-09 · unverdicted · novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

cs.CV · 2026-02-04 · conditional · novelty 7.0

VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

cs.CV · 2026-04-04 · unverdicted · novelty 6.0

ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning

cs.RO · 2026-03-12 · unverdicted · novelty 6.0

BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

cs.LG · 2026-02-20 · conditional · novelty 6.0 · 2 refs

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

cs.LG · 2025-09-16 · unverdicted · novelty 5.0

An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.

iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

cs.CV · 2026-05-19 · unverdicted · novelty 4.0

iDiff is a dual-branch framework with an Answer Model for robust pairwise preference prediction via view decomposition and ensembles, and a Thinking Model for structured rationale generation using templates and answer-aware supervision, winning first place in the NTIRE 2026 RAIM challenge.

OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

cs.CV · 2026-02-05 · unverdicted · novelty 4.0

OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

citing papers explorer

Showing 25 of 25 citing papers.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 53 · internal anchor
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
From Table to Cell: Attention for Better Reasoning with TABALIGN cs.AI · 2026-05-14 · unverdicted · none · ref 41 · internal anchor
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation cs.CL · 2026-05-13 · unverdicted · none · ref 35 · internal anchor
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 75 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CV · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering cs.CL · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
TopBench is a new benchmark exposing that LLMs default to table lookups instead of intent-aware predictive reasoning on tabular data, with intent disambiguation as a key prerequisite.
FCMBench-Video: Benchmarking Document Video Intelligence cs.CV · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale cs.CV · 2026-04-06 · unverdicted · none · ref 23 · internal anchor
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 73 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? cs.CV · 2026-02-04 · conditional · none · ref 10 · internal anchor
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs cs.AI · 2026-05-22 · unverdicted · none · ref 18 · internal anchor
VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 59 · internal anchor
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning cs.CV · 2026-05-08 · unverdicted · none · ref 29 · internal anchor
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 58 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 52 · internal anchor
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 12 · internal anchor
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs cs.CV · 2026-04-04 · unverdicted · none · ref 32 · internal anchor
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning cs.RO · 2026-03-12 · unverdicted · none · ref 27 · internal anchor
BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs? cs.LG · 2026-02-20 · conditional · none · ref 46 · 2 links · internal anchor
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks cs.AI · 2026-05-11 · unverdicted · none · ref 114 · internal anchor
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe cs.LG · 2025-09-16 · unverdicted · none · ref 8 · internal anchor
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment cs.CV · 2026-05-19 · unverdicted · none · ref 14 · internal anchor
iDiff is a dual-branch framework with an Answer Model for robust pairwise preference prediction via view decomposition and ensembles, and a Thinking Model for structured rationale generation using templates and answer-aware supervision, winning first place in the NTIRE 2026 RAIM challenge.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 53 · internal anchor
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CV · 2026-04-01 · unreviewed · ref 29 · internal anchor

Ovis2.5 Technical Report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer