pith. machine review for the scientific record.

arxiv: 2508.18265 · v2 · submitted 2025-08-25 · 💻 cs.CV

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Pith reviewed 2026-05-10 11:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal models · reinforcement learning · visual resolution router · reasoning · inference efficiency · open-source AI · agentic tasks · GUI interaction

The pith

InternVL3.5 uses Cascade RL and a Visual Resolution Router to lift open-source multimodal models to new reasoning levels and faster speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InternVL 3.5 as a family of open-source multimodal models that improve on prior versions through better reasoning and more efficient operation. It applies a Cascade RL process that first trains with offline reinforcement learning for stability and then switches to online reinforcement learning for tighter task alignment. This is combined with a Visual Resolution Router that changes the detail level of visual inputs on the fly to cut computation while keeping accuracy. The result is stronger performance on complex reasoning tests, quicker inference, and the ability to handle tasks like GUI control and physical agent behaviors. If the methods deliver as claimed, they move high-capability multimodal AI into more accessible open-source territory.

Core claim

The authors establish that three components together produce up to a 16 percent gain in overall reasoning performance and a 4.05 times inference speedup over InternVL3: the Cascade Reinforcement Learning framework, with its offline stage for stable convergence and online stage for refined alignment; the Visual Resolution Router, which dynamically adjusts visual token resolution; and the Decoupled Vision-Language Deployment approach. The release also adds support for GUI interaction and embodied agency. The largest variant, InternVL3.5-241B-A28B, reaches state-of-the-art results among open-source multimodal large language models on general multimodal, reasoning, text, and agentic tasks, thereby narrowing the performance gap with leading commercial models such as GPT-5.

What carries the argument

Cascade Reinforcement Learning framework that runs offline RL followed by online RL to strengthen reasoning, paired with the Visual Resolution Router that dynamically selects visual token resolution.
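The offline-then-online ordering can be illustrated with a toy example. This page does not specify the objectives used in either stage, so the sketch below substitutes two well-known stand-ins: a DPO-style preference loss for the offline stage and a REINFORCE update with a batch-mean baseline for the online stage. The four-answer policy, the reward function, and every hyperparameter are assumptions chosen for illustration, not the authors' setup.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def offline_stage(logits, pairs, beta=0.5, lr=0.5, epochs=5):
    """Stage 1 (stand-in): DPO-style updates on a fixed list of
    (chosen, rejected) answer indices; the starting policy is the reference."""
    ref = list(logits)
    logits = list(logits)
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Under softmax, log-prob differences reduce to logit differences.
            margin = beta * ((logits[chosen] - ref[chosen])
                             - (logits[rejected] - ref[rejected]))
            # Gradient-descent step on -log(sigmoid(margin)).
            weight = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            logits[chosen] += lr * beta * weight
            logits[rejected] -= lr * beta * weight
    return logits


def online_stage(logits, reward_fn, rng, lr=0.5, steps=200, batch=8):
    """Stage 2 (stand-in): REINFORCE on samples drawn from the current policy."""
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choices(range(n), weights=probs, k=batch)
        rewards = [reward_fn(a) for a in actions]
        baseline = sum(rewards) / batch
        grad = [0.0] * n
        for a, r in zip(actions, rewards):
            advantage = r - baseline
            for i in range(n):
                # d log pi(a) / d logit_i = 1[i == a] - pi(i)
                grad[i] += advantage * ((1.0 if i == a else 0.0) - probs[i]) / batch
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return logits


def run_cascade(seed=0):
    """Return P(correct answer) at the start, after offline, and after online."""
    rng = random.Random(seed)
    correct = 2  # hypothetical index of the right answer
    start = [0.0, 0.0, 0.0, 0.0]
    pairs = [(correct, other) for other in range(4) if other != correct]
    after_offline = offline_stage(start, pairs)
    after_online = online_stage(
        after_offline, lambda a: 1.0 if a == correct else 0.0, rng)
    return (softmax(start)[correct],
            softmax(after_offline)[correct],
            softmax(after_online)[correct])
```

In this toy setting the offline stage moves the policy using only stored preference pairs, and the online stage then refines it using rewards on its own samples. Whether that ordering beats a single-stage baseline at matched compute is the question the referee report raises; the sketch does not answer it.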

If this is right

  • Substantial gains on reasoning benchmarks such as MMMU and MathVista.
  • Up to 4.05 times faster inference while maintaining performance.
  • New support for GUI interaction and embodied agency tasks.
  • State-of-the-art results among open-source models across general multimodal, reasoning, text, and agentic tasks.
  • Reduced performance gap with commercial models such as GPT-5 on the largest variant.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage RL sequence might be tried on other model families to test whether it improves reasoning outside vision-language settings.
  • Dynamic visual resolution could be combined with hardware-specific optimizations to reduce energy use in deployed systems.
  • If the efficiency methods hold at larger scales, they may allow interactive multimodal agents to run with lower latency on varied computing setups.

Load-bearing premise

The gains in reasoning performance and inference speed come mainly from the Cascade RL framework and Visual Resolution Router rather than from larger model scale or unstated data and training improvements.

What would settle it

An ablation experiment that trains a comparable model without the Cascade RL stages or the ViR component and measures whether the reported improvements on MMMU, MathVista, and inference speed still occur.
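One way to lay out such an ablation is a 2x2 grid over the two techniques with everything else held fixed. The sketch below is a hypothetical configuration helper: the field names, the `FIXED` placeholders, and the `main_effect` summary are assumptions introduced here, and no scores from the paper are used.

```python
from itertools import product

# Held constant across every arm; the values are placeholders, not paper settings.
FIXED = {"model_size": "same", "data_mixture": "same", "train_compute": "same"}


def ablation_arms():
    """Enumerate the 2x2 grid over Cascade RL and ViR."""
    return [
        {"cascade_rl": rl, "vir": vir, **FIXED}
        for rl, vir in product([True, False], repeat=2)
    ]


def main_effect(scores, factor):
    """Mean score with `factor` on minus mean score with it off.

    `scores` maps (cascade_rl, vir) tuples to a benchmark score.
    """
    index = {"cascade_rl": 0, "vir": 1}[factor]
    on = [s for key, s in scores.items() if key[index]]
    off = [s for key, s in scores.items() if not key[index]]
    return sum(on) / len(on) - sum(off) / len(off)
```

If the arm with both techniques removed still reproduced most of the reported improvement, the load-bearing premise above would be in doubt.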

Original abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
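To make the idea of routing visual tokens by resolution concrete, the sketch below counts tokens under a simple per-patch threshold rule. The abstract does not describe how ViR scores patches or which token counts it chooses between, so the scores, the threshold, and the two token budgets here are all illustrative assumptions.

```python
HIGH_RES_TOKENS = 256  # assumed token count for a patch kept at full detail
LOW_RES_TOKENS = 64    # assumed token count for a compressed patch


def route_patches(scores, threshold=0.5):
    """Give each patch the high or low token budget based on a router score.

    `scores` stands in for the output of some learned router; higher means
    the patch is judged to need more visual detail.
    """
    return [HIGH_RES_TOKENS if s >= threshold else LOW_RES_TOKENS for s in scores]


def token_savings(scores, threshold=0.5):
    """Return (routed tokens, all-high-res tokens, fraction of tokens saved)."""
    routed = sum(route_patches(scores, threshold))
    baseline = HIGH_RES_TOKENS * len(scores)
    return routed, baseline, 1 - routed / baseline
```

For example, `token_savings([0.9, 0.1, 0.2, 0.7])` keeps two of four patches at full detail and returns `(640, 1024, 0.375)`. Fewer visual tokens mean less work for the language model, which is the mechanism by which a router of this kind could reduce inference cost.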

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the InternVL3.5 family of open-source multimodal large language models (MLLMs), building on the InternVL series. It proposes a Cascade Reinforcement Learning (Cascade RL) framework with an offline RL stage for stable convergence followed by online RL for refined alignment, and a Visual Resolution Router (ViR) that dynamically selects visual token resolutions. These are paired with a Decoupled Vision-Language Deployment (DvD) strategy. The authors claim these yield up to +16% overall reasoning gains (e.g., on MMMU and MathVista) and 4.05× inference speedup versus InternVL3, enable new capabilities such as GUI interaction and embodied agency, and allow the largest InternVL3.5-241B-A28B model to reach state-of-the-art performance among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks while narrowing the gap to commercial models such as GPT-5. Models and code are publicly released.
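One way to reason about why placing the vision encoder and language model on separate GPUs could balance load is the standard two-stage pipeline timing model, sketched below. It is an idealization introduced here, not the paper's analysis: it assumes constant per-request stage times and ignores transfer cost between devices.

```python
def serial_time(n_requests, t_vision, t_language):
    """Total time when each request runs vision then language back to back."""
    return n_requests * (t_vision + t_language)


def pipelined_time(n_requests, t_vision, t_language):
    """Total time when the two stages run on separate devices and overlap.

    The first request pays both stage times; every later request adds only
    the time of the slower stage.
    """
    if n_requests == 0:
        return 0.0
    return t_vision + t_language + (n_requests - 1) * max(t_vision, t_language)
```

With hypothetical stage times of 2 and 3 units over 10 requests, the serial total is 50 and the pipelined total is 32. The benefit is largest when the two stage times are close, which is why reducing visual tokens and decoupling deployment could interact.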

Significance. If the reported gains can be shown to stem specifically from the proposed Cascade RL and ViR rather than scale, data, or compute increases, and if the experimental controls are rigorous, the work would meaningfully advance open-source MLLM capabilities in reasoning and efficiency. The public release of models and code is a clear positive that supports reproducibility and community progress.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline claims of +16% reasoning improvement and SOTA open-source results are presented without ablation studies that hold model size (241B-A28B), data mixture, and total training compute fixed while removing or replacing Cascade RL with standard RL/SFT and ViR with fixed-resolution routing. This leaves open the possibility that gains are driven by unstated increases in scale or data quality rather than the two named techniques.
  2. [§3.1] §3.1 (Cascade RL): The two-stage offline-then-online RL procedure is described at a high level but lacks concrete implementation details, reward model specifications, hyperparameter values, and direct comparisons to single-stage RL baselines under matched compute budgets, making it impossible to verify that the coarse-to-fine strategy is the load-bearing factor for the reported MMMU/MathVista lifts.
  3. [§3.2] §3.2 (ViR) and efficiency results: The dynamic resolution routing mechanism is claimed to deliver 4.05× speedup without performance loss, yet no controlled measurements (e.g., latency breakdowns, token counts per resolution choice, or comparisons against static high-resolution baselines) are supplied to isolate ViR’s contribution from the DvD deployment strategy or hardware configuration.
minor comments (2)
  1. [Abstract] The abstract states specific percentage gains and speedups but supplies no error bars, number of runs, or baseline model versions; these details should be added to the experimental tables for clarity.
  2. [Introduction] Notation for the 241B-A28B model (active vs. total parameters) is introduced without an explicit definition or diagram in the main text; a short table or footnote would improve readability.
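The footnote requested in minor comment 2 could be as small as the following helper. It reads the suffix the way the comment does, as total versus active parameters in billions; the function name and the regular expression are assumptions introduced here for illustration.

```python
import re


def parse_total_active(model_name):
    """Extract (total_B, active_B, active_fraction) from a name like '241B-A28B'."""
    match = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", model_name)
    if match is None:
        raise ValueError(f"no '<total>B-A<active>B' suffix in {model_name!r}")
    total = float(match.group(1))
    active = float(match.group(2))
    return total, active, active / total
```

Under that reading, `parse_total_active("InternVL3.5-241B-A28B")` gives 241 total and 28 active, so roughly 11.6 percent of parameters are active per token.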

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline claims of +16% reasoning improvement and SOTA open-source results are presented without ablation studies that hold model size (241B-A28B), data mixture, and total training compute fixed while removing or replacing Cascade RL with standard RL/SFT and ViR with fixed-resolution routing. This leaves open the possibility that gains are driven by unstated increases in scale or data quality rather than the two named techniques.

    Authors: We agree that more rigorous ablations would be ideal. However, performing experiments that strictly hold model size, data mixture, and total training compute fixed at the 241B scale while ablating Cascade RL and ViR is extremely resource-intensive. Instead, we have conducted ablations on smaller models and compared against our previous InternVL3 model, which was trained under similar conditions. We will expand Section 4 with additional ablation results and a discussion of these constraints in the revised manuscript. revision: partial

  2. Referee: [§3.1] §3.1 (Cascade RL): The two-stage offline-then-online RL procedure is described at a high level but lacks concrete implementation details, reward model specifications, hyperparameter values, and direct comparisons to single-stage RL baselines under matched compute budgets, making it impossible to verify that the coarse-to-fine strategy is the load-bearing factor for the reported MMMU/MathVista lifts.

    Authors: We agree that additional implementation details would enhance reproducibility. We will provide the requested concrete implementation details, including reward model specifications and hyperparameter values, along with comparisons to single-stage RL on matched budgets using smaller models, in the revised manuscript. revision: yes

  3. Referee: [§3.2] §3.2 (ViR) and efficiency results: The dynamic resolution routing mechanism is claimed to deliver 4.05× speedup without performance loss, yet no controlled measurements (e.g., latency breakdowns, token counts per resolution choice, or comparisons against static high-resolution baselines) are supplied to isolate ViR’s contribution from the DvD deployment strategy or hardware configuration.

    Authors: We acknowledge the need for more granular efficiency analysis. We will supplement the efficiency analysis in §3.2 with controlled measurements, including latency breakdowns, token counts per resolution, and comparisons to static baselines, to better isolate ViR's contribution. revision: yes
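The controlled measurements discussed in this exchange can be summarized with a few lines of analysis code. The sketch below is a hypothetical reporting helper: it computes a mean with standard error over repeated runs, a speedup ratio, and a per-stage share of latency. All function names are assumptions, and any numbers passed in would be the experimenter's own measurements.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(samples):
    """Return (mean, standard error) of repeated latency measurements."""
    if len(samples) < 2:
        return mean(samples), 0.0
    return mean(samples), stdev(samples) / sqrt(len(samples))


def speedup(baseline_samples, candidate_samples):
    """Ratio of mean baseline latency to mean candidate latency."""
    return mean(baseline_samples) / mean(candidate_samples)


def stage_shares(stage_samples):
    """Fraction of mean end-to-end latency spent in each named stage.

    `stage_samples` maps a stage name (for example 'vision' or 'language')
    to its list of per-run latencies.
    """
    stage_means = {name: mean(vals) for name, vals in stage_samples.items()}
    total = sum(stage_means.values())
    return {name: value / total for name, value in stage_means.items()}
```

Reporting `speedup` separately for a ViR-only change and a DvD-only change, each against the same static high-resolution baseline, would show how much of the combined figure each technique accounts for.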

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on external public datasets

The paper reports performance numbers for InternVL3.5 models on standard external benchmarks (MMMU, MathVista, agentic tasks) and attributes gains to the introduced Cascade RL and ViR techniques. These are empirical outcomes of training and evaluation rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that reduces the central claim to its own inputs by construction. No equations, self-definitional loops, or load-bearing uniqueness theorems from prior author work are invoked to force the reported results. The SOTA claims rest on independently verifiable public leaderboards.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions of reinforcement learning for alignment and dynamic routing for efficiency; no new physical entities are postulated.

free parameters (1)
  • RL training hyperparameters
    Typical learning rates, batch sizes, and reward scaling factors in the Cascade RL stages are not enumerated in the abstract.
axioms (1)
  • domain assumption: Offline RL followed by online RL produces stable and refined reasoning improvements in multimodal models
    Invoked as the core of the Cascade RL framework without further justification in the abstract.

pith-pipeline@v0.9.0 · 5858 in / 1269 out tokens · 43130 ms · 2026-05-10T11:53:53.190841+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  3. RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

    cs.AI 2026-05 unverdicted novelty 8.0

    RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

  4. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  5. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

  6. Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

    cs.CV 2026-05 unverdicted novelty 8.0

    Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.

  7. From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

    cs.SE 2026-04 unverdicted novelty 8.0

    MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...

  8. Lost in Translation: Do LVLM Judges Generalize Across Languages?

    cs.CL 2026-04 unverdicted novelty 8.0

    MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

  9. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 8.0

    ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...

  10. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 8.0

    VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

  11. RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

    cs.CV 2026-04 unverdicted novelty 8.0

    RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

  12. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  13. SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    SceneFunRI benchmark shows current VLMs struggle severely with inferring locations of invisible functional objects, with the strongest model (Gemini 3 Flash) reaching only 15.20 CAcc@75.

  14. GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...

  15. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  16. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  17. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  18. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  19. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  20. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  21. UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.

  22. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  23. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  24. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

    cs.CL 2026-05 unverdicted novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

  25. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  26. GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.

  27. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  28. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  29. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

  30. MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single mode...

  31. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  32. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  33. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

  34. SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

  35. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  36. ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

    cs.CV 2026-05 unverdicted novelty 7.0

    ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

  37. Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

  38. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  39. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.

  40. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.

  41. TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity

    cs.CL 2026-05 unverdicted novelty 7.0

    TableVista benchmark finds foundation models maintain performance across visual styles but degrade sharply on complex table structures and vision-only settings.

  42. MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

    cs.AI 2026-05 unverdicted novelty 7.0

    MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.

  43. Visual Text Compression as Measure Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...

  44. StableI2I: Spotting Unintended Changes in Image-to-Image Transition

    cs.CV 2026-05 unverdicted novelty 7.0

    StableI2I is a unified evaluation framework and benchmark that measures content fidelity and pre-post consistency in image-to-image tasks without needing reference images, showing strong correlation with human judgments.

  45. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  46. VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoNet is a new large-scale benchmark and training dataset for domain-specific action recognition that exposes limitations in VLMs and enables smaller fine-tuned models to surpass larger open-weight ones.

  47. Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

    cs.CL 2026-05 unverdicted novelty 7.0

    Prompt reformulations induce high variance in first-token safety probabilities from zero-shot VLMs, and a training-free mean ensemble over prompt families improves NLL on all tested pairs and ECE on most relative to s...

  48. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  49. Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

    cs.IR 2026-04 unverdicted novelty 7.0

    FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.

  50. Don't Pause! Every prediction matters in a streaming video

    cs.CV 2026-04 unverdicted novelty 7.0

    SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

  51. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.

  52. Can Multimodal Large Language Models Truly Understand Small Objects?

    cs.CV 2026-04 unverdicted novelty 7.0

    Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

  53. Grounding Video Reasoning in Physical Signals

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...

  54. Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the diagnosis-driven CE video summarization task, the VideoCAP dataset with 240 annotated videos, and the DiCE framework that outperforms prior methods by screening candidates then weaving them into diagnos...

  55. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  56. SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...

  57. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  58. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  59. MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

    cs.CL 2026-04 unverdicted novelty 7.0

    MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.

  60. E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes

    cs.CV 2026-04 unverdicted novelty 7.0

    E3VS-Bench supplies 99 3D Gaussian Splatting scenes and 2,014 episodes to test whether embodied agents can use unrestricted 5-DoF viewpoint control to answer questions that depend on fine-grained visual details visibl...

Reference graph

Works this paper leans on

190 extracted references · 190 canonical work pages · cited by 182 Pith papers · 50 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024. 11, 12, 13, 14, 18, 19, 21

  3. [3]

    Introducing claude 4: Claude sonnet 4 and claude opus 4

    Anthropic. Introducing claude 4: Claude sonnet 4 and claude opus 4, May 2025. 21

  4. [4]

    Claude 3.7 sonnet system card

    Anthropic. Claude 3.7 sonnet system card. 2025. 18

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21

  6. [6]

    Smollm-corpus

    Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. Smollm-corpus, July

  7. [7]

    Windows agent arena: Evaluating multi-modal os agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024. 18

  8. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 5

  9. [9]

    Internlm2 technical report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 4, 20

  10. [10]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 15

  11. [11]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024. 2, 8, 14

  12. [12]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 1, 21

  13. [13]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 2, 3, 4, 5, 7, 11

  14. [14]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 3

  15. [15]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 2, 3

  16. [16]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024. 10, 18

  17. [17]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 17

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 20

  19. [19]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 21

  20. [20]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023. 9, 11, 12, 13, 14, 22

  21. [21]

    Xtuner: A toolkit for efficiently fine-tuning llm

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023. 7

  22. [22]

    Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model

    X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model. https://x.ai/blog/grok-1.5v, 2024. 13

  23. [23]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025. 1

  24. [24]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025. 1

  25. [25]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402, 2024. 12

  26. [26]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024. 7

  27. [27]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 7

  28. [28]

    Gemini 2.5 pro

    DeepMind. Gemini 2.5 pro. https://deepmind.google/technologies/gemini/pro/, 2025. 18

  29. [29]

    Introducing gemini 2.0: our new ai model for the agentic era

    Google Deepmind. Introducing gemini 2.0: our new ai model for the agentic era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/, 2024. 11, 18, 19

  30. [30]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024. 12

  31. [31]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024. 8, 9

  32. [32]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  33. [33]

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024. 17

  34. [34]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 14

  35. [35]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 8, 17

  36. [36]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024. 2, 12, 13

  37. [37]

    Mini-internvl: A flexible-transfer pocket multimodal model with 5% parameters and 90% performance

    Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: A flexible-transfer pocket multimodal model with 5% parameters and 90% performance. arXiv preprint arXiv:2410.16261, 2024. 2, 3

  38. [38]

    Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence

    Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, and Rongrong Ji. Space-10: A comprehensive benchmark for multimodal large language models in compositional spatial intelligence. arXiv preprint arXiv:2506.07966, 2025. 1, 2, 10, 19

  39. [39]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025. 18

  40. [40]

    Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. arXiv preprint arXiv:2410.18558, 2024. 12

  41. [41]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023. 2, 8, 14, 15

  42. [42]

    Seed1.5-VL technical report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025. 11, 18

  43. [43]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024. 1, 8

  44. [44]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In The International Conference on Learning Representations,

  45. [45]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

  46. [46]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025. 1, 3, 9, 10, 11, 12, 13, 14, 16, 18, 19, 20, 21

  47. [47]

    Liger-kernel: Efficient triton kernels for llm training

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, and Zhipeng Wang. Liger-kernel: Efficient triton kernels for llm training. In Championing Open-source DEvelopment in ML Workshop @ ICML25, 2025. 7

  48. [48]

    Os agents: A survey on mllm-based agents for general computing devices use

    Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, et al. Os agents: A survey on mllm-based agents for general computing devices use. arXiv preprint arXiv:2508.04482, 2025. 1

  49. [49]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024. 1, 20

  50. [50]

    OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135,

  51. [51]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 12, 13

  52. [52]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 20

  53. [53]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024. 5

  54. [54]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 787–798, 2014. 15

  55. [55]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, pages 235–251, 2016. 2, 8, 11, 12

  56. [56]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 20

  57. [57]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 20

  58. [58]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 17

  59. [59]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024. 11, 12

  60. [60]

    R-bench: Are your large multimodal model robust to real-world corruptions?

    Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, Weisi Lin, et al. R-bench: Are your large multimodal model robust to real-world corruptions? arXiv preprint arXiv:2410.05474, 2024. 13

  61. [61]

    CMMLU: Measuring Massive Multitask Language Understanding in Chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023. 10, 20

  62. [62]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 17

  63. [63]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 2, 8, 17

  64. [64]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023. 14, 15

  65. [65]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023. 7, 8

  66. [66]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 12, 17

  67. [67]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent, 2024. 18

  68. [68]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 7, 21

  69. [69]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2025. 15

  70. [70]

    Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models

    Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, and Liang He. Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. arXiv preprint arXiv:2409.02834, 2024. 7

  71. [71]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281,

  72. [72]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 2, 8, 11, 12

  73. [73]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint, 2024. 4

  74. [74]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. 17

  75. [75]

    Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain

    Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, and Yuan Qi. Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain, 2025. 4

  76. [76]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 1, 2, 8, 10, 11

  77. [77]

    Ovis: Structural embedding alignment for multimodal large language model

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024. 11, 12

  78. [78]

    Wildvision: Evaluating vision-language models in the wild with human preferences

    Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin. Wildvision: Evaluating vision-language models in the wild with human preferences. arXiv preprint arXiv:2406.11069, 2024. 8, 13

  79. [79]

    Mono-internvl-1.5: Towards cheaper and faster monolithic multimodal large language models

    Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, and Jifeng Dai. Mono-internvl-1.5: Towards cheaper and faster monolithic multimodal large language models. arXiv preprint arXiv:2507.12566, 2025. 2

  80. [80]

    Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training

    Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. arXiv preprint arXiv:2410.08202, 2024. 2, 10

Showing first 80 references.