super hub Mixed citations

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Hengjun Pu, Lixin Gu, Long Cui, Weiyun Wang, Xingguang Wei, Zhangwei Gao · 2025 · cs.CV · arXiv 2508.18265

Mixed citation behavior. Most common role is background (49%).

277 Pith papers citing it

Background 49% of classified citations

open full Pith review browse 277 citing papers more from Hengjun Pu arXiv PDF

abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05$\times$ inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 52 baseline 41 method 12 dataset 1 other 1

citation-polarity summary

background 52 baseline 41 use method 11 unclear 2 use dataset 1

claims ledger

abstract We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynami

authors

Hengjun Pu Lixin Gu Long Cui Weiyun Wang Xingguang Wei Zhangwei Gao

co-cited works

representative citing papers

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

cs.CV · 2026-05-16 · unverdicted · novelty 8.0

EPIC-Bench is a new fine-grained benchmark that shows leading VLMs struggle with multi-target counting, part-whole relations, and affordance detection in real-world embodied visual grounding tasks.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

cs.AI · 2026-05-14 · unverdicted · novelty 8.0 · 2 refs

LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

cs.CV · 2026-05-07 · unverdicted · novelty 8.0 · 2 refs

Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

cs.CL · 2026-03-18 · unverdicted · novelty 8.0

Dental-TriageBench is the first expert-annotated multimodal benchmark for hierarchical dental triage and shows a substantial performance gap between 19 MLLMs and junior dentists, especially on multi-domain referral cases.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces Synergistic Faithfulness metric based on Shapley Interaction Index to evaluate cross-modal synergy in VLM explainers, revealing over-reliance on visual salience in existing methods.

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

cs.CV · 2026-05-21 · accept · novelty 7.0

AgroVG is a new multi-source benchmark for agricultural visual grounding formulated as generalized set prediction, with protocols for box and mask grounding across single-target, multi-target, and target-absent queries from six object families.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

citing papers explorer

Showing 12 of 12 citing papers after filters.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis cs.CV · 2026-05-21 · unverdicted · none · ref 74 · internal anchor
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 38 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CV · 2026-05-08 · unverdicted · none · ref 49 · internal anchor
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 108 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
THOM: Generating Physically Plausible Hand-Object Meshes From Text cs.CV · 2026-04-03 · unverdicted · none · ref 65 · internal anchor
THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation cs.CV · 2026-05-11 · unverdicted · none · ref 64 · internal anchor
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
What Limits Vision-and-Language Navigation ? cs.RO · 2026-05-13 · unverdicted · none · ref 52 · internal anchor
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph cs.CV · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 68 · internal anchor
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection cs.CV · 2026-05-05 · unverdicted · none · ref 42 · 3 links · internal anchor
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory cs.CV · 2026-04-10 · unverdicted · none · ref 43 · internal anchor
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models cs.RO · 2026-04-09 · unverdicted · none · ref 87 · internal anchor
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer