Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

831 Pith papers citing it

Background 47% of classified citations

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background 121 · method 61 · baseline 50 · dataset 5 · other 4

citation-polarity summary

background 114 · use method 61 · baseline 50 · unclear 10 · use dataset 5 · support 1
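
The page does not say how the 47% background figure is derived. A minimal sketch, assuming a label's percentage is its count divided by the sum of the counts listed in a summary (the denominator is an assumption, not something the page states), suggests the figure matches the polarity summary rather than the role summary. The dictionary and function names are illustrative only.

```python
# Counts copied from the citation-role and citation-polarity summaries above.
role_counts = {"background": 121, "method": 61, "baseline": 50,
               "dataset": 5, "other": 4}
polarity_counts = {"background": 114, "use method": 61, "baseline": 50,
                   "unclear": 10, "use dataset": 5, "support": 1}


def shares(counts):
    """Return each label's share of the summed counts as a rounded percentage.

    Assumption: the denominator is the sum of the listed counts.
    """
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_total = sum(role_counts.values())
polarity_total = sum(polarity_counts.values())
role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print("role total:", role_total, "background share:", role_shares["background"])
print("polarity total:", polarity_total,
      "background share:", polarity_shares["background"])
```

Both summaries sum to 241 classified citations. Under this assumption the polarity count of 114 gives 47%, while the role count of 121 gives about 50%.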

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in zero-shot setting.

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Goku supplies a 2M-scale dataset, synthesis pipeline, decoupled dual-branch model, and 1000-case benchmark for multi-task instruction-based video editing, reporting up to 8% gains in instruction following.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

citing papers explorer

Showing 50 of 831 citing papers.

Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 15 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation cs.RO · 2026-04-03 · unverdicted · none · ref 34 · internal anchor
FSUNav's dual brain-inspired modules achieve state-of-the-art zero-shot goal navigation across heterogeneous robots with improved speed, safety, and generalization.
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 3 · internal anchor
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding cs.AI · 2026-04-03 · unverdicted · none · ref 3 · internal anchor
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
LumiVideo: An Intelligent Agentic System for Video Color Grading cs.CV · 2026-04-02 · unverdicted · none · ref 2 · internal anchor
LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.
Multimodal Language Models Cannot Spot Spatial Inconsistencies cs.CV · 2026-04-01 · unverdicted · none · ref 6 · internal anchor
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA cs.RO · 2026-03-31 · unverdicted · none · ref 4 · internal anchor
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing cs.CV · 2026-03-25 · conditional · none · ref 2 · internal anchor
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training cs.CV · 2026-03-25 · unverdicted · none · ref 3 · internal anchor
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
Teaching an Agent to Sketch One Part at a Time cs.AI · 2026-03-19 · unverdicted · none · ref 2 · internal anchor
A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.
RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization cs.CV · 2026-03-13 · unverdicted · none · ref 3 · internal anchor
A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.
Logics-Parsing-Omni Technical Report cs.AI · 2026-03-10 · unverdicted · none · ref 4 · internal anchor
Omni Parsing framework converts complex multimodal signals into locatable, enumerable, and traceable structured knowledge via hierarchical detection, recognition, and interpreting with strict evidence alignment.
TIQA: Human-Aligned Perceptual Text Quality Assessment in Generated Images cs.CV · 2026-03-07 · unverdicted · none · ref 8 · internal anchor
TIQA introduces datasets and a model that predict human perceptual quality of rendered text in AI images, achieving PLCC 0.942 on crops and improving selected image text quality by 0.36 MOS.
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks cs.CV · 2026-03-04 · unverdicted · none · ref 1 · internal anchor
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 124 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine cs.CV · 2026-03-02 · conditional · none · ref 2 · internal anchor
Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents cs.AI · 2026-03-01 · unverdicted · none · ref 5 · internal anchor
HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era cs.CL · 2026-02-26 · unverdicted · none · ref 9 · internal anchor
CiteAudit supplies a human-validated benchmark and multi-agent verification system that outperforms existing LLMs and commercial tools at detecting hallucinated scientific references.
VLANeXt: Recipes for Building Strong VLA Models cs.CV · 2026-02-20 · conditional · none · ref 1 · internal anchor
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning cs.CV · 2026-02-11 · unverdicted · none · ref 1 · internal anchor
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 18 · internal anchor
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation cs.CV · 2026-02-05 · unverdicted · none · ref 4 · internal anchor
MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 7 · internal anchor
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding cs.CL · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework cs.CV · 2026-01-27 · unverdicted · none · ref 4 · internal anchor
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding cs.CV · 2026-01-21 · unverdicted · none · ref 5 · internal anchor
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch cs.CV · 2026-01-20 · conditional · none · ref 3 · internal anchor
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos cs.CV · 2026-01-11 · conditional · none · ref 1 · internal anchor
VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding cs.RO · 2025-12-27 · conditional · none · ref 16 · internal anchor
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV · 2025-12-11 · unverdicted · none · ref 2 · internal anchor
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO cs.CV · 2025-03-12 · unverdicted · none · ref 1 · internal anchor
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents cs.HC · 2026-05-04 · unverdicted · none · ref 2
Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.
DOPD: Dual On-policy Distillation cs.AI · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms cs.CV · 2026-06-29 · unverdicted · none · ref 7 · internal anchor
APRIL-MedSeg is a new open-source modular toolbox that uses YAML configuration and component registries to unify multiple advanced paradigms for medical image segmentation.
On the Faithfulness of Post-Hoc Concept Bottleneck Models cs.CV · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
Post-hoc CBMs produce unfaithful concept projections due to covariate shifts and systematic label noise; new metrics are introduced to measure faithfulness separately from accuracy.
REAR: Test-time Preference Realignment through Reward Decomposition cs.CL · 2026-06-29 · unverdicted · none · ref 102 · internal anchor
REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear combination of token log-probabilities for efficient integration with best-of-N and tree search.
Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning cs.AI · 2026-06-29 · unverdicted · none · ref 12 · internal anchor
Faithful Warm-Start pre-training on causally consistent vision-language samples improves accuracy, stabilizes RL, and reduces unsupported reasoning in VLMs.
OpenSPM: An Environment-Transferable Robotic Key Spatial Pose Memory and Closed-Loop High-Frequency Flow-Matching Action Generation Model cs.RO · 2026-06-29 · unverdicted · none · ref 28 · internal anchor
OpenSPM extracts key spatial poses from demonstrations as transferable memory and generates high-frequency actions via flow-matching, achieving 85.6% success rate on ten LIBERO-GOAL tasks at 1033.3 Hz.
Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation cs.CV · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
OPPO is an evidence-aware preference optimization that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.
MR-IQA: A Unified Margin View of Regression and Ranking for Blind Image Quality Assessment cs.CV · 2026-06-29 · unverdicted · none · ref 2 · internal anchor
Regression and ranking in BIQA both fit quality margins, enabling a unified margin optimization framework called MR-IQA for RL-based assessment.
SAD-GS: Learning Reliable 3D Semantic Gaussian Fields via Dynamic Geo-Semantic Anchoring cs.CV · 2026-06-28 · unverdicted · none · ref 1 · internal anchor
SAD-GS proposes dynamic geo-semantic anchoring via SAD and GSFL to learn reliable 3D semantic Gaussian fields, reporting best performance on LERF-OVS, 3D-OVS, and Mip-NeRF360 for open-vocabulary localization and segmentation.
TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts cs.CV · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
TextDS uses a data-efficient dual-encoder with SWLoRA and CSF to achieve competitive scene text detection robustness under distribution shifts and adverse conditions using 4.9M trainable parameters.
ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures cs.CV · 2026-06-26 · unverdicted · none · ref 62 · internal anchor
ReScene introduces HierView for view prioritization and Relation-Aware Assembly for scene graph fusion, reporting 17% lower Chamfer Distance and 26% lower LPIPS than prior baselines on ScanNet while running faster.
S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation cs.RO · 2026-06-26 · unverdicted · none · ref 24 · internal anchor
S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.
Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety cs.CV · 2026-06-23 · unverdicted · none · ref 2 · internal anchor
Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining general capabilities.
Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling cs.CV · 2026-06-23 · unverdicted · none · ref 1 · internal anchor
AdaQ is a training-free adaptive quasi-Gaussian sampling method for keyframe selection that improves long-video understanding in MLLMs and can outperform GPT-4o with 64 frames.
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models cs.RO · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
Agent Skills Should Go Beyond Text: The Case for Visual Skills cs.CV · 2026-05-31 · unverdicted · none · ref 3 · internal anchor
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition cs.AI · 2026-05-31 · unverdicted · none · ref 67 · internal anchor
PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via reweighting.
Linear Scaling Video VLMs for Long Video Understanding cs.CV · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.
