super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

717 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 717 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

citing papers explorer

Showing 50 of 118 citing papers after filters.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings cs.CV · 2026-05-08 · conditional · none · ref 11 · internal anchor
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 2 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models cs.CV · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India cs.CL · 2026-01-22 · unverdicted · none · ref 5 · internal anchor
Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs cs.CV · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning cs.MM · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training cs.DC · 2026-05-11 · unverdicted · none · ref 7 · 2 links · internal anchor
BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 1 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 3 · internal anchor
STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection cs.CL · 2026-05-09 · accept · none · ref 5 · internal anchor
CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents cs.RO · 2026-05-08 · unverdicted · none · ref 51 · 2 links · internal anchor
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning q-bio.GN · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real expression data.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 18 · 3 links · internal anchor
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 1 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 89 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding eess.AS · 2026-04-24 · unverdicted · none · ref 35 · internal anchor
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
S-GRPO: Unified Post-Training for Large Vision-Language Models cs.LG · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 2 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables cs.AI · 2026-04-04 · conditional · none · ref 2 · internal anchor
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation cs.CV · 2026-04-04 · unverdicted · none · ref 1 · internal anchor
PortraitCraft supplies a new 50k-image dataset and two tasks for evaluating AI on fine-grained portrait composition understanding and constrained generation.
QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight cs.RO · 2026-04-03 · unverdicted · none · ref 32 · internal anchor
QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselines in simulations and achieving real-world speeds up to 5 m/s.
Internalized Reasoning for Long-Context Visual Document Understanding cs.CV · 2026-03-31 · unverdicted · none · ref 44 · internal anchor
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 2 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 13 · internal anchor
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval cs.CL · 2026-05-04 · unverdicted · none · ref 8
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 4 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer