Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (48%).

1181 Pith papers citing it

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

citation-role summary

background: 122 · method: 61 · baseline: 50 · dataset: 5 · other: 4

citation-polarity summary

background: 115 · use method: 61 · baseline: 50 · unclear: 10 · use dataset: 5 · support: 1

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations are prior-driven, and depth relations are inverted.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

Introduces the first large-scale multimodal benchmark MedLayXPlain-122K showing medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.

Vision-language models for chest radiography do not always need the image

cs.CV · 2026-06-16 · accept · novelty 8.0

A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

citing papers explorer

Showing 50 of 762 citing papers after filters.

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation cs.CV · 2026-06-10 · unverdicted · none · ref 39 · internal anchor
ARGUS converts MLLM-selected identity evidence into a synchronized 3x3 mosaic injected as negative-time memory in a diffusion model, plus supporting training techniques, to achieve SOTA subject preservation on human video benchmarks.
CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence cs.CV · 2026-06-09 · unverdicted · none · ref 30 · internal anchor
CoCoSI is a training-free multi-agent system for collaborative cognitive map construction that improves spatial understanding in arbitrary pretrained MLLMs.
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur? cs.CV · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
Introduces Ego-MC-Bench benchmark and Ego-CoMist synthetic dataset showing that fine-tuning video LLMs on proactive mistake corrections improves performance especially for smaller models.
Temporal-Aware Reasoning Optimization for Video Temporal Grounding cs.CV · 2026-06-08 · unverdicted · none · ref 93 · internal anchor
TaRO improves video temporal grounding in MLLMs via constructive reasoning exploration from dense captions and a temporal-sensitivity reward that uses logit drops on disrupted event boundaries, followed by curriculum learning, and reports SOTA results.
HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging cs.CV · 2026-06-08 · unverdicted · none · ref 49 · internal anchor
HDRAgent is the first agent-driven framework for multi-exposure HDR imaging that uses MLLM scene perception, contextual knowledge matching, and perception-distortion feedback to reduce ghosting artifacts.
Harnessing Streaming Video in the Wild cs.CV · 2026-06-07 · unverdicted · none · ref 2 · internal anchor
Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
Streaming Video Generation with Streaming Force Control cs.CV · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning cs.CV · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
VeriDrive introduces a verifiable counterfactual supervision framework using a Perception-Evaluation-Revision chain and validator-guided correction to generate cost-efficient structured data for vision-language driving models, showing metric gains on nuScenes.
Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors cs.CV · 2026-06-05 · unverdicted · none · ref 2 · internal anchor
Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.
Diagnosing Visual Ignorance in Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 43 · internal anchor
VLMs show language-prior reliance via multi-stage bottlenecks in visual retrieval and suppression, with many benchmark examples remaining answerable under severe visual obfuscation.
PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding cs.CV · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
PAR3D is a part-aware 3D-MLLM framework with ScenePart dataset, Part-Aware 3D Representation Learning, and Hierarchical Segmentation Query Generation to improve part-level 3D scene understanding.
DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models cs.CV · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
DRIFT adapts pretrained VLMs to continuous decoding via a base predictor plus residual flow matching, outperforming regression and generative baselines on grounding and robotic control tasks.
TextWand: A Unified Framework for Scene Text Editing cs.CV · 2026-06-04 · unverdicted · none · ref 61 · internal anchor
TextWand unifies scene text removal, generation and replacement via rendering/erasure decomposition, ORPE for layout fidelity, RAS for clean erasure, and the new TextWand-Bench dataset, claiming superior accuracy and quality over prior models.
ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation cs.CV · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
ViCuR introduces recoverable visual cues as teacher privilege in multimodal on-policy distillation, yielding +1.19 to +1.24 average gains over answer-based baselines across seven benchmarks with Qwen3-VL students.
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions cs.CV · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
ShotCrop uses three-stage training (CoT SFT, pseudo-label semi-supervised, GRPO-S) to produce triple-shot compositions and reports 2.82x better shot localization than GPT-5 on a 1.2k expert benchmark.
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark cs.CV · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
WorldBench is a visually diverse multimodal reasoning benchmark where the strongest of 15 tested MLLMs reaches only 64% accuracy.
SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation cs.CV · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
SD-GRPO extends GRPO by computing per-segment advantages via z-normalization of verifiable segment rewards, yielding gains on long-form VL tasks with varying semantic independence across segments.
TGV-KV: Text-Grounded KV Eviction for Vision-Language Models cs.CV · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
TGV-KV uses text-vision budgeting, weighted ranking, and prioritised retention to evict KV cache in VLMs while retaining 99.2% accuracy at 5% budget on VizWiz-VQA.
AdaCodec: A Predictive Visual Code for Video MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 51 · internal anchor
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models cs.CV · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
VLMs recognize rotated images when shown directly but fail to predict rotated outcomes from originals on the new RotOutBench benchmark.
InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark cs.CV · 2026-06-01 · unverdicted · none · ref 8 · internal anchor
The paper creates InsightVQA, a 725K QA-pair benchmark with perception, grounded-understanding, and cognition levels for emotion-cognitive visual question answering, plus a 30K-sample evaluation set and InsightNet baseline.
InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models cs.CV · 2026-06-01 · unverdicted · none · ref 32 · internal anchor
InfoMerge proposes a training-free visual token compression method for Video-LLMs that uses Temporal Fingerprint Difference for redundancy estimation and Content-Aware Budget Allocation to retain 98.8% performance with 85% fewer tokens.
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention cs.CV · 2026-06-01 · unverdicted · none · ref 7 · internal anchor
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models cs.CV · 2026-06-01 · unverdicted · none · ref 30 · internal anchor
STaR-KV is a training-free KV cache compression framework for GUI VLMs that uses subspace-aware scoring, temporal stability discounts, and entropy-based temperature adaptation to outperform prior methods at matched budgets while reducing peak memory by ~40% at 20% cache size.
PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion cs.CV · 2026-05-31 · unverdicted · none · ref 1 · internal anchor
PAI-Studio reformulates cinematic background replacement as in-context conditional generation inside a Diffusion Transformer with bidirectional attention, trained on a new 30K film-sourced dataset, and reports better motion consistency and relighting than prior open-source and commercial systems.
Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning cs.CV · 2026-05-31 · unverdicted · none · ref 1 · internal anchor
Reasmory turns 3D reconstruction into validated program-executable memory for VLMs, yielding 6-18% gains on spatial reasoning benchmarks over direct baselines.
RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes cs.CV · 2026-05-30 · unverdicted · none · ref 1 · internal anchor
RoboStressBench decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.
FlowNar: Scalable Streaming Narration for Long-Form Videos cs.CV · 2026-05-30 · unverdicted · none · ref 2 · internal anchor
FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.
Zamba2-VL Technical Report cs.CV · 2026-05-29 · unverdicted · none · ref 63 · internal anchor
Zamba2-VL is a family of 1.2B–7B hybrid Mamba2-transformer vision-language models that match leading transformer VLMs on image, reasoning, OCR, grounding and counting benchmarks while delivering roughly 10x lower time-to-first-token.
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement cs.CV · 2026-05-29 · unverdicted · none · ref 53 · internal anchor
StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.
SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models cs.CV · 2026-05-29 · unverdicted · none · ref 6 · 2 links · internal anchor
SOCO is a new benchmark for semantic object correspondence that provides taxonomy, annotations, and language labels to evaluate part-level understanding in vision and multimodal foundation models.
nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving cs.CV · 2026-05-29 · unverdicted · none · ref 83 · internal anchor
nuReasoning is a new real-world dataset and benchmark extending nuScenes/nuPlan with 20k clips and multi-type reasoning annotations to evaluate and improve reasoning in long-tail autonomous driving.
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
Task-Focused Memorization for Multimodal Agents cs.CV · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
PEEK: Picking Essential frames via Efficient Knowledge distillation cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
PEEK distills caption-conditioned frame relevance into a lightweight visual model, outperforming adaptive baselines on ActivityNet Captions and MSR-VTT especially at 1-2 frame budgets while adding only 5.2% overhead.
Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
The paper diagnoses Template Collapse in 3D medical VLMs for CT reports and introduces CLarGen, a decoupled detection-plus-synthesis framework that raises macro-F1 from 0.189 to 0.487.
VLM3: Vision Language Models Are Native 3D Learners cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
GPIC: A Giant Permissive Image Corpus for Visual Generation cs.CV · 2026-05-28 · unverdicted · none · ref 19 · internal anchor
GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion cs.CV · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.
AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
AgentCVR uses coordinated multi-agent active search and script-simulated RL to improve cross-video reasoning in MLLMs over single-pass baselines.
Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning cs.CV · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
AnyMo is a masked-modeling framework for any-modality human motion generation trained on the new OmniHuMo dataset of 5,000+ hours of multimodal motion sequences.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
Self-Prophetic Decoding to Unlock Visual Search in LVLMs cs.CV · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.
DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving cs.CV · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.
Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation cs.CV · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control cs.CV · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.
Reflective Dialogue between Teacher and Solver Agents for Video Question Answering cs.CV · 2026-05-27 · unverdicted · none · ref 15 · internal anchor
A multi-turn reflective dialogue between Teacher and Solver agents constructs richer context from support examples than standard in-context learning, improving video QA on the EgoCross benchmark.
Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought cs.CV · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
SegWorld adds multi-level visual CoT and proactive scene observation to segmentation models, formalized as probabilistic inference, and shows gains on an intent-to-part benchmark for affordance segmentation.
Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models cs.CV · 2026-05-26 · unverdicted · none · ref 1 · internal anchor
VideoLLMs process videos as unordered collections of events and hallucinate cross-segment interactions when distractions are inserted, a behavior observed across all 11 tested models.