super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

768 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 768 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion

cs.CV · 2026-06-26 · accept · novelty 7.0

SpatialUAV is a new real-world benchmark dataset and evaluation suite exposing large gaps between vision-language models and human performance on spatial tasks for low-altitude UAVs.

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

citing papers explorer

Showing 50 of 73 citing papers after filters.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation cs.AI · 2026-05-08 · unverdicted · none · ref 29 · internal anchor
RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments cs.AI · 2026-04-30 · accept · none · ref 1 · internal anchor
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
RAISE: RAG Design as an Architecture Search Problem cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
RAISE is a standardized benchmark for RAG hyperparameter optimization that evaluates 13 search algorithms across seven datasets and finds performance is highly task-dependent.
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
OmniMatBench is a new human-calibrated benchmark for multimodal materials-science reasoning that reveals the best evaluated MLLM scores only 0.372 overall.
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations? cs.AI · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability cs.AI · 2026-05-21 · unverdicted · none · ref 37 · internal anchor
Introduces Synergistic Faithfulness metric based on Shapley Interaction Index to evaluate cross-modal synergy in VLM explainers, revealing over-reliance on visual salience in existing methods.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI · 2026-05-18 · unverdicted · none · ref 29 · 2 links · internal anchor
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control cs.AI · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
From Table to Cell: Attention for Better Reasoning with TABALIGN cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 27 · 2 links · internal anchor
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD cs.AI · 2026-05-11 · unverdicted · none · ref 16 · 2 links · internal anchor
BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
Active Testing of Large Language Models via Approximate Neyman Allocation cs.AI · 2026-05-11 · unverdicted · none · ref 1 · 2 links · internal anchor
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings versus uniform sampling.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents cs.AI · 2026-05-09 · unverdicted · none · ref 38 · 3 links · internal anchor
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs cs.AI · 2026-05-07 · unverdicted · none · ref 4 · 2 links · internal anchor
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks cs.AI · 2026-05-05 · unverdicted · none · ref 13 · internal anchor
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 61 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
Faithful Mobile GUI Agents with Guided Advantage Estimator cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI · 2026-05-01 · accept · none · ref 1 · 2 links · internal anchor
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images cs.AI · 2026-04-30 · unverdicted · none · ref 5 · internal anchor
SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling method that improves results.
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification cs.AI · 2026-04-18 · unverdicted · none · ref 5 · internal anchor
Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planning to reduce constraint violations by 19.26% and improve task completion.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables cs.AI · 2026-04-04 · conditional · none · ref 2 · internal anchor
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
Visual Graph Scaffolds for Structural Reasoning in Large Language Models cs.AI · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
Visual graph mind maps outperform text-flattened versions as internal reasoning scaffolds for LLMs on multi-hop QA, with the advantage holding after fine-tuning and distillation.
ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
ReasonLight uses multimodal foundation models to refine RL-proposed traffic signal phases based on camera images and sensor data, enabling zero-shot adaptation to unseen events such as emergency vehicle priority.
MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing cs.AI · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
MACReD is a multi-agent collaborative reasoning framework for reaction diagram parsing that reports state-of-the-art F1 scores of 75.2% and 84.6% on the RxnScribe benchmark.
Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions cs.AI · 2026-05-25 · unverdicted · none · ref 3 · internal anchor
POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR cs.AI · 2026-05-19 · unverdicted · none · ref 14 · internal anchor
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Reinforcing VLAs in Task-Agnostic World Models cs.AI · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor
RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification cs.AI · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
MolDeTox is a new benchmark that shows fragment-level stepwise editing by LLMs and VLMs improves structural validity and detoxification quality over prior toxicity-focused evaluations.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 6 · 3 links · internal anchor
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning cs.AI · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries cs.AI · 2026-05-07 · unverdicted · none · ref 46 · 3 links · internal anchor
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop reasoning tasks.
WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning cs.AI · 2026-04-30 · unverdicted · none · ref 3 · 3 links · internal anchor
A 4B-parameter vision-language model trained on rubric-guided synthetic wafer defect data reaches 6.493 LLM-Judge score, nearly matching Gemini-3-Flash at 7.149 for on-premise industrial use.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology cs.AI · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
Hybrid Decision Making via Conformal VLM-generated Guidance cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
ConfGuide uses conformal risk control to generate targeted guidance sets in a learning-to-guide hybrid decision framework and demonstrates it on multi-label medical diagnosis.
MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games cs.AI · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
MISID is a multimodal multi-turn dataset for intent recognition in strategic deception games, paired with the FRACTAM framework that improves MLLM performance on hidden intent detection via decouple-anchor-reason steps.
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs cs.AI · 2026-04-14 · unverdicted · none · ref 10 · internal anchor
MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals via iterative nullspace projection while transferring strategies through a shared
Narrative-Driven Paper-to-Slide Generation via ArcDeck cs.AI · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
ArcDeck models paper-to-slide generation as narrative reconstruction using discourse parsing and multi-agent refinement, plus a new ArcBench benchmark, to improve flow and coherence over direct summarization.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training cs.AI · 2026-04-12 · unverdicted · none · ref 2 · 2 links · internal anchor
TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer