super hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (47%).

717 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 717 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 120 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 113 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

co-cited works

representative citing papers

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Common to Whom? Regional Cultural Commonsense and LLM Bias in India

cs.CL · 2026-01-22 · unverdicted · novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

cs.CV · 2026-01-01 · unverdicted · novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

cs.CV · 2025-12-03 · accept · novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

citing papers explorer

Showing 49 of 49 citing papers after filters.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 2 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CL · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India cs.CL · 2026-01-22 · unverdicted · none · ref 5 · internal anchor
Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 79 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language cs.CL · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
GeoBuildBench is a new benchmark requiring LLMs to generate executable geometry constructions from text, revealing frequent hallucinations, missing objects, and constraint failures in state-of-the-art models.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection cs.CL · 2026-05-09 · accept · none · ref 5 · internal anchor
CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity cs.CL · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
TableVista benchmark finds foundation models maintain performance across visual styles but degrade sharply on complex table structures and vision-only settings.
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 89 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts cs.CL · 2026-04-20 · unverdicted · none · ref 66 · internal anchor
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 50 · internal anchor
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts cs.CL · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework cs.CL · 2026-02-23 · unverdicted · none · ref 1 · internal anchor
Prune-then-Merge combines adaptive pruning of low-signal patches with hierarchical merging to achieve higher compression rates and better performance than prior single-stage methods in visual document retrieval.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 13 · internal anchor
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval cs.CL · 2026-05-04 · unverdicted · none · ref 8
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue cs.CL · 2026-05-31 · unverdicted · none · ref 51 · internal anchor
RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models cs.CL · 2026-05-19 · conditional · none · ref 2 · internal anchor
Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 30 · internal anchor
ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 78 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging cs.CL · 2026-05-13 · conditional · none · ref 60 · 2 links · internal anchor
DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal performance.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification cs.CL · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing cs.CL · 2026-05-05 · unverdicted · none · ref 21 · internal anchor
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 10 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue cs.CL · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
Incremental visual scaffolding using multimodal models improves persistent common ground representation in situated dialogue by reducing representational blur compared to text-only approaches, with hybrid text-visual yielding best results on the IndiRef benchmark.
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models cs.CL · 2026-04-11 · unverdicted · none · ref 3 · internal anchor
AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocation and related tasks, supported by the DecaTARA benchmark of 67,941 videos.
Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing cs.CL · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
BAIM enriches knowledge tracing item representations by deriving stage-level embeddings from Polya's four problem-solving stages and routing them adaptively per learner context, yielding consistent gains over pretraining baselines on two datasets.
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era cs.CL · 2026-02-26 · unverdicted · none · ref 9 · internal anchor
CiteAudit supplies a human-validated benchmark and multi-agent verification system that outperforms existing LLMs and commercial tools at detecting hallucinated scientific references.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 7 · internal anchor
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding cs.CL · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
STAMP trains explicit memory for mobile GUI agents via virtual environments with controlled memory injection, achieving SOTA on the new Memory-World benchmark.
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models cs.CL · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or beat undistilled 8B baselines.
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis cs.CL · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpretability.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 8 · internal anchor
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants cs.CL · 2026-06-11 · unverdicted · none · ref 1 · internal anchor
SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B tests.
Tracing the ongoing emergence of human-like reasoning in Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 84 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
Ministral 3 cs.CL · 2026-01-13 · unverdicted · none · ref 4 · internal anchor
Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task cs.CL · 2026-05-20 · accept · none · ref 17 · internal anchor
A retrieval-augmented two-stage system using Qwen2.5-VL for Spanish captions and Gemini 2.5 Flash for target-language generation achieves over 120% chrF++ gains on three Indigenous languages and wins the shared task.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction cs.CL · 2026-05-11 · unreviewed · ref 2 · 2 links · internal anchor
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CL · 2026-05-11 · unreviewed · ref 2 · internal anchor
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers cs.CL · 2026-05-08 · unreviewed · ref 3 · 2 links · internal anchor
ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs cs.CL · 2026-04-07 · unreviewed · ref 7 · internal anchor
HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models cs.CL · 2026-03-31 · unreviewed · ref 9 · internal anchor
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis cs.CL · 2026-04-27 · unreviewed · ref 3

Qwen3-VL Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer