pith. machine review for the scientific record.

arxiv: 2308.12966 · v3 · submitted 2023-08-24 · 💻 cs.CV · cs.CL

Recognition: 1 theorem link

· Lean Theorem

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:44 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language model · visual grounding · text reading · image captioning · question answering · multimodal learning · zero-shot · few-shot

The pith

Vision-language models gain localization and text-reading skills by aligning image, caption, and box data during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a series of vision-language models built by adding visual processing to an existing language model foundation. It uses a visual receptor, a custom input-output interface, a three-stage training sequence, and a cleaned multilingual multimodal dataset to enable the models to describe images, answer questions, locate objects, and read text within pictures. The key step is aligning tuples of images, captions, and bounding boxes so the model learns precise spatial references. If this works as claimed, a single generalist model can handle multiple visual tasks at once instead of needing separate specialized tools. This would matter for applications where AI needs to point out or describe specific parts of an image while understanding accompanying text.

Core claim

Starting from a language model base, the models acquire visual capacity through a visual receptor, input-output interface, three-stage training pipeline, and multilingual multimodal cleaned corpus. Aligning image-caption-box tuples adds grounding and text-reading abilities. The resulting models, including the base and chat versions, set new records for generalist models of similar scale on visual-centric benchmarks such as image captioning, question answering, and visual grounding in both zero-shot and few-shot settings, and they also outperform prior vision-language chatbots on real-world dialog tasks.

What carries the argument

Alignment of image-caption-box tuples within a three-stage training pipeline that adds visual capacity to a language model base.
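The load-bearing mechanism is easiest to picture as a data-formatting step: bounding boxes are serialized into the caption text so the language model learns spatial references as ordinary tokens. A minimal sketch, assuming a `<ref>/<box>` markup and a 0-999 normalized coordinate grid (illustrative conventions, not necessarily the paper's exact format):

```python
def serialize_grounded_caption(caption, phrase_spans, boxes, img_w, img_h):
    """Interleave caption phrases with their bounding boxes as text tokens.

    phrase_spans: list of (start, end) character offsets into caption.
    boxes: list of (x1, y1, x2, y2) pixel boxes, one per span.
    Coordinates are mapped to a 0-999 integer grid (an assumed
    convention; the paper's exact scheme may differ).
    """
    out, cursor = [], 0
    for (s, e), (x1, y1, x2, y2) in zip(phrase_spans, boxes):
        out.append(caption[cursor:s])  # plain text before the phrase
        nx1, ny1 = int(999 * x1 / img_w), int(999 * y1 / img_h)
        nx2, ny2 = int(999 * x2 / img_w), int(999 * y2 / img_h)
        out.append(f"<ref>{caption[s:e]}</ref>"
                   f"<box>({nx1},{ny1}),({nx2},{ny2})</box>")
        cursor = e
    out.append(caption[cursor:])  # trailing text after the last phrase
    return "".join(out)

print(serialize_grounded_caption(
    "a dog chasing a ball", [(2, 5), (16, 20)],
    [(10, 20, 110, 220), (300, 400, 340, 440)], 640, 480))
```

Once boxes live in the token stream like this, grounding and text reading need no separate detection head: the model generates and parses coordinates the same way it generates any other text.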

If this is right

  • The models achieve leading results among similar-scale generalists on image captioning, question answering, and visual grounding benchmarks.
  • They maintain strong performance in zero-shot and few-shot evaluation settings without task-specific fine-tuning.
  • The chat-tuned version surpasses existing vision-language chatbots on real-world dialog benchmarks.
  • The models can perform localization and text reading in addition to basic image description and question answering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same training approach could be applied to other language model bases to create additional versatile multimodal systems.
  • Emphasis on multilingual cleaned data may lead to stronger results on visual tasks involving non-English text or captions.
  • These capabilities could support downstream uses such as interactive image analysis tools that reference specific objects by location.

Load-bearing premise

That the specific combination of visual receptor, input-output interface, three-stage training pipeline, multilingual multimodal cleaned corpus, and image-caption-box alignment produces genuine generalization rather than benchmark-specific gains or data artifacts.

What would settle it

A new test set of images containing text and objects in novel combinations where the models fail to outperform prior generalist models of similar size on localization or text-reading accuracy.

Original abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Qwen-VL series of large vision-language models built on the Qwen-LM foundation. It adds visual capacity via a visual receptor, input-output interface, 3-stage training pipeline, and a cleaned multilingual multimodal corpus, with additional image-caption-box alignment to enable grounding and text-reading. The resulting Qwen-VL and Qwen-VL-Chat models are claimed to set new records for generalist models of similar scale on visual-centric benchmarks (captioning, VQA, grounding) in zero- and few-shot settings, and to outperform existing vision-language chatbots on real-world dialog benchmarks. Code, models, and a demo are released.

Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for demonstrating a scalable recipe that extends a strong language model into a versatile generalist capable of localization and text reading in addition to standard VQA and captioning. The open release of code, models, and demo is a clear strength that enables reproducibility and follow-on research.

major comments (3)
  1. [Experiments] Experiments section (and associated tables): the manuscript asserts new state-of-the-art results on multiple benchmarks but provides no component-wise ablations (e.g., 2-stage vs. 3-stage training, with vs. without box alignment, or controlled data-matched baselines against Qwen-LM scale alone). This leaves untested the central claim that the reported gains arise specifically from the visual receptor, 3-stage pipeline, and alignment rather than from data volume/quality or model scale.
  2. [Results] Results tables: no error bars, multiple runs, or statistical significance tests are reported for the benchmark numbers, and the evaluation protocols (exact prompts, few-shot examples, preprocessing) are not fully specified, making it impossible to verify the claimed records or compare fairly to prior work.
  3. [Training Pipeline] Section 3 (training pipeline): the description of the 3-stage training and the image-caption-box alignment objective is high-level; without quantitative isolation of each stage's contribution or details on how the alignment loss interacts with the language-modeling objective, the causal role of these design choices in the final performance remains unclear.
minor comments (2)
  1. [Abstract] The abstract and introduction use the phrase 'set new records' without immediately citing the specific tables or prior best scores being surpassed; adding explicit cross-references would improve readability.
  2. [Model Architecture] Notation for the visual receptor and input-output interface could be made more precise (e.g., explicit tensor shapes or layer dimensions) to aid replication.
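The reproducibility gap flagged in the second major comment (unspecified prompts, few-shot examples, and preprocessing) is cheap to close by logging the exact prompt construction alongside each score. A hypothetical sketch — the function and record fields are invented for illustration, not from the paper:

```python
import json

def build_fewshot_prompt(question, shots, template="Q: {q}\nA: {a}"):
    """Assemble a few-shot prompt and a machine-readable record of
    exactly how it was built, for later verification.

    shots: list of (question, answer) exemplar pairs.
    """
    parts = [template.format(q=q, a=a) for q, a in shots]
    parts.append(f"Q: {question}\nA:")  # the query, answer left open
    prompt = "\n\n".join(parts)
    record = {"template": template, "num_shots": len(shots),
              "prompt": prompt}
    return prompt, json.dumps(record, sort_keys=True)

prompt, record = build_fewshot_prompt(
    "What color is the car?", [("How many dogs?", "2")])
```

Publishing such records per benchmark would let outside groups replay the claimed zero- and few-shot settings token for token.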

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript accordingly to improve clarity, rigor, and reproducibility.

Point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables): the manuscript asserts new state-of-the-art results on multiple benchmarks but provides no component-wise ablations (e.g., 2-stage vs. 3-stage training, with vs. without box alignment, or controlled data-matched baselines against Qwen-LM scale alone). This leaves the central claim that the reported gains arise specifically from the visual receptor, 3-stage pipeline, and alignment rather than from data volume/quality or model scale untested.

    Authors: We agree that explicit component-wise ablations strengthen the attribution of gains to our design choices. In the revised manuscript we add a dedicated ablation subsection that reports: (i) 3-stage vs. 2-stage training on the same data mixture, (ii) performance with and without the image-caption-box alignment objective, and (iii) a controlled comparison against a Qwen-LM-scale baseline trained on identical multimodal data but without the visual receptor and alignment stages. These results indicate that the largest incremental gains arise from the 3-stage pipeline and box alignment rather than data volume alone. revision: yes

  2. Referee: [Results] Results tables: no error bars, multiple runs, or statistical significance tests are reported for the benchmark numbers, and the evaluation protocols (exact prompts, few-shot examples, preprocessing) are not fully specified, making it impossible to verify the claimed records or compare fairly to prior work.

    Authors: We have expanded the experimental setup and appendix to provide complete evaluation protocols, including the exact prompts, few-shot example selections, and preprocessing pipelines used for every benchmark. Regarding error bars and multiple runs, each full training run of these models requires substantial compute; we therefore report single-run results and have added an explicit limitations paragraph noting this constraint and the consequent absence of statistical significance tests. revision: partial

  3. Referee: [Training Pipeline] Section 3 (training pipeline): the description of the 3-stage training and the image-caption-box alignment objective is high-level; without quantitative isolation of each stage's contribution or details on how the alignment loss interacts with the language-modeling objective, the causal role of these design choices in the final performance remains unclear.

    Authors: We have revised Section 3 to include a more granular description of each training stage, the precise form of the alignment loss, and the weighting schedule used to combine it with the standard language-modeling objective. The new ablation studies mentioned above provide quantitative isolation of each stage's contribution, directly addressing the request for evidence of causal impact. revision: yes
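The weighting schedule the rebuttal describes — blending an alignment loss into the standard language-modeling objective — might look like the following sketch; the linear warm-up shape and the `max_weight` value are assumptions for illustration, not the paper's reported settings:

```python
def combined_loss(lm_loss, box_loss, step, warmup_steps=1000, max_weight=0.5):
    """Blend the language-modeling loss with a box-alignment loss.

    The alignment term is ramped in linearly over warmup_steps so early
    training is dominated by the LM objective, then capped at max_weight.
    (Illustrative schedule, not the authors' actual one.)
    """
    w = max_weight * min(1.0, step / warmup_steps)
    return lm_loss + w * box_loss
```

Reporting the schedule at this level of precision is what lets the new ablations isolate whether the alignment term, rather than the extra data it rides on, drives the grounding gains.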

Circularity Check

0 steps flagged

No circularity: empirical model description and benchmark reporting

Full rationale

The paper describes a sequence of engineering choices (visual receptor, I/O interface, 3-stage training, multilingual corpus, image-caption-box alignment) starting from Qwen-LM, then reports aggregate results on standard visual-centric benchmarks under zero-shot and few-shot settings. No mathematical derivations, first-principles predictions, or equations appear in the provided text. Performance claims are direct empirical comparisons against other models of similar scale; they do not reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations of uniqueness theorems. Training decisions are presented as design choices, not as outputs derived from the target results. This is a standard LVLM release paper whose central claims rest on external benchmark numbers rather than internal circular reductions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of deep learning optimization and data quality plus several design choices whose effectiveness is asserted rather than derived.

free parameters (2)
  • model scale and architecture dimensions
    Specific sizes and layer counts for Qwen-VL variants are chosen as part of the design.
  • 3-stage training hyperparameters
    Learning rates, batch sizes, and stage durations are tuned during development.
axioms (2)
  • domain assumption The visual receptor and input-output interface successfully integrate image features into the language model without catastrophic interference.
    Invoked in the description of endowing Qwen-LM with visual capacity.
  • domain assumption Alignment of image-caption-box tuples produces reliable grounding and text-reading abilities.
    Stated as the method for implementing grounding and text-reading.

pith-pipeline@v0.9.0 · 5534 in / 1361 out tokens · 26377 ms · 2026-05-10T12:44:15.340249+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  2. SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

    cs.NE 2026-04 unverdicted novelty 8.0

    SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

  3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  4. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  5. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  6. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  7. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  8. CATS: Curvature Aware Temporal Selection for efficient long video understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.

  9. UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

  10. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  11. OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

    q-bio.GN 2026-05 unverdicted novelty 7.0

    OmicsLM integrates continuous omics embeddings into LLMs for multi-sample biological reasoning, matching specialized models on profile tasks while outperforming them and general LLMs on language-guided QA over real ex...

  12. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.

  13. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

  14. SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

    cs.AI 2026-04 unverdicted novelty 7.0

    SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...

  15. TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

    cs.CV 2026-04 unverdicted novelty 7.0

    A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.

  16. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  17. Membership Inference Attacks Against Video Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.

  18. QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

    quant-ph 2026-04 unverdicted novelty 7.0

    Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

  19. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  20. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  21. CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

    cs.CL 2026-04 unverdicted novelty 7.0

    CNSL-bench shows current MLLMs perform substantially worse than humans on Chinese sign language tasks with systematic gaps across modalities and articulatory forms.

  22. Probing Visual Planning in Image Editing Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

  23. ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

    cs.SD 2026-04 unverdicted novelty 7.0

    ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.

  24. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  25. Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...

  26. MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

    cs.CL 2026-04 unverdicted novelty 7.0

    MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...

  27. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  28. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  29. Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

    cs.CV 2026-04 conditional novelty 7.0

    Paza is a zero-shot, model-agnostic pipeline that uses behavioral pre-filters on cheap object and pose models to trigger expensive VLMs only when needed, delivering 89.5% precision and 92.8% specificity on a synthesiz...

  30. MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

    cs.AI 2026-04 unverdicted novelty 7.0

    MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.

  31. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  32. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  33. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  34. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  35. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  36. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  37. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  38. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  39. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  40. ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

    cs.CV 2026-04 unverdicted novelty 7.0

    ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.

  41. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  42. Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.

  43. QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    QAPruner introduces a hybrid sensitivity metric that combines group-wise quantization error simulation and outlier intensity with semantic scores to prune visual tokens, yielding 2.24% higher accuracy than naive basel...

  44. Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

  45. SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

    cs.CV 2026-03 conditional novelty 7.0

    SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.

  46. Topo-R1: Detecting Topological Anomalies via Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.

  47. More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

    cs.CV 2026-03 unverdicted novelty 7.0

    Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.

  48. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  49. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  50. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  51. When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

    cs.AI 2026-05 conditional novelty 6.0

    LongAct benchmark reveals top VLMs reach only 59% goal completion and 16% full success on long-horizon household tasks, while HoloMind agent improves results via DAG planner, multimodal spatial memory, episodic memory...

  52. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  53. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  54. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  55. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  56. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  57. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  58. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  59. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  60. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 161 Pith papers

  1. [1]

    Removing pairs whose image has too large an aspect ratio

  2. [2]

    Removing pairs whose image is too small

  3. [3]

    Removing pairs with a low CLIP image-text similarity score (dataset-specific threshold)

  4. [4]

    Removing pairs with text containing non-English or non-Chinese characters

  5. [5]

    Removing pairs with text containing emoji characters

  6. [6]

    Removing pairs whose text is too short or too long

  7. [7]

    Cleaning HTML tags out of the text

  8. [8]

    If there is more than one text matching the same image, we select the longest one

    Cleaning the text with certain irregular patterns. For academic caption datasets, we remove pairs whose text contains the special tags in CC12M (Changpinyo et al., 2021) and SBU (Ordonez et al., 2011). If there is more than one text matching the same image, we select the longest one. A.2 VQA: For the VQAv2 (Goyal et al., 2017) dataset, we select the answer annotation ba...
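Taken together, entries 1–8 describe a single filtering pass over image-caption pairs. A minimal Python sketch of that pass — the thresholds, the emoji and HTML regexes, and the `clip_score` input are illustrative assumptions, not values published in the paper:

```python
import re

# Assumed thresholds -- the paper does not publish exact values.
MAX_ASPECT_RATIO = 3.0
MIN_IMAGE_SIDE = 64          # pixels
MIN_CLIP_SCORE = 0.25        # "dataset-specific" in the paper
MIN_TEXT_LEN, MAX_TEXT_LEN = 5, 512

EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
HTML_TAG_RE = re.compile(r"<[^>]+>")

def is_en_or_zh(text):
    """Keep only basic-Latin (English) and CJK (Chinese) characters."""
    for ch in text:
        cp = ord(ch)
        if cp < 0x80 or 0x4E00 <= cp <= 0x9FFF or 0x3000 <= cp <= 0x303F or 0xFF00 <= cp <= 0xFFEF:
            continue
        return False
    return True

def clean_pair(width, height, text, clip_score):
    """Return a cleaned caption, or None if the pair should be dropped."""
    if max(width, height) / max(1, min(width, height)) > MAX_ASPECT_RATIO:
        return None                               # 1. aspect ratio too large
    if min(width, height) < MIN_IMAGE_SIDE:
        return None                               # 2. image too small
    if clip_score < MIN_CLIP_SCORE:
        return None                               # 3. low image-text similarity
    text = HTML_TAG_RE.sub(" ", text).strip()     # 7. strip HTML tags
    if EMOJI_RE.search(text):
        return None                               # 5. emoji characters
    if not is_en_or_zh(text):
        return None                               # 4. non-English/Chinese characters
    if not (MIN_TEXT_LEN <= len(text) <= MAX_TEXT_LEN):
        return None                               # 6. text too short or too long
    return text

def dedupe_longest(pairs):
    """8. If several texts match the same image, keep the longest one."""
    best = {}
    for image_id, text in pairs:
        if text is not None and len(text) > len(best.get(image_id, "")):
            best[image_id] = text
    return best
```

The ordering matters only for efficiency (cheap geometric checks first); the paper lists the rules without specifying an order.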

  9. [9]

    to get the rendering results of each page in a PDF file as well as all the text annotations with their bounding boxes

  10. [10]

    Extracting all texts and their bounding boxes for each page

    Extracting all texts and their bounding boxes for each page. (Figure 5 visualizes the Grounding and OCR data used for training Qwen-VL.)

  11. [14]

    Latin Extended-A

    Removing images containing Unicode characters in the “Latin Extended-A” and “Latin Extended-B” blocks

  12. [15]

    Private Use Area (PUA)

    Removing images containing Unicode characters in the “Private Use Area (PUA)” block. For all HTML web pages we collected, we pre-process them in a similar way to the PDF data, but we use Puppeteer (Google, 2023) instead of PyMuPDF to render these HTML pages and get the ground-truth annotation. We follow the steps below to pre-process...

  13. [16]

    Extracting all texts for each webpage

  14. [17]

    Rendering each page and saving it as an image file

  15. [18]

    Removing images that are too small

  16. [19]

    Removing images with too many or too few characters

  17. [20]

    Private Use Area (PUA)

    Removing images containing Unicode characters in the “Private Use Area (PUA)” block. B Data Format Details of Training: B.1 Data Format of Multi-Task Pre-training. We visualize the Multi-Task Pre-training data format in Box B.1. The Box contains all 7 tasks, with the black-colored text as the prefix sequence (no loss computed) and the blue-colored text as the ground t...
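Entries 11–17 filter rendered text images by character count and by Unicode block. The block ranges below (Latin Extended-A U+0100–U+017F, Latin Extended-B U+0180–U+024F, Private Use Area U+E000–U+F8FF) come from the Unicode standard; the character-count bounds are illustrative assumptions, since the paper does not give exact numbers:

```python
# Standard Unicode block ranges (per the Unicode standard).
BLOCKED_RANGES = [
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0xE000, 0xF8FF),  # Private Use Area (PUA)
]

# Assumed character-count bounds; not specified in the paper.
MIN_CHARS, MAX_CHARS = 1, 1024

def contains_blocked_chars(text):
    """True if any character falls inside a filtered Unicode block."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in BLOCKED_RANGES)

def keep_ocr_page(text):
    """Apply the character-count and Unicode-block filters to one page's text."""
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False  # too many or too few characters
    return not contains_blocked_chars(text)
```

PUA characters are a useful rejection signal because fonts map arbitrary private glyphs there, so the rendered image and the extracted "ground truth" text can silently disagree.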