super hub Mixed citations

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

author=, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution · 2024 · cs.CV · arXiv 2409.12191

Mixed citation behavior. Most common role is background (60%).

503 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 503 citing papers more from author= arXiv PDF

abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL .

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 102 baseline 28 method 26 dataset 6 other 2

citation-polarity summary

background 98 baseline 28 use method 26 use dataset 6 unclear 5 support 1

claims ledger

abstract We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion

authors

author= Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

co-cited works

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.

SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

cs.NE · 2026-04-13 · unverdicted · novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

cs.CV · 2026-05-19 · conditional · novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.

Modality-Decoupled Online Recursive Editing

cs.LG · 2026-05-19 · conditional · novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.

citing papers explorer

Showing 50 of 408 citing papers after filters.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts cs.CV · 2026-05-12 · unverdicted · none · ref 56 · 2 links · internal anchor
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views cs.CV · 2026-05-08 · unverdicted · none · ref 47 · internal anchor
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild cs.CV · 2026-05-07 · unverdicted · none · ref 57 · internal anchor
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 33 · internal anchor
CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression cs.NE · 2026-04-13 · unverdicted · none · ref 55 · internal anchor
SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 146 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
A document is worth a structured record: Principled inductive bias design for document recognition cs.CV · 2025-07-11 · unverdicted · none · ref 26 · internal anchor
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 90 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models cs.CV · 2026-05-29 · unverdicted · none · ref 44 · internal anchor
YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 77 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations cs.CV · 2026-05-28 · unverdicted · none · ref 49 · internal anchor
CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion cs.AI · 2026-05-28 · unverdicted · none · ref 85 · internal anchor
Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation cs.RO · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 68 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 47 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI · 2026-05-18 · unverdicted · none · ref 28 · 2 links · internal anchor
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation cs.CR · 2026-05-15 · unverdicted · none · ref 59 · internal anchor
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture cs.CV · 2026-05-14 · unverdicted · none · ref 35 · internal anchor
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling cs.CV · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding cs.CV · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings cs.IR · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs cs.CV · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction cs.CV · 2026-05-12 · unverdicted · none · ref 31 · internal anchor
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 53 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision cs.CV · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
A hierarchical VQA system aggregates model answers into weighted risk scores that produce four-category safety event maps for urban navigation, backed by a new 20-city dataset where generative MLLMs like Qwen-VL outperform classification models.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning cs.AI · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video cs.CV · 2026-05-11 · unverdicted · none · ref 44 · internal anchor
StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 34 · internal anchor
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models cs.RO · 2026-05-09 · unverdicted · none · ref 37 · internal anchor
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning cs.CV · 2026-05-08 · unverdicted · none · ref 33 · internal anchor
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 36 · 2 links · internal anchor
PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks with a new 75K-pair PolarVQA benchmark.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs cs.CV · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 38 · 2 links · internal anchor
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation cs.CL · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
Recorruption arises from visual attention suppression and positional bias in multimodal RAG; BAIR mitigates it via bottleneck attention intervention at inference time.
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions cs.CV · 2026-04-30 · unverdicted · none · ref 31 · internal anchor
TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 29 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
World2Minecraft: Occupancy-Driven Simulated Scenes Construction cs.CV · 2026-04-30 · unverdicted · none · ref 10 · internal anchor
World2Minecraft turns real scenes into Minecraft worlds via occupancy prediction and releases a large indoor occupancy dataset to improve such models.
Interactive Episodic Memory with User Feedback cs.CV · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
Don't Pause! Every prediction matters in a streaming video cs.CV · 2026-04-27 · unverdicted · none · ref 57 · internal anchor
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CV · 2026-04-27 · unverdicted · none · ref 17 · internal anchor
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines cs.AI · 2026-04-26 · unverdicted · none · ref 59 · internal anchor
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 47 · internal anchor
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning cs.CV · 2026-04-24 · unverdicted · none · ref 33 · internal anchor
DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus +1 F1 on M3DocRAG.

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer