hub Mixed citations

Monet: Reasoning in latent visual space beyond images and language

· 2025 · arXiv 2511.21395

Mixed citation behavior. Most common role is background (64%).

20 Pith papers citing it

Background 64% of classified citations

read on arXiv browse 20 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 4

citation-polarity summary

background 7 baseline 4

representative citing papers

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.

Hybrid Latent Reasoning with Decoupled Policy Optimization

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

cs.CL · 2026-01-11 · unverdicted · novelty 7.0

Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

Leveraging Latent Visual Reasoning in Silence

cs.CV · 2026-05-18 · conditional · novelty 6.0

Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.

Latent Action Control for Reasoning-Guided Unified Image Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

cs.CV · 2026-02-04 · unverdicted · novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

What's Holding Back Latent Visual Reasoning?

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.

MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five other benchmarks.

Semantic-Enriched Latent Visual Reasoning

cs.CV · 2026-05-19

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

cs.CV · 2026-05-12 · 3 refs

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06

citing papers explorer

Showing 20 of 20 citing papers.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 44
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling cs.CV · 2026-05-13 · unverdicted · none · ref 46
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs cs.CV · 2026-05-12 · unverdicted · none · ref 12
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization cs.CV · 2026-05-11 · unverdicted · none · ref 26 · 2 links
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CV · 2026-04-22 · unverdicted · none · ref 32
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 24
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning cs.CL · 2026-01-11 · unverdicted · none · ref 31
Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
Leveraging Latent Visual Reasoning in Silence cs.CV · 2026-05-18 · conditional · none · ref 32
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
Latent Action Control for Reasoning-Guided Unified Image Generation cs.CV · 2026-05-16 · unverdicted · none · ref 35
Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 13
RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs cs.LG · 2026-05-04 · unverdicted · none · ref 31
Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning cs.CV · 2026-04-12 · unverdicted · none · ref 60 · 3 links
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 61
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 57
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model cs.CV · 2026-02-04 · unverdicted · none · ref 29
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
What's Holding Back Latent Visual Reasoning? cs.CV · 2026-05-18 · unverdicted · none · ref 18
Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering cs.CV · 2026-04-10 · unverdicted · none · ref 18
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five other benchmarks.
Semantic-Enriched Latent Visual Reasoning cs.CV · 2026-05-19 · unreviewed · ref 16
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models cs.CV · 2026-05-12 · unreviewed · ref 9 · 3 links
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unreviewed · ref 126

Monet: Reasoning in latent visual space beyond images and language

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer