Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Dexin Wang; Guanjun Jiang; Hao Li; Lei Lv; Li Wang; Mengyu Zhou; Pascal Poupart; Qi Zhao; Xiaoxi Jiang; Yanting Miao

arXiv: 2605.12374 · v3 · pith:J2A5EGLFnew · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.LG

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

Pith reviewed 2026-05-21 07:57 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords visual latent reasoning · multimodal large language models · granular alignment · feature alignment · PCA latent head · capacity-guided supervision · multimodal visual reasoning · latent modeling

The pith

A three-level granular alignment fixes feature mismatch to enable stable visual latent reasoning in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing approaches to visual latent reasoning in multimodal large language models suffer from instability because decoder hidden states have different statistical properties than the input embeddings the model expects. The paper diagnoses this as a norm regime mismatch in pre-norm architectures and proposes to correct it with three targeted alignments. Feature-level alignment uses a PCA-based head to transform decoder outputs into compatible latents. Context-level alignment adds auxiliary visual supervision for grounding, while capacity-guided alignment focuses training on examples where the base model performs poorly. If successful, this would allow models to generate useful intermediate visual evidence internally, improving perception and reasoning without relying on external tools.
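The norm regime mismatch described above can be illustrated with a small measurement sketch. It compares the mean L2 norm of two sets of vectors. The arrays are synthetic stand-ins with made-up scales, not measurements from the paper or from Qwen2.5-VL, and the helper name `mean_l2_norm` is hypothetical.

```python
import numpy as np

def mean_l2_norm(x: np.ndarray) -> float:
    """Average L2 norm over rows (one row per token vector)."""
    return float(np.linalg.norm(x, axis=-1).mean())

rng = np.random.default_rng(0)
dim = 64
# Synthetic stand-ins (assumption): small-scale input embeddings and
# much larger-scale last-layer decoder hidden states.
input_embeddings = rng.normal(scale=0.02, size=(512, dim))
hidden_states = rng.normal(scale=2.0, size=(512, dim))

ratio = mean_l2_norm(hidden_states) / mean_l2_norm(input_embeddings)
print(f"hidden/input mean-norm ratio: {ratio:.1f}")
```

On a real model, the same ratio would be computed from the embedding layer's outputs and the final decoder hidden states over a shared batch of tokens.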

Core claim

We identify evidence for a feature-space mismatch that can contribute to instability in existing output-as-input latent methods: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent t

What carries the argument

Granular Alignment Paradigm (GAP) that performs feature-level PCA mapping of decoder outputs to input embeddings, context-level auxiliary visual grounding, and capacity-guided selective supervision on hard examples.
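One possible reading of the feature-level step is sketched here, since the summary does not give the construction of the PCA-aligned latent head. The sketch fits PCA on input embeddings, then maps decoder states to bounded PCA coefficients and reconstructs a vector in the input space. The functions `fit_pca` and `latent_head`, the tanh bound, and the random projection `w` are all assumptions for illustration, not the authors' design.

```python
import numpy as np

def fit_pca(embeddings, k):
    """Return mean, top-k principal directions, and per-direction std."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, s, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    std = s[:k] / np.sqrt(len(embeddings) - 1)
    return mean, vt[:k], std

def latent_head(hidden, w, mean, components, std):
    """Map decoder states (n, d_h) to input-space latents (n, d_in)."""
    # tanh bounds each coefficient to one standard deviation (an assumption).
    coeffs = np.tanh(hidden @ w) * std
    return mean + coeffs @ components

rng = np.random.default_rng(0)
d_in, d_h, k = 32, 48, 8
input_embeddings = rng.normal(scale=0.02, size=(256, d_in))  # synthetic
hidden = rng.normal(scale=2.0, size=(4, d_h))                # synthetic
w = rng.normal(scale=0.1, size=(d_h, k))  # stands in for a learned projection

mean, components, std = fit_pca(input_embeddings, k)
latents = latent_head(hidden, w, mean, components, std)
print(latents.shape, np.linalg.norm(latents, axis=-1))
```

Under these assumptions, every output lies in the affine subspace spanned by the top principal directions of the input embeddings, so its scale is tied to the input statistics rather than to the decoder's norm regime.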

If this is right

The aligned model reaches the highest mean aggregate performance on perception and reasoning among supervised variants tested.
Inference-time intervention probing shows generated latents carry task-relevant visual signal rather than just occupying extra positions.
Capacity-guided alignment focuses supervision where the base model struggles, contributing to overall stability.
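The capacity-guided idea in the last point can be sketched as a selection mask over training examples. This is a minimal illustration assuming a per-example pass rate for the base model and a fixed threshold. The names `capacity_mask` and `masked_mean`, the threshold of 0.5, and all numbers are hypothetical and do not come from the paper.

```python
def capacity_mask(base_accuracy, threshold=0.5):
    """True where the base model struggles (pass rate below threshold)."""
    return [acc < threshold for acc in base_accuracy]

def masked_mean(losses, mask):
    """Average the losses of selected examples; 0.0 if none are selected."""
    selected = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(selected) / len(selected) if selected else 0.0

# Hypothetical per-example pass rates of the base model and latent losses.
base_accuracy = [1.0, 0.25, 0.75, 0.0]
latent_losses = [0.25, 0.5, 0.75, 1.5]

mask = capacity_mask(base_accuracy)
print(mask)                              # [False, True, False, True]
print(masked_mean(latent_losses, mask))  # 1.0
```

Only the second and fourth examples receive latent supervision here, so the latent loss is averaged over the cases the base model fails on.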

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-level approach could be tested on other pre-norm multimodal models to check for similar stability gains.
The lightweight PCA head suggests a general technique for bridging internal decoder states with input spaces across architectures.
Selective supervision on difficult examples may prove useful in other training setups to improve efficiency without full data coverage.

Load-bearing premise

The norm mismatch between decoder hidden states and input embeddings is the main source of instability in latent feedback, and the three proposed alignments correct it without introducing new instabilities.

What would settle it

Evaluating the same Qwen2.5-VL 7B setup and finding no improvement in mean aggregate perception and reasoning scores over other supervised variants, or finding that inference-time probing shows no task-relevant visual signal beyond added token slots.
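The probing criterion above can be phrased as an accuracy difference between two runs that use the same number of latent positions. This is a sketch under assumptions: the control run replaces generated latents with an uninformative filler at the same positions, and the predictions are toy values rather than results from the paper. The function names are hypothetical.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def intervention_delta(preds_with_latents, preds_with_control, labels):
    """Accuracy gained by real latents over a same-length control."""
    return accuracy(preds_with_latents, labels) - accuracy(preds_with_control, labels)

# Toy data (assumption): answers with generated latents vs. with the
# latents swapped for a filler that keeps the token slots unchanged.
labels = ["A", "B", "C", "D"]
preds_with_latents = ["A", "B", "C", "A"]
preds_with_control = ["A", "C", "C", "A"]

delta = intervention_delta(preds_with_latents, preds_with_control, labels)
print(delta)  # 0.25
```

A delta near zero would indicate that the latents contribute little beyond the extra positions, which is the failure case named above.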

Original abstract

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (xie2025mhc; li2026siamesenorm; team2026attention). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GAP offers a three-level alignment fix for norm mismatches in MLLM visual latent reasoning and reports top results on Qwen2.5-VL 7B, but lacks a minimal ablation to confirm the mismatch is the actual driver.

The main takeaway is that the authors diagnose a norm mismatch in visual latent feedback for MLLMs and propose GAP as a three-level fix: a PCA-aligned head for features, auxiliary visual supervision for context, and selective supervision based on where the model struggles. They show this leads to the best aggregate performance on Qwen2.5-VL 7B among supervised variants, with probing indicating the latents provide relevant visual signals. This combination appears new as a response to the identified mismatch.

The paper does well in laying out the problem with references to norm studies and then offering a structured way to address it through these alignments. The inference-time probing is a solid way to check if the latents are doing real work.

The weaker part is the lack of a direct test for the mismatch being the primary cause. As noted in the stress test, there's no experiment applying just a simple norm adjustment to an existing baseline to see if that resolves the instability without the full set of additions. This means the gains could stem from the extra supervision rather than specifically closing the norm gap. Details on exact baselines and stats are also thin in the summary.

This kind of work would interest researchers in multimodal LLMs focused on internal reasoning mechanisms. Readers dealing with latent token methods would get practical ideas from it. I would recommend sending it for peer review. The core idea is sound enough and the method is specific enough to benefit from detailed feedback.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAP, a Granular Alignment Paradigm for visual latent reasoning in MLLMs. It diagnoses a feature-space mismatch between decoder hidden states (in a different norm regime) and input embeddings in pre-norm models as a source of instability in existing output-as-input latent methods. GAP introduces three alignments: feature-level via a PCA-aligned latent head to map decoder outputs to input-compatible latents, context-level grounding with auxiliary visual supervision, and capacity-guided selective supervision on examples where the base model struggles. On Qwen2.5-VL 7B, the resulting model reports the best mean aggregate perception and reasoning performance among supervised variants, with inference-time probing suggesting the latents supply task-relevant visual signal beyond added token capacity.

Significance. If the reported gains and probing results are confirmed with full quantitative controls, the granular multi-level alignment strategy would provide a practical and interpretable route to more stable visual latent reasoning in MLLMs, reducing reliance on external tools. The work usefully extends cited observations on norm regimes into a concrete architecture with selective supervision and auxiliary targets; the probing experiment is a clear strength for establishing that generated latents carry meaningful visual information.

major comments (2)

The central diagnosis that feature-space mismatch is a primary contributor to instability lacks a direct causal test. No ablation is described that applies a minimal norm correction (e.g., layer-norm or scaling) to an existing output-as-input baseline and measures whether the performance gap or instability closes; without this, it remains possible that observed gains stem from increased supervision volume or selective masking rather than resolution of the cited mismatch.
Abstract and experimental claims: performance gains and probing results are asserted at a high level but supply no quantitative details on baselines, statistical significance, data splits, or ablation controls. This makes the central claim that GAP achieves the best mean aggregate performance among supervised variants difficult to evaluate from the provided text.
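The minimal norm correction requested in the first major comment could be as simple as rescaling each decoder state to a target norm before feeding it back. This sketch shows only the scaling variant. The function `rescale_to_norm` and the synthetic arrays are assumptions for illustration, and the paper does not describe this ablation.

```python
import numpy as np

def rescale_to_norm(hidden, target_norm, eps=1e-8):
    """Scale each row of `hidden` so its L2 norm is approximately `target_norm`."""
    norms = np.linalg.norm(hidden, axis=-1, keepdims=True)
    return hidden * (target_norm / (norms + eps))

rng = np.random.default_rng(0)
input_embeddings = rng.normal(scale=0.02, size=(512, 64))  # synthetic
hidden = rng.normal(scale=2.0, size=(8, 64))               # synthetic

# Use the mean input-embedding norm as the target (one possible choice).
target = np.linalg.norm(input_embeddings, axis=-1).mean()
corrected = rescale_to_norm(hidden, target)
print(np.linalg.norm(corrected, axis=-1))
```

Rescaling changes only the magnitude of each vector and keeps its direction, so it isolates the norm gap from the other components of GAP.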

minor comments (2)

Notation for the three alignment mechanisms should be introduced with explicit equations or pseudocode in the method section to clarify how the PCA-aligned head, context targets, and capacity mask interact during training.
Ensure all cited works (e.g., xie2025mhc) are fully referenced with arXiv identifiers or DOIs for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications drawn directly from our manuscript and indicate the revisions we will incorporate.

Referee: The central diagnosis that feature-space mismatch is a primary contributor to instability lacks a direct causal test. No ablation is described that applies a minimal norm correction (e.g., layer-norm or scaling) to an existing output-as-input baseline and measures whether the performance gap or instability closes; without this, it remains possible that observed gains stem from increased supervision volume or selective masking rather than resolution of the cited mismatch.

Authors: We appreciate the referee's point on establishing causality. Our diagnosis is grounded in the cited observations on norm regimes in pre-norm MLLMs (Xie et al., Li et al., and related attention analyses), which show decoder hidden states occupy a different norm range than input embeddings. The PCA-aligned latent head in GAP is explicitly constructed to project decoder outputs into the input-compatible space, providing a targeted resolution rather than a generic correction. Nevertheless, we agree that an isolated minimal-norm ablation on a pure output-as-input baseline would strengthen the causal claim. In the revised manuscript we will add this ablation, reporting stability metrics and performance deltas when only layer-norm or scaling is applied to the baseline before comparing against full GAP. revision: yes
Referee: Abstract and experimental claims: performance gains and probing results are asserted at a high level but supply no quantitative details on baselines, statistical significance, data splits, or ablation controls. This makes the central claim that GAP achieves the best mean aggregate performance among supervised variants difficult to evaluate from the provided text.

Authors: The full manuscript contains quantitative tables (Section 4) reporting per-task and aggregate scores for Qwen2.5-VL 7B against multiple supervised baselines, including exact mean perception/reasoning aggregates, data-split descriptions, and ablation results that isolate each alignment level. The probing experiment (Section 4.4) quantifies task-relevant signal via intervention accuracy deltas. We acknowledge the abstract remains high-level; in revision we will insert concise quantitative highlights (e.g., aggregate scores and key baseline comparisons) while preserving length limits, and we will add statistical significance markers and explicit data-split details to the experimental narrative for easier evaluation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external citations and novel empirical alignments

The paper's chain begins with a mismatch diagnosis supported by external citations (xie2025mhc, li2026siamesenorm, team2026attention) on norm regimes in pre-norm MLLMs, then introduces three new alignment mechanisms (PCA-aligned latent head, context-level grounding, capacity-guided selection) as additions rather than reductions of fitted quantities. Performance results on Qwen2.5-VL 7B are reported as outcomes of supervised training and probing, without any step where a prediction reduces by construction to an input parameter or self-citation chain. No self-definitional loops, fitted-input renamings, or load-bearing self-citations appear in the provided text; the argument remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The PCA-aligned latent head implies at least one learned projection matrix whose fitting details are not provided.

pith-pipeline@v0.9.0 · 5824 in / 1193 out tokens · 34652 ms · 2026-05-21T07:57:54.651403+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We identify evidence for a feature-space mismatch... decoder hidden states... occupy a substantially different norm regime from the input embeddings... PCA-aligned latent head maps decoder outputs into input-compatible visual latents

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: difficulty-aware latent assignment... assigns latent supervision selectively to examples where the base MLLM struggles

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 18 internal anchors

[1] Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. arXiv preprint arXiv:2501.07542.
[2] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs. arXiv preprint arXiv:2510.24514, 2025.
[3] Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.
[4] OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267.
[5] Qwen3-VL Technical Report. 2025.
[6] Zoomeye: Enhancing Multimodal LLMs with Human-like Zooming Capabilities through Tree-based Image Exploration. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[7] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning. arXiv preprint arXiv:2505.14362.
[8] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning. arXiv preprint arXiv:2505.17022.
[9] Monet: Reasoning in Latent Visual Space Beyond Images and Language. arXiv preprint arXiv:2511.21395.
[10] From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step. arXiv preprint arXiv:2405.14838.
[11] Training Large Language Models to Reason in a Continuous Latent Space. arXiv preprint arXiv:2412.06769.
[12] Latent Visual Reasoning. arXiv preprint arXiv:2509.24251.
[13] Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens. arXiv preprint arXiv:2506.17218.
[14] Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens. arXiv preprint arXiv:2511.19418.
[15] Latent Implicit Visual Reasoning. arXiv preprint arXiv:2512.21218, 2025.
[16] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning. arXiv preprint arXiv:2601.14750.
[17] Vision-aligned Latent Reasoning for Multi-modal Large Language Model. arXiv preprint arXiv:2602.04476.
[18] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning. arXiv preprint arXiv:2601.10129.
[19] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs. arXiv preprint arXiv:2602.20980.
[20] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning. arXiv preprint arXiv:2604.10500.
[21] Qwen2.5-VL Technical Report. 2025.
[22] Attention Residuals. arXiv preprint arXiv:2603.15031.
[23] SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm. arXiv preprint arXiv:2602.08064.
[24] mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880.
[25] Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers. arXiv preprint arXiv:2506.23918.
[26] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space. arXiv preprint arXiv:2512.12623.
[27] Gemma 4 Model Card. 2026.
[28] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265.
[29] Pyvision: Agentic Vision with Dynamic Tooling. arXiv preprint arXiv:2507.07998.
[30] On Layer Normalization in the Transformer Architecture. International Conference on Machine Learning, 2020.
