TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Chaitanya Ahuja; Fan Xia; Hanchao Yu; Jiangfan Zhang; Jianpeng Cheng; Jun Xiao; Qi Guo; Shaodan Zhai; Shlok Kumar Mishra; Wentao Bao

arxiv: 2605.16638 · v1 · pith:FYDMEK4Tnew · submitted 2026-05-15 · 💻 cs.AI

Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

Pith reviewed 2026-05-20 17:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords: multimodal embeddings, chain-of-thought, latent variables, contrastive learning, efficient inference, universal multimodal embedding, generative models

The pith

Latent think tokens replace explicit chain-of-thought reasoning in multimodal embeddings while preserving performance at constant inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to obtain the performance gains of chain-of-thought reasoning for universal multimodal embeddings without paying the cost of generating explicit reasoning traces at inference time. It introduces latent think tokens that are trained first with a generation loss to mimic CoT traces and then paired with contrastive loss to produce the final embedding token. A reader would care because explicit CoT improves representation quality on multimodal benchmarks but adds prohibitive generation steps; the latent approach keeps the quality gain while making inference cost independent of reasoning length. The resulting TTE-Flash-2B model beats its explicit-CoT baseline on MMEB-v2 and yields think tokens that remain readable as text and visualizable.

Core claim

Treating latent think tokens as hidden variables whose observed counterparts are explicit CoT traces, the authors jointly optimize the tokens via a CoT generation loss followed by contrastive loss on the embedding token extracted from the same LLM backbone. This produces reasoning-aware multimodal representations whose inference cost remains fixed regardless of the number of think tokens. TTE-Flash-2B outperforms the explicit-CoT version on MMEB-v2, the latent tokens prove interpretable in both textual and visual form, and zero-shot video results show performance scaling with token count.
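One way to write down the two-part objective described above, offered as a hedged sketch rather than the paper's own notation: let $x$ be the multimodal query, $z_{1:K}$ the $K$ latent think tokens, $c_{1:T}$ the explicit CoT trace used as supervision, and $e$ the embedding-token state with positive $e^{+}$ and candidates $e_j$. The symbols, the conditioning of the generation term on both $z_{1:K}$ and $x$, the InfoNCE form of the contrastive term, the temperature $\tau$, and the weight $\lambda$ are all assumptions introduced here for illustration; the summary only states that a CoT generation loss and a contrastive loss are used.

```latex
\begin{align}
\mathcal{L}_{\text{gen}} &= -\sum_{t=1}^{T} \log p_\theta\!\left(c_t \mid c_{<t},\, z_{1:K},\, x\right) \\
\mathcal{L}_{\text{con}} &= -\log \frac{\exp\!\left(\operatorname{sim}(e, e^{+})/\tau\right)}{\sum_{j} \exp\!\left(\operatorname{sim}(e, e_j)/\tau\right)} \\
\mathcal{L} &= \mathcal{L}_{\text{gen}} + \lambda\, \mathcal{L}_{\text{con}}
\end{align}
```

Under this reading, only $\mathcal{L}_{\text{gen}}$ needs the explicit trace $c_{1:T}$, which is why the trace would be required during training but not at inference.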

What carries the argument

Latent think tokens that are first optimized with CoT generation loss and then used to form embedding tokens via contrastive loss inside a shared LLM backbone.

Load-bearing premise

Jointly optimizing think tokens via CoT generation loss and embedding tokens via contrastive loss on the same LLM backbone will retain the representation benefits of explicit reasoning without requiring explicit generation at test time.

What would settle it

An experiment in which TTE-Flash-2B scores below the explicit-CoT baseline on MMEB-v2 or in which the latent think tokens decode into incoherent or non-reasoning text would disprove the central claim.

Original abstract

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: (1) how think and embedding tokens should be extracted from the same LLM backbone, and (2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, motivating a pilot study of adaptive think budget allocation based on task requirements.
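The constant-cost claim in the abstract can be illustrated by counting sequential forward passes. This is a minimal sketch under stated assumptions: explicit CoT is modeled as one prefill pass over the query plus one autoregressive decode step per generated reasoning token, while latent think tokens are modeled as fixed placeholders processed in the same prefill pass as the query. The function names are hypothetical and do not come from the paper.

```python
def explicit_cot_passes(num_reasoning_tokens: int) -> int:
    """Sequential forward passes for explicit CoT (assumed cost model).

    One prefill pass over the query, then one decode step per generated
    reasoning token.
    """
    if num_reasoning_tokens < 0:
        raise ValueError("num_reasoning_tokens must be non-negative")
    return 1 + num_reasoning_tokens


def latent_think_passes(num_think_tokens: int) -> int:
    """Sequential forward passes with latent think tokens (assumed cost model).

    The think tokens are fixed placeholders in the input, so a single
    prefill pass covers the query, the think tokens, and the embedding token.
    """
    if num_think_tokens < 0:
        raise ValueError("num_think_tokens must be non-negative")
    return 1


if __name__ == "__main__":
    for length in (32, 128, 512):
        print(length, explicit_cot_passes(length), latent_think_passes(16))
```

Note that this counts sequential passes only. The work done inside the single prefill pass still grows with the number of think tokens, since the input sequence is longer.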

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper replaces explicit CoT with latent think tokens trained jointly via generation and contrastive losses on one backbone, claiming constant-cost gains on MMEB-v2, but the abstract leaves open whether those tokens actually encode reasoning.

The main point is that TTE-Flash trains latent think tokens with a CoT generation loss and then uses them to condition embedding tokens under contrastive loss, all inside the same LLM. This is meant to deliver reasoning-aware multimodal embeddings without paying for explicit generation at inference time. TTE-Flash-2B reportedly beats the explicit-CoT baseline on MMEB-v2 and shows scaling with more think tokens across 15 video datasets, plus some textual and visual interpretability of the latent tokens themselves.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TTE-Flash, which replaces explicit Chain-of-Thought (CoT) reasoning with latent think tokens in multimodal embedding models. Think tokens are optimized via a CoT generation loss and embedding tokens via a contrastive loss on the same LLM backbone to produce reasoning-aware representations at constant inference cost. TTE-Flash-2B is reported to outperform its explicit-CoT counterpart on MMEB-v2, with interpretable latent think tokens (textually and visually), scaling behavior as the number of think tokens increases on 15 video datasets, and a pilot study on adaptive think budget allocation.

Significance. If the performance claims hold under rigorous scrutiny, the work could meaningfully advance efficient multimodal representation learning by internalizing reasoning steps into latent variables without test-time generation overhead. The dual-task training design and scaling observations with variable think-token budgets are potentially impactful for adaptive computation in multimodal systems. The reported interpretability of latent tokens is a positive aspect that could aid future analysis of reasoning in embeddings.

major comments (2)

[Abstract] Abstract: the outperformance claim on MMEB-v2 and scaling results on 15 video datasets are presented without any description of baselines, statistical significance testing, data splits, or evaluation protocols. These omissions are load-bearing because the central claim is that latent think tokens match or exceed explicit-CoT performance at constant cost.
[Training Methodology] Training of the two dependent tasks: the manuscript does not specify whether gradients from the contrastive loss on embedding tokens back-propagate through the think tokens or whether an auxiliary reconstruction term maintains alignment with reasoning traces once explicit generation is removed. This detail directly affects whether the latent tokens internalize reasoning or merely serve as training artifacts.

minor comments (2)

[Architectural Designs] The two architectural design questions (token extraction from the LLM backbone and dependent-task training) would benefit from an explicit diagram or pseudocode showing token flow and loss application.
[Notation and Terminology] Notation for 'think tokens' versus 'embedding tokens' should be used consistently to prevent reader confusion across sections.
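As a rough illustration of the kind of token-flow pseudocode the first minor comment asks for, the sketch below lays out one input sequence and labels which loss each position would feed. It assumes the think tokens sit between the query and a single embedding token, following the think-then-embed ordering in the title. The placeholder strings `<think>` and `<emb>` and both function names are hypothetical, not taken from the paper.

```python
from typing import List

THINK = "<think>"  # hypothetical placeholder for one latent think token
EMB = "<emb>"      # hypothetical placeholder for the embedding token


def build_sequence(query_tokens: List[str], num_think: int) -> List[str]:
    """Lay out the input as: query tokens, then think tokens, then the embedding token."""
    if num_think < 0:
        raise ValueError("num_think must be non-negative")
    return list(query_tokens) + [THINK] * num_think + [EMB]


def loss_roles(sequence: List[str]) -> List[str]:
    """Label each position with the loss its hidden state is assumed to feed."""
    roles = []
    for token in sequence:
        if token == THINK:
            roles.append("cot_generation")  # supervised by the explicit CoT trace
        elif token == EMB:
            roles.append("contrastive")     # compared against target embeddings
        else:
            roles.append("none")            # query tokens carry no loss here
    return roles


if __name__ == "__main__":
    seq = build_sequence(["<image>", "what", "is", "shown"], num_think=3)
    for token, role in zip(seq, loss_roles(seq)):
        print(f"{token:10s} {role}")
```

With a causal attention mask, this ordering lets the embedding token attend to both the query and the think tokens in one pass, which is the dependency between the two tasks that the referee asks the authors to make explicit.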

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us identify areas where the manuscript can be clarified and strengthened. We address each major comment below and have revised the manuscript to incorporate the suggested improvements while preserving the core contributions of TTE-Flash.

Referee: [Abstract] Abstract: the outperformance claim on MMEB-v2 and scaling results on 15 video datasets are presented without any description of baselines, statistical significance testing, data splits, or evaluation protocols. These omissions are load-bearing because the central claim is that latent think tokens match or exceed explicit-CoT performance at constant cost.

Authors: We agree that the abstract would benefit from additional context to better support the central performance claims. In the revised manuscript, we have expanded the abstract to briefly reference the explicit-CoT counterpart as the primary baseline, note the use of standard MMEB-v2 evaluation protocols, and indicate that scaling results are reported across 15 established video datasets. Full details on data splits, evaluation metrics, and any statistical testing (including confidence intervals where computed) remain in Section 4, with a cross-reference added to the abstract. This revision addresses the concern without exceeding typical abstract length constraints.

revision: yes

Referee: [Training Methodology] Training of the two dependent tasks: the manuscript does not specify whether gradients from the contrastive loss on embedding tokens back-propagate through the think tokens or whether an auxiliary reconstruction term maintains alignment with reasoning traces once explicit generation is removed. This detail directly affects whether the latent tokens internalize reasoning or merely serve as training artifacts.

Authors: We appreciate this observation on the training dynamics. The original manuscript describes the sequential optimization of think tokens via CoT generation loss followed by embedding tokens via contrastive loss but omits explicit gradient flow details. In our implementation, gradients from the contrastive loss do back-propagate through the think tokens, allowing them to adapt and internalize reasoning information. No auxiliary reconstruction term is employed after explicit CoT generation is removed; alignment is maintained through the joint training objective and the initial CoT loss phase. We have added a new paragraph and accompanying diagram in the revised Training Methodology section to explicitly describe the gradient paths, task dependency, and absence of post-hoc reconstruction, confirming that the latent tokens are not mere artifacts.

revision: yes
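The gradient-flow statement in this response can be checked on a toy example. The sketch below is not the authors' implementation: it replaces attention with mean pooling over a query state and the think states, uses an InfoNCE loss with an assumed temperature, and estimates the gradient by central finite differences instead of autograd. All function names are hypothetical. It only shows that, when the embedding depends on the think states and no stop-gradient is applied, the contrastive loss has a non-zero gradient with respect to them.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def embed(query_state: Vector, think_states: List[Vector]) -> Vector:
    """Toy stand-in for the embedding token: normalized mean of all states."""
    states = [query_state] + think_states
    dim = len(query_state)
    mean = [sum(s[i] for s in states) / len(states) for i in range(dim)]
    return normalize(mean)


def info_nce(anchor: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor, one positive, and a list of negatives."""
    def dot(a: Vector, b: Vector) -> float:
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


def grad_wrt_first_think(query_state: Vector, think_states: List[Vector],
                         positive: Vector, negatives: List[Vector],
                         eps: float = 1e-5) -> Vector:
    """Central finite-difference gradient of the loss w.r.t. the first think state."""
    grads = []
    for i in range(len(think_states[0])):
        plus = [list(t) for t in think_states]
        minus = [list(t) for t in think_states]
        plus[0][i] += eps
        minus[0][i] -= eps
        loss_plus = info_nce(embed(query_state, plus), positive, negatives)
        loss_minus = info_nce(embed(query_state, minus), positive, negatives)
        grads.append((loss_plus - loss_minus) / (2 * eps))
    return grads


if __name__ == "__main__":
    grads = grad_wrt_first_think(
        query_state=[1.0, 0.0],
        think_states=[[0.0, 1.0]],
        positive=normalize([1.0, 1.0]),
        negatives=[[1.0, 0.0]],
    )
    print(grads)  # non-zero entries mean the think state receives gradient
```

In an autograd framework the same dependency would hold by default; cutting it would require an explicit stop-gradient on the think states, which the response says is not used.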

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or performance claims

The paper describes an empirical training procedure that optimizes latent think tokens via a CoT generation loss and embedding tokens via a contrastive loss on a shared LLM backbone, then evaluates the resulting representations on the external MMEB-v2 benchmark and 15 video datasets. No derivation chain is presented that reduces a claimed prediction or first-principles result to its own inputs by construction; the performance advantage is reported as an experimental outcome rather than a quantity defined by the paper's equations. The two-task dependency is a standard multi-objective optimization setup without self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations that would force the central result. This is a typical empirical architecture paper whose claims rest on benchmark measurements outside the training loop itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; paper relies on standard assumptions that contrastive loss produces useful embeddings and that CoT generation loss can supervise latent variables, but no explicit free parameters or axioms are stated.

pith-pipeline@v0.9.0 · 5814 in / 1072 out tokens · 66245 ms · 2026-05-20T17:53:36.386807+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation to the cited theorem: unclear
Paper passage: By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
Relation to the cited theorem: unclear
Paper passage: register-based mechanisms ... single pre-filling pass

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

