Pith Number

pith:YPHKM4EJ

pith:2026:YPHKM4EJRCZ5UJCQL2UWL3Y2P2
not attested · not anchored · not stored · refs resolved

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Guanghao Zhang, Hao Jiang, Longxiang Zhang, Pipei Huang, Weilong Dai

A dual-LoRA architecture with a routing gate lets multimodal embeddings add chain-of-thought reasoning only when it improves results.

arxiv:2605.14448 v1 · 2026-05-14 · cs.CV · cs.CL · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YPHKM4EJRCZ5UJCQL2UWL3Y2P2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.
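To put the 3-5% figure in perspective, the arithmetic below is a rough sketch of how LoRA overhead scales with rank. The hidden size (4096), the ranks (32 and 48), and the helper name `lora_overhead` are illustrative assumptions, not values reported for TWN. It also assumes every backbone parameter sits in an adapted square matrix, which real models do not satisfy exactly.

```python
def lora_overhead(d_in: int, d_out: int, rank: int, n_adapters: int = 1) -> float:
    """Fraction of extra parameters added to one d_in x d_out weight matrix.

    Each LoRA adapter adds two low-rank factors: (d_in x rank) and (rank x d_out).
    """
    base = d_in * d_out
    added = n_adapters * rank * (d_in + d_out)
    return added / base


# Hypothetical configuration: two adapters (reasoning + embedding) on 4096 x 4096 weights.
low = lora_overhead(4096, 4096, rank=32, n_adapters=2)
high = lora_overhead(4096, 4096, rank=48, n_adapters=2)
print(f"rank 32: {low:.4f}, rank 48: {high:.4f}")
```

Under these assumed shapes, two adapters at ranks between 32 and 48 land in roughly the range the claim quotes; the actual ranks and adapted modules would have to come from the paper itself.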

C2 · weakest assumption

That the self-supervised routing gate accurately identifies inputs where reasoning is unnecessary or harmful, and that detaching gradients at the LoRA interface fully resolves optimization conflicts without introducing new biases in the learned adapters.

C3 · one-line summary

TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
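The per-input routing described in this summary can be pictured as a simple control flow. The sketch below is a toy illustration only: the class `GatedEmbedder`, its fields, the 0.5 threshold, and the stand-in functions are all hypothetical, and it does not model the frozen backbone, the LoRA adapters, or gradient detachment, all of which need an autograd framework to show.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class GatedEmbedder:
    gate: Callable[[str], float]    # hypothetical: score in [0, 1] that reasoning helps
    reason: Callable[[str], str]    # hypothetical: stands in for the reasoning adapter
    embed: Callable[[str], Vector]  # hypothetical: stands in for the embedding adapter
    threshold: float = 0.5          # assumed cut-off, not taken from the paper

    def __call__(self, text: str) -> Tuple[Vector, int]:
        """Return the embedding and the number of reasoning tokens spent."""
        if self.gate(text) >= self.threshold:
            cot = self.reason(text)
            # Reasoning path: embed the input together with its chain of thought.
            return self.embed(text + "\n" + cot), len(cot.split())
        # Direct path: skip generation entirely, so no reasoning tokens are used.
        return self.embed(text), 0


# Toy stand-ins so the control flow can be exercised end to end.
embedder = GatedEmbedder(
    gate=lambda t: 1.0 if "?" in t else 0.0,
    reason=lambda t: "step one step two",
    embed=lambda t: [float(len(t))],
)

direct_vec, direct_tokens = embedder("hello")
routed_vec, routed_tokens = embedder("why?")
```

The token savings in the claim would come from the direct path: any input the gate sends there costs zero reasoning tokens.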

References

43 extracted · 43 resolved · 7 Pith anchors

[1] Qwen3-VL Technical Report 2025 · arXiv:2511.21631
[2] Llm2vec: Large language models are secretly powerful text encoders 2024
[3] Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks 2024
[4] Think then embed: Generative context improves multimodal embedding 2025
[5] Flashattention-2: Faster attention with better parallelism and work partitioning 2024
Receipt and verification
First computed 2026-05-17T23:39:06.927990Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

c3cea6708988b3da24505ea965ef1a7ea68849a2dfc53da52becc64b1d2f27aa

Aliases

arxiv: 2605.14448 · arxiv_version: 2605.14448v1 · doi: 10.48550/arxiv.2605.14448 · pith_short_12: YPHKM4EJRCZ5 · pith_short_16: YPHKM4EJRCZ5UJCQ · pith_short_8: YPHKM4EJ
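The page does not document how the identifier is derived. For this record, though, the Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and the short aliases are prefixes of it. The sketch below checks that relationship; treat it as an observation about this one record rather than a documented rule. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

CANONICAL_HASH = "c3cea6708988b3da24505ea965ef1a7ea68849a2dfc53da52becc64b1d2f27aa"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
# Short aliases are assumed to be plain prefixes of the full identifier.
aliases = {n: full[:n] for n in (8, 12, 16)}
print(full, aliases)
```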
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YPHKM4EJRCZ5UJCQL2UWL3Y2P2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c3cea6708988b3da24505ea965ef1a7ea68849a2dfc53da52becc64b1d2f27aa
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "405fe0f1b8407cb4324f8cb7cfe2a184ddbb68ca88a963435733d0569cc84b45",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.IR"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T06:41:53Z",
    "title_canon_sha256": "81f20f7a08c7da0115639609c88d140d5c2e5c032065544a9c24fb5f14f2b914"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14448",
    "kind": "arxiv",
    "version": 1
  }
}
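The shell one-liner under "Verify this Pith Number yourself" can also be run offline against a saved copy of the record. The sketch below mirrors the same canonicalization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) followed by SHA-256. The function names and the `record.json` file name in the usage note are assumptions for illustration.

```python
import hashlib
import json


def canonical_bytes(record_json: str) -> bytes:
    """Re-serialize JSON with sorted keys and no whitespace, as the page's one-liner does."""
    record = json.loads(record_json)
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record_json: str) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record_json)).hexdigest()


def matches(record_json: str, expected_hex: str) -> bool:
    """True when the recomputed digest equals the published canonical hash."""
    return canonical_sha256(record_json) == expected_hex.strip().lower()
```

For example, after saving the `canonical_record` object to a hypothetical `record.json`, calling `matches(open("record.json", encoding="utf-8").read(), "c3cea670...")` with the full published hash reports whether the record still hashes to the published value. Because keys are sorted and whitespace is dropped before hashing, the formatting of the saved file does not affect the result.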