Pith Number

pith:4ZAXKMRQ

pith:2026:4ZAXKMRQEX3UBNINNER7WWVLJ3

not attested · not anchored · not stored · refs resolved

Deep Pre-Alignment for VLMs

Bo Zheng, Jun Song, Kaidong Zhang, Kechen Fang, Tianyu Yu, Yicheng Zhang, Yuan Yao, Zihao Wan

Deep Pre-Alignment replaces the ViT encoder with a small VLM perceiver to align visual features deeply with the LLM's text space.

arxiv:2605.15300 v1 · 2026-05-14 · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{4ZAXKMRQEX3UBNINNER7WWVLJ3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp

2. Internet Archive

3. Author claim · open

4. Citations · open

5. Replications · open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
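As an illustration of what "deterministic merge" means here, the following is a minimal sketch, not the actual Pith merge algorithm (which this page does not describe). It assumes a hypothetical event shape (`event_id`, `timestamp`, `key`, `value`) and a last-writer-wins rule; signature checking is left out entirely. The point is only that sorting events into a canonical order before folding them lets two mirrors that received the same events in different orders arrive at the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical fields; the real event schema is not shown on this page.
    event_id: str
    timestamp: str  # ISO 8601 UTC, so lexical order matches time order
    key: str
    value: str


def merge(events):
    """Fold events into a state dict in a canonical order (last writer wins)."""
    # Drop duplicate deliveries; assumes an event_id always names the same event.
    unique = {e.event_id: e for e in events}
    # Canonical order: by timestamp, ties broken by event_id.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = {}
    for e in ordered:
        state[e.key] = e.value
    return state


events = [
    Event("a", "2026-05-20T00:00:00Z", "author_claim", "open"),
    Event("b", "2026-05-21T00:00:00Z", "author_claim", "claimed"),
    Event("c", "2026-05-20T12:00:00Z", "citations", "open"),
]

# A second mirror sees the same events reversed, with one duplicate delivery.
other_mirror = list(reversed(events)) + [events[0]]
print(merge(events) == merge(other_mirror))
```

Sorting on a total order (timestamp plus a tie-breaking id) is what makes the result independent of arrival order; any other total order over events would serve equally well.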

Claims

C1 · strongest claim

At the 4B-parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. By offloading alignment to the perceiver, DPA also achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks.

C2 · weakest assumption

That feeding the LLM with features from a small VLM perceiver (rather than a standard ViT plus projector) produces sufficiently deep pre-alignment so that the LLM's initial layers no longer perform superficial modality matching, as stated in the motivation citing prior alignment analyses.

C3 · one-line summary

Deep Pre-Alignment uses a small VLM perceiver instead of a ViT encoder to pre-align visual features with the LLM's text space, yielding gains of 1.9 to 3.0 points on multimodal benchmarks and 32.9% less language forgetting.

References

172 extracted · 172 resolved · 29 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-20T00:00:51.496070Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

e64175323025f740b50d6923fb5aab4ee0584fd4493f7cdf02af0e3e2a6b087d
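For this record, the 26-character Pith Number coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. That is an observation about the values shown on this page, not a documented derivation rule, so the helper below (`base32_prefix`, a name chosen here for illustration) should be read as a sketch under that assumption.

```python
import base64

CANONICAL_HASH = "e64175323025f740b50d6923fb5aab4ee0584fd4493f7cdf02af0e3e2a6b087d"
PITH_NUMBER = "4ZAXKMRQEX3UBNINNER7WWVLJ3"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Compare the derived prefix with the Pith Number shown on this page.
print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
```

Twenty-six base32 characters carry 130 bits, so under this assumption the identifier is a truncation of the 256-bit SHA-256 digest rather than a full encoding of it.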

Aliases

arxiv: 2605.15300 · arxiv_version: 2605.15300v1 · doi: 10.48550/arxiv.2605.15300 · pith_short_12: 4ZAXKMRQEX3U · pith_short_16: 4ZAXKMRQEX3UBNIN · pith_short_8: 4ZAXKMRQ
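The alias values above follow simple patterns: each `pith_short_N` is the first N characters of the full Pith Number, `arxiv_version` appends `v` and the version to the arXiv id, and the DOI uses the `10.48550/arxiv.` prefix. The function below is a minimal sketch that reproduces the listed aliases from those three inputs; `derive_aliases` is a hypothetical helper, not part of any Pith API.

```python
def derive_aliases(pith_number: str, arxiv_id: str, version: int) -> dict:
    """Rebuild the alias table from the Pith Number and the arXiv source."""
    aliases = {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
    }
    # Short forms are plain prefixes of the full identifier.
    for n in (8, 12, 16):
        aliases[f"pith_short_{n}"] = pith_number[:n]
    return aliases


aliases = derive_aliases("4ZAXKMRQEX3UBNINNER7WWVLJ3", "2605.15300", 1)
for name, value in sorted(aliases.items()):
    print(f"{name}: {value}")
```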

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

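# Fetch the record, extract canonical_record, re-serialize it (sorted keys,
# compact separators, UTF-8), and print its SHA-256 digest.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4ZAXKMRQEX3UBNINNER7WWVLJ3 \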
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e64175323025f740b50d6923fb5aab4ee0584fd4493f7cdf02af0e3e2a6b087d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "017f88a96ca487ffa8bf0917da4183da73b1aac9cfb192fa146f4ce820b65e5e",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T18:14:15Z",
    "title_canon_sha256": "7e175ea771a9c586b737a13198eb361d0be7b9493deaa788087f9f5b5b77380d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15300",
    "kind": "arxiv",
    "version": 1
  }
}
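
The hashing step of the shell pipeline under "Verify this Pith Number yourself" can also be run offline against a saved copy of the record. The sketch below applies the same serialization settings as that pipeline (sorted keys, `(',', ':')` separators, `ensure_ascii=False`, UTF-8) to the record as displayed above. Whether the printed digest equals the canonical hash depends on the displayed record being identical to the one the resolver serves, so no match is asserted here; `canonical_hash` is a name chosen for illustration.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "017f88a96ca487ffa8bf0917da4183da73b1aac9cfb192fa146f4ce820b65e5e",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-14T18:14:15Z",
        "title_canon_sha256": "7e175ea771a9c586b737a13198eb361d0be7b9493deaa788087f9f5b5b77380d",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15300", "kind": "arxiv", "version": 1},
}

# Compare this value by eye with the canonical hash listed on the page.
print(canonical_hash(record))
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON object's fields were written, which is what lets independent mirrors recompute the same value.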