Pith Number

pith:S644O7FL

pith:2026:S644O7FLM6W7PE5FHLZRWB7SYS
not attested · not anchored · not stored · refs pending

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeonghyun Pak, Byeongju Woo, Sangwoo Mo, Stella X. Yu, Zilin Wang

CAFT aligns local descriptions in long captions to image regions before forming global scene representations.

arxiv:2602.02977 v2 · 2026-02-03 · cs.CV · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{S644O7FLM6W7PE5FHLZRWB7SYS}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
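
The merge algorithm itself is not specified on this page. As a generic illustration of what "deterministic merge" means, the sketch below deduplicates events by content hash, sorts them by a total order, and replays them, so any mirror holding the same events reaches the same state regardless of arrival order. The event fields (`at`, `field`, `value`) and the last-writer-wins rule are hypothetical and are not taken from the Pith schema.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content hash of an event, used for deduplication and tie-breaking."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merge_events(*event_sets):
    """Replay all events in a total order so the result ignores arrival order.

    Hypothetical rule: the event with the latest `at` wins per field,
    and ties are broken by the event's content hash.
    """
    unique = {event_id(e): e for events in event_sets for e in events}
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {}
    for _, event in ordered:
        state[event["field"]] = event["value"]
    return state


# Hypothetical events held by two mirrors, in different orders.
e1 = {"at": "2026-05-18T03:09:23Z", "field": "citations", "value": "open"}
e2 = {"at": "2026-05-19T00:00:00Z", "field": "citations", "value": "closed"}
e3 = {"at": "2026-05-18T04:00:00Z", "field": "replications", "value": "open"}

mirror_a = merge_events([e1, e2], [e3])
mirror_b = merge_events([e3, e2, e1], [e1])
print(mirror_a == mirror_b)
```

Sorting on `(at, content hash)` gives a total order even when two events share a timestamp, which is what makes the replay reproducible.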

Claims

C1 · strongest claim

CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

C2 · weakest assumption

Long captions are assumed to contain local descriptions that correspond to distinct scene parts, so the model can discover localized alignments without region-level supervision or explicit part annotations.

C3 · one line summary

CAFT achieves state-of-the-art results on long-text image retrieval benchmarks by jointly learning local text-region alignments and global image-text alignments through fine-to-coarse encoders.

Formal links

2 machine-checked theorem links

Cited by

1 paper in Pith

Receipt and verification
First computed: 2026-05-18T03:09:23.985521Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

97b9c77cab67adf793a53af31b07f2c48a410044b822197921209968f94d9c93

Aliases

arxiv: 2602.02977 · arxiv_version: 2602.02977v2 · doi: 10.48550/arxiv.2602.02977 · pith_short_12: S644O7FLM6W7 · pith_short_16: S644O7FLM6W7PE5F · pith_short_8: S644O7FL
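
For this record, the full identifier and the `pith_short_*` aliases match a prefix of the RFC 4648 base32 encoding of the canonical hash listed above. The sketch below reproduces that relationship. Whether this is the official derivation is an assumption inferred from the values on this page, and the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this record.
CANONICAL_HASH = "97b9c77cab67adf793a53af31b07f2c48a410044b822197921209968f94d9c93"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: the 26-character length is read off this record,
    not taken from a published specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)                     # S644O7FLM6W7PE5FHLZRWB7SYS
print(aliases["pith_short_8"])  # S644O7FL
```

The short aliases are plain prefixes of the full identifier, so a shorter form can always be checked against a longer one by string comparison.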
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/S644O7FLM6W7PE5FHLZRWB7SYS \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 97b9c77cab67adf793a53af31b07f2c48a410044b822197921209968f94d9c93
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "a9483e4f2a9cf675310f24a032f416d25b9131252d80a2286a9966dd07f9d022",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-02-03T01:31:55Z",
    "title_canon_sha256": "3618d2c65be29b341249d95664bd7f14d9028d16bf7c984a1900de44fc446b2c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2602.02977",
    "kind": "arxiv",
    "version": 2
  }
}
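
The shell pipeline above can also be run offline against the record shown here. The following is a minimal sketch that applies the same serialization settings as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) before hashing. The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON displayed above is the complete canonical record.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "a9483e4f2a9cf675310f24a032f416d25b9131252d80a2286a9966dd07f9d022",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-02-03T01:31:55Z",
        "title_canon_sha256": "3618d2c65be29b341249d95664bd7f14d9028d16bf7c984a1900de44fc446b2c",
    },
    "schema_version": "1.0",
    "source": {"id": "2602.02977", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this record
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores the fields. Any change to a value, such as the `version`, produces a different digest.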