Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeonghyun Pak; Byeongju Woo; Sangwoo Mo; Stella X. Yu; Zilin Wang

arxiv: 2602.02977 · v2 · pith:S644O7FLnew · submitted 2026-02-03 · 💻 cs.CV · cs.AI · cs.LG

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Pith reviewed 2026-05-16 08:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords vision-language models · long captions · hierarchical alignment · image-text retrieval · fine-grained understanding · part-whole composition · cross-domain alignment · localized semantics

The pith

CAFT aligns local descriptions in long captions to image regions before forming global scene representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models like CLIP often overlook fine details in long captions by relying on dominant scene cues. The paper proposes a hierarchical principle: a model should first uncover the semantic parts of an image before composing a whole-scene understanding. CAFT implements this with a fine-to-coarse image encoder and a part-whole text encoder that jointly optimize local text-region alignment and global image-text matching. If the approach holds, it produces fine-grained representations that localize textual semantics without region-level labels or supervision. Trained on 30 million image-text pairs, CAFT reports state-of-the-art results on six long-text retrieval benchmarks and strong scaling behavior.

Core claim

CAFT jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation by exploiting the organization of long captions where local descriptions correspond to scene parts, using a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation.
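This page does not reproduce the paper's loss definitions, so the following is only a minimal sketch of what a joint local-plus-global objective of this kind could look like. It assumes a symmetric InfoNCE loss for both terms and a "best region per sentence" rule for the local score. The names `info_nce`, `local_similarity`, and `joint_loss`, the temperature of 0.07, and the `local_weight` parameter are illustrative assumptions, not taken from CAFT.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # leave zero vectors unchanged
    return [x / norm for x in v]

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix with positives on the diagonal."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(cols))

def local_similarity(regions, sentences):
    """Assumed local score: match each sentence to its best region, then average."""
    regions = [normalize(r) for r in regions]
    sentences = [normalize(s) for s in sentences]
    return sum(max(dot(s, r) for r in regions) for s in sentences) / len(sentences)

def joint_loss(img_global, txt_global, img_regions, txt_sentences, local_weight=1.0):
    """Global image-text term plus a weighted local region-sentence term."""
    glob = [[dot(normalize(i), normalize(t)) for t in txt_global] for i in img_global]
    loc = [[local_similarity(r, s) for s in txt_sentences] for r in img_regions]
    return info_nce(glob) + local_weight * info_nce(loc)

# Toy batch of two image-caption pairs (hand-made vectors, not real features).
img_global = [[1.0, 0.0], [0.0, 1.0]]
txt_global = [[1.0, 0.0], [0.0, 1.0]]
img_regions = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
txt_sentences = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]

aligned = joint_loss(img_global, txt_global, img_regions, txt_sentences)
swapped = joint_loss(img_global, txt_global[::-1], img_regions, txt_sentences[::-1])
print(aligned < swapped)
```

In this toy batch the correctly paired captions give a much lower loss than the swapped ones, which is the behavior a joint objective is meant to reward. How CAFT actually extracts intermediate representations and weights the two terms is not stated on this page.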

What carries the argument

CAFT pairs a fine-to-coarse image encoder with a part-whole text encoder to discover localized part semantics from long captions and compose them into global image-text representations.
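A passage quoted further down this page describes the visual encoder as progressively clustering fine-grained tokens into coarser segment tokens based on semantic similarity. The sketch below illustrates that general idea only. It assumes farthest-point seeding, hard assignment by cosine similarity, and mean pooling, none of which is confirmed as CAFT's actual mechanism; `coarsen` is a hypothetical helper name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # leave zero vectors unchanged
    return [x / norm for x in v]

def coarsen(tokens, num_segments):
    """Pool fine tokens into `num_segments` coarser tokens by cosine similarity."""
    if not 1 <= num_segments <= len(tokens):
        raise ValueError("num_segments must be between 1 and len(tokens)")
    unit = [normalize(t) for t in tokens]

    # Farthest-point seeding: start at token 0, then keep adding the token
    # that is least similar to every seed chosen so far.
    seeds = [0]
    while len(seeds) < num_segments:
        candidates = [i for i in range(len(unit)) if i not in seeds]
        seeds.append(min(candidates,
                         key=lambda i: max(dot(unit[i], unit[s]) for s in seeds)))

    # Hard-assign every token to its most similar seed.
    assignment = [max(range(num_segments), key=lambda k: dot(u, unit[seeds[k]]))
                  for u in unit]

    # Mean-pool the members of each segment into one coarser token.
    dim = len(tokens[0])
    segments = []
    for k in range(num_segments):
        members = [tokens[i] for i, a in enumerate(assignment) if a == k]
        members = members or [tokens[seeds[k]]]  # guard against an empty segment
        segments.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return segments, assignment

# Six toy "patch" tokens forming three obvious groups.
fine = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
mid, mid_assignment = coarsen(fine, 3)    # fine -> intermediate segments
coarse, coarse_assignment = coarsen(mid, 2)  # intermediate -> coarser segments
print(mid_assignment, len(coarse))
```

Applying `coarsen` repeatedly gives a fine-to-coarse hierarchy in which each level has fewer tokens than the one before. A learned encoder would presumably use trainable, differentiable grouping rather than this fixed heuristic.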

If this is right

Achieves state-of-the-art performance on six long-text retrieval benchmarks after training on 30 million image-text pairs.
Exhibits strong scaling behavior with increases in model size and training data.
Learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.
Enables models to treat scenes as explicit part-to-whole compositions rather than single global embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical principle could be applied to video or audio with long descriptive transcripts to discover local event alignments.
Downstream tasks such as detailed visual question answering may benefit from the localized representations without additional annotation cost.
Training pipelines could shift away from expensive region-level labels toward caption-driven localization at scale.

Load-bearing premise

Long captions naturally contain local descriptions that correspond to distinct scene parts, allowing the model to discover localized alignments without any region-level supervision or explicit part annotations.

What would settle it

Train an otherwise identical model without the local alignment objective on a dataset of long captions that deliberately lack corresponding part descriptions, then measure whether retrieval accuracy on the six benchmarks falls to the level of standard global-alignment baselines.

Original abstract

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAFT adds a hierarchical fine-to-coarse image encoder and part-whole text encoder to handle long captions better than standard CLIP, with claimed SOTA retrieval results, but the evidence tying gains to the part-to-whole principle is still thin.

The main takeaway is that this paper introduces CAFT to address CLIP's tendency to miss fine details in long captions by building in explicit hierarchical alignment: a fine-to-coarse image encoder that processes local regions first and a part-whole text encoder that composes local descriptions into global ones. Trained on 30M pairs, it reports state-of-the-art results on six long-text retrieval benchmarks plus scaling behavior and some localization of text semantics to image areas without region supervision. That architectural choice is the clearest new element compared to flat contrastive models.

It does a solid job framing the practical problem around detailed scene understanding and showing that the model can pick up localized alignments from caption structure alone. The scaling note is also useful as a signal that the approach might not be brittle.

The soft spots sit in the validation. The abstract gives no numbers, no ablation tables, and no error analysis, so it is difficult to judge how much the hierarchical design actually contributes versus model capacity or training specifics. The central assumption that long captions reliably break into distinct local part descriptions that map cleanly to image regions is stated but not directly tested with evidence like attention visualizations or controlled breakdowns. If the captions are mostly global or correlated, the reported localization could be an artifact rather than a result of the part-to-whole principle.

This work is aimed at researchers working on multimodal grounding and retrieval for complex scenes. Readers looking for concrete ideas on hierarchical encoders will find something usable even if they have to implement the details themselves. It deserves a serious referee because the problem is real and the claimed results are relevant, even if the current write-up needs more quantitative checks to stand up. I would send it to peer review rather than desk reject.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CAFT, a hierarchical vision-language model using a fine-to-coarse image encoder and part-whole text encoder to jointly learn local text-region alignments and global image-text alignments. It exploits the natural organization of long captions to discover localized part semantics without explicit region-level supervision, trains on 30M image-text pairs, and claims state-of-the-art performance on six long-text retrieval benchmarks plus strong scaling behavior and fine-grained localization.

Significance. If the localization mechanism and SOTA gains are substantiated, the part-to-whole hierarchical principle could meaningfully advance fine-grained visually grounded understanding in VLMs beyond standard contrastive global alignment, particularly for detail-rich captions.

major comments (3)

[Abstract] The abstract reports SOTA results on six long-text retrieval benchmarks and scaling behavior but supplies no quantitative numbers, ablation studies, error analysis, or baseline comparisons, preventing verification that the hierarchical structure (rather than capacity or training scale) drives the gains.
[§3] Methods: the fine-to-coarse image encoder and part-whole text encoder are described only at a conceptual level; no equations define the local contrastive alignment loss, the global alignment loss, or the progressive composition mechanism, and no implementation details (e.g., how intermediate representations are extracted or aligned) are given.
[§4] Experiments: full data splits, training hyperparameters, and quantitative results tables are absent, so it is impossible to assess reproducibility or whether the reported localization of textual semantics in image regions is actually achieved by the part-to-whole principle rather than global cues.

minor comments (1)

Figure captions and notation for the hierarchical encoders could be clarified with an explicit diagram showing the flow from local to global representations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We will revise the manuscript to incorporate quantitative results, mathematical formulations, and complete experimental details as suggested. These updates will strengthen the presentation of our hierarchical alignment approach without altering the core claims.

Referee: [Abstract] The abstract reports SOTA results on six long-text retrieval benchmarks and scaling behavior but supplies no quantitative numbers, ablation studies, error analysis, or baseline comparisons, preventing verification that the hierarchical structure (rather than capacity or training scale) drives the gains.

Authors: We agree that the abstract would benefit from key quantitative indicators. In the revised version, we will add specific metrics such as recall@1 gains on the six benchmarks (e.g., improvements over CLIP and other baselines) and a brief reference to ablation results isolating the hierarchical component. Full error analysis and exhaustive baseline tables will be expanded in Section 4 and the supplement due to abstract length constraints.
Revision: yes

Referee: [§3] Methods: the fine-to-coarse image encoder and part-whole text encoder are described only at a conceptual level; no equations define the local contrastive alignment loss, the global alignment loss, or the progressive composition mechanism, and no implementation details (e.g., how intermediate representations are extracted or aligned) are given.

Authors: We acknowledge that the methods section in the submitted manuscript remained at a conceptual level. We will add the explicit loss equations for local contrastive alignment (L_local), global alignment (L_global), and the progressive composition operator, along with implementation specifics on intermediate feature extraction from the fine-to-coarse encoder and cross-domain alignment steps. Diagrams and pseudocode will also be included for reproducibility.
Revision: yes

Referee: [§4] Experiments: full data splits, training hyperparameters, and quantitative results tables are absent, so it is impossible to assess reproducibility or whether the reported localization of textual semantics in image regions is actually achieved by the part-to-whole principle rather than global cues.

Authors: We agree that complete experimental details are essential. The revised manuscript will include the precise training data splits from the 30M pairs, all hyperparameters (batch size, learning rates, temperature, epochs), and full quantitative tables with baseline comparisons. To substantiate that localization arises from the part-to-whole mechanism, we will add targeted ablations (e.g., ablating local alignment) and additional localization metrics/visualizations demonstrating gains beyond global cues.
Revision: yes
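
The rebuttal names recall@1 as the headline retrieval metric. As a reminder of what that number measures, here is a minimal sketch of Recall@K computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, stored at the same index as the query; real benchmarks may pair several captions with each image, and the similarity values below are invented for illustration.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    sim[q][c] is the similarity between query q and candidate c.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += q in ranked[:k]
    return hits / len(sim)

# Three text queries scored against three images (made-up scores).
sim = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.2, 0.3, 0.8],  # query 1: correct image ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image ranked first
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(r1, r2)
```

An ablation of the kind the authors promise would report this metric for the full model and for a variant without the local alignment term, on identical data and candidate sets.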

Circularity Check

0 steps flagged

No significant circularity in hierarchical contrastive alignment

The paper introduces CAFT via a fine-to-coarse image encoder and part-whole text encoder that apply standard contrastive objectives to long captions for local-to-global alignment. No equations or derivations are shown that reduce to fitted parameters by construction, nor are there self-citation chains or uniqueness theorems invoked to force the architecture. Performance on six retrieval benchmarks after training on 30M pairs constitutes independent empirical evidence rather than a self-referential prediction. The approach is self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that long captions decompose into local part descriptions that align with image regions, plus standard VLM training assumptions.

axioms (1)

Domain assumption: Long captions contain local descriptions that correspond to distinct scene parts.
Invoked to justify the part-whole text encoder and local alignment objective.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: CAFT employs a fine-to-coarse image encoder and a part-whole text encoder... hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: fine-to-coarse visual encoder that progressively clusters fine-grained tokens into coarser segment tokens based on semantic similarity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.