arxiv: 2606.04180 · v1 · pith:F6ZJZXMDnew · submitted 2026-06-02 · 💻 cs.LG · cs.IT · math.IT

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

Pith reviewed 2026-06-28 10:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords vision-language models · representation comparison · contrastive embedding clustering · kernel optimization · discrepancy analysis · multimodal kernels · representation alignment · CLIP

The pith

KODA identifies interpretable discrepancy directions in vision-language representations through constrained kernel optimization over sample subsets and modality interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to compare vision-language foundation models beyond downstream task performance by locating sample subsets that cluster weakly under one representation but strongly under another. It introduces KODA, which builds joint multimodal kernels and solves a constrained optimization to surface coherent structures in one model while suppressing them in a reference model. This approach produces directions tied to specific subsets and modality interactions that can guide representation alignment. A sympathetic reader would care because current evaluations leave unexplained how representations differ structurally, limiting targeted improvements in multimodal systems.

Core claim

KODA constructs unified multimodal kernels through modality-wise composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation, yielding interpretable discrepancy directions associated with specific sample subsets and modality interactions.

What carries the argument

KODA, a kernel-based framework that builds joint multimodal kernels and solves a constrained optimization to isolate coherent structures in one representation while suppressing them in another.
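The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the joint kernel is assumed to be an elementwise product of per-modality Gaussian kernels, and the constraint on the reference kernel is replaced by a regularized generalized Rayleigh quotient, x⊤K_target x / x⊤(K_ref + eps·I)x, maximized via Cholesky whitening. The helper names (`gaussian_kernel`, `joint_kernel`, `top_direction`), the bandwidth, and `eps` are all hypothetical choices made for this example.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian (RBF) Gram matrix over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def joint_kernel(img, txt, sigma=1.0):
    """Assumed modality-wise composition: elementwise product of two kernels."""
    return gaussian_kernel(img, sigma) * gaussian_kernel(txt, sigma)

def top_direction(K_target, K_ref, eps=1e-3):
    """Maximize x^T K_target x / x^T (K_ref + eps*I) x (an assumed surrogate)."""
    n = K_ref.shape[0]
    L = np.linalg.cholesky(K_ref + eps * np.eye(n))  # K_ref + eps*I = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ K_target @ L_inv.T                   # whitened target kernel
    vals, vecs = np.linalg.eigh(0.5 * (M + M.T))     # eigenvalues ascending
    x = L_inv.T @ vecs[:, -1]                        # map back from whitened space
    return vals[-1], x / np.linalg.norm(x)

# Synthetic data: the "target" embedding collapses the first m samples into a
# tight cluster in both modalities; the "reference" embedding does not.
rng = np.random.default_rng(0)
n, d, m = 60, 8, 15
img_ref, txt_ref = rng.normal(size=(n, d)), rng.normal(size=(n, d))
img_tgt, txt_tgt = img_ref.copy(), txt_ref.copy()
img_tgt[:m] = img_ref[0] + 0.05 * rng.normal(size=(m, d))
txt_tgt[:m] = txt_ref[0] + 0.05 * rng.normal(size=(m, d))

K_ref = joint_kernel(img_ref, txt_ref)
K_tgt = joint_kernel(img_tgt, txt_tgt)

lam_planted, x_planted = top_direction(K_tgt, K_ref)  # planted discrepancy
lam_null, _ = top_direction(K_ref, K_ref)             # identical embeddings
mass_on_cluster = np.sum(x_planted[:m] ** 2)          # share of x on the planted subset
```

In this toy setting the top quotient for the planted cluster exceeds 1 and most of the direction's squared mass sits on the planted samples, while comparing an embedding against itself gives a quotient below 1. That null case is a simple baseline to keep in mind when judging whether a recovered direction is meaningful.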

If this is right

  • KODA produces sample subsets that can be used directly for targeted representation alignment between models.
  • The method scales to large vision-language datasets via random projections and Random Fourier Features for joint kernels.
  • Discrepancy directions remain consistent across different vision-language models such as CLIP and SigLIP.
  • The framework supports identification of modality-specific interactions that drive representation differences.
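The scalability point above rests on Random Fourier Features (RFF). The following is a minimal sketch, assuming Gaussian kernels for both modalities and a product composition; it relies on the fact that a product of Gaussian kernels equals a single Gaussian kernel on the bandwidth-rescaled, concatenated inputs. The function names, dimensions, and bandwidths are hypothetical, and the paper's actual joint-feature construction may differ.

```python
import numpy as np

def rff_features(X, num_features, sigma, rng):
    """RFF map z(x) with E[z(x)^T z(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)       # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def joint_rff(img, txt, num_features, sigma_img, sigma_txt, rng):
    """Features for the product kernel via rescaled concatenation (assumption)."""
    Z = np.hstack([img / sigma_img, txt / sigma_txt])
    return rff_features(Z, num_features, sigma=1.0, rng=rng)

def gaussian_kernel(X, sigma):
    """Exact Gaussian Gram matrix, used here only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
img = rng.normal(size=(50, 6))   # stand-in image embeddings
txt = rng.normal(size=(50, 4))   # stand-in text embeddings

Phi = joint_rff(img, txt, num_features=4096, sigma_img=2.0, sigma_txt=1.5, rng=rng)
K_approx = Phi @ Phi.T                                        # low-rank surrogate
K_exact = gaussian_kernel(img, 2.0) * gaussian_kernel(txt, 1.5)
max_err = np.max(np.abs(K_approx - K_exact))                  # entrywise error
```

With n samples and D features, downstream computations can work with the n × D matrix `Phi` (or a D × D matrix built from it) instead of the full n × n Gram matrix, which is where the savings come from when n is much larger than D.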

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the discrepancy directions prove stable under different kernel choices, KODA could serve as a diagnostic tool for auditing alignment in deployed multimodal systems.
  • The sample subsets surfaced by KODA might be used to construct targeted fine-tuning datasets that close specific representation gaps without full retraining.
  • Extending the joint kernel construction to additional modalities beyond vision and language could reveal cross-modal coherence patterns not captured by pairwise comparisons.

Load-bearing premise

The constrained optimization over joint kernels isolates structurally meaningful coherence differences rather than optimization artifacts or kernel-specific biases.

What would settle it

Running KODA on two representations known to differ only by random noise and finding that the resulting directions do not correspond to any consistent sample subsets or modality patterns.

Figures

Figures reproduced from arXiv: 2606.04180 by Farzan Farnia, Mohammad Jalali, Youqi Wu.

Figure 1. Overview of KODA for Contrastive Embedding Clustering, which aims to discover sample clusters that are represented differently by two embeddings. We show KODA-identified contrastive clusters for BLIP and CLIP embeddings on the MS-COCO dataset.
Figure 2. Left: Visualization of the top-6 discrepancy directions that are strongly grouped under DINOv2 while being weakly clustered under CLIP on the FFHQ dataset, discovered by KODA. Right: t-SNE visualization of Top-10 directions together with clustering scores.
Figure 3. Multimodal discrepancy analysis on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-1 discrepancy direction identified by KODA for different vision–language models relative to CLIP. Bottom: Generalized Rayleigh quotient of the identified discrepancy directions under varying constraint quantiles defined on the CLIP kernel.
Figure 4. t-SNE visualization of KODA-selected samples before and after contrastive embedding alignment.
Figure 5. Kernel similarity heatmaps induced by CLIP and DINOv2 on the ImageNet-1k dog breeds, together with their difference (scaled by 100 for better visualization).
Figure 6. Consistency between dominant kernel mismatches (ground truth) and discrepancy directions identified by KODA on ImageNet dog breeds. Left: the kernel difference matrix between DINOv2 and CLIP computed using normalized RBF kernels. Middle: the top-3 ground-truth dog breeds associated with the largest aggregated mismatch scores in the difference matrix, together with representative images. Right: representati…
Figure 7. Top-10 DINOv2 dominant directions relative to CLIP on the AFHQ dataset identified by KODA, visualized via representative samples for each direction.
Figure 8. Top-10 CLIP dominant directions relative to DINOv2 on the AFHQ dataset identified by KODA, visualized via representative samples for each direction.
Figure 9. Left: Visualization of the top-5 mismatch directions of DINOv2 and CLIP on the AFHQ dataset discovered by KODA (ours) and SPEC (baseline), respectively. Right: Generalized Rayleigh quotient (x⊤K1x)/(x⊤K2x) w.r.t. the constraint on K2. (The quotient can be interpreted as a multiplicative measure of how strongly a given direction is represented in DINOv2 relative to CLIP.)
Figure 10. Multimodal discrepancy analysis of SigLIP dominant directions relative to OpenCLIP on the MSCOCO dataset.
Figure 11. Multimodal discrepancy analysis of OpenCLIP dominant directions relative to SigLIP on the MSCOCO dataset.
Figure 12. Multimodal discrepancy analysis of BLIP dominant directions relative to CLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using BLIP and CLIP embeddings respectively.
Figure 13. (no caption text recovered)
Figure 14. Multimodal discrepancy analysis of OpenCLIP dominant directions relative to CLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using OpenCLIP and CLIP embeddings respectively.
Figure 15. Multimodal discrepancy analysis of CLIP dominant directions relative to OpenCLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using CLIP and OpenCLIP embeddings respectively.
Figure 16. Multimodal discrepancy analysis of SigLIP dominant directions relative to CLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using SigLIP and CLIP embeddings respectively.
Figure 17. Multimodal discrepancy analysis of SigLIP2 dominant directions relative to CLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using SigLIP2 and CLIP embeddings respectively.
Figure 18. Multimodal discrepancy analysis of SigLIP dominant directions relative to OpenCLIP on the MSCOCO dataset. Top: Representative image–caption pairs corresponding to the Top-3 discrepancy directions identified by KODA. Bottom: t-SNE visualization of Top-10 discrepancy directions using SigLIP and OpenCLIP embeddings respectively.
Figure 19. Multimodal discrepancy analysis of SigLIP dominant directions relative to OpenCLIP on the MSCOCO dataset under different numbers of joint Random Fourier Features.
Figure 20. Multimodal discrepancy analysis of OpenCLIP dominant directions relative to CLIP on the MSCOCO dataset under a Gaussian or cosine kernel function.
Figure 21. Multimodal discrepancy analysis of SigLIP dominant directions relative to OpenCLIP on the MSCOCO dataset under different sample sizes.
Original abstract

Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under another. We propose Kernel Optimization for Discrepancy Analysis (KODA), a kernel-based framework for contrastive representation comparison and alignment. KODA constructs unified multimodal kernels through modality-wise kernel composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation. This yields interpretable discrepancy directions associated with specific sample subsets and modality interactions. To scale KODA to large vision-language datasets, we develop randomized low-dimensional approximations of joint kernels using random projections, including Random Fourier Features for shift-invariant kernels. Empirically, KODA identifies consistent and interpretable discrepancy structures across vision-language representations and provides sample subsets for representation alignment. The code is available at https://github.com/yokiwuuu/KODA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces KODA, a kernel-based framework for contrastive representation comparison between vision-language models (e.g., CLIP, SigLIP). It constructs joint multimodal kernels via modality-wise composition, then casts discrepancy discovery as a constrained optimization that maximizes coherence in one embedding while suppressing it in a reference embedding. Randomized low-rank approximations (including Random Fourier Features) are used for scalability. The central claim is that this procedure yields interpretable discrepancy directions tied to specific sample subsets and modality interactions, with empirical results showing consistent structures and utility for representation alignment. Code is released.

Significance. If the recovered directions prove robust to kernel choice and projection artifacts, KODA would supply a useful interpretability tool that goes beyond downstream-task comparisons for multimodal representations. The explicit release of code supports reproducibility, which strengthens the contribution if the empirical claims hold under scrutiny.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the claim that the constrained optimization 'yields interpretable discrepancy directions associated with specific sample subsets' is load-bearing, yet the manuscript provides no identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement. Without such guarantees, it remains possible that the directions reflect kernel composition or randomized approximation artifacts rather than intrinsic representation differences.
  2. [Abstract] Abstract: the statement 'Empirically, KODA identifies consistent and interpretable discrepancy structures' is presented without any reported quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence directly undermines assessment of whether the optimization isolates structurally meaningful coherence differences.
minor comments (1)
  1. [Abstract] The GitHub link is provided, which is helpful for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the theoretical grounding and empirical presentation of KODA. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the claim that the constrained optimization 'yields interpretable discrepancy directions associated with specific sample subsets' is load-bearing, yet the manuscript provides no identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement. Without such guarantees, it remains possible that the directions reflect kernel composition or randomized approximation artifacts rather than intrinsic representation differences.

    Authors: We agree that formal identifiability, stability bounds, or invariance results with respect to kernel choice, projection dimension, and Lagrange enforcement are absent from the manuscript. The method is motivated by the explicit optimization objective of maximizing coherence differences, and we report empirical consistency across runs and models. In revision we will add a dedicated limitations subsection discussing potential approximation artifacts and include new experiments on stability under varying random-projection dimensions and kernel families. revision: partial

  2. Referee: [Abstract] Abstract: the statement 'Empirically, KODA identifies consistent and interpretable discrepancy structures' is presented without any reported quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence directly undermines assessment of whether the optimization isolates structurally meaningful coherence differences.

    Authors: The abstract is intentionally concise. The full manuscript contains quantitative consistency metrics, ablation studies on kernel parameters and projection rank, and cross-model comparisons in the experimental section. We will revise the abstract to explicitly reference these quantitative results and ensure error bars are reported for all consistency measures. revision: yes

standing simulated objections not resolved
  • Formal identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement

Circularity Check

0 steps flagged

No circularity: KODA presented as independent constrained optimization over kernels

full rationale

The abstract and method description formulate discrepancy discovery as a constrained optimization over joint kernels constructed via modality-wise composition, with randomized approximations for scaling. No equations, self-citations, or claims are shown that reduce the recovered directions or coherence measures to fitted parameters, prior self-referential results, or definitions that embed the target output. The procedure is described as an external optimization task whose outputs (interpretable directions) are not forced by construction from the inputs, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters or invented entities; only high-level construction steps are stated.

axioms (1)
  • domain assumption: Modality-wise kernel composition produces valid unified multimodal kernels.
    Invoked in the description of kernel construction.
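
For two common composition rules this assumption follows from standard results. The review does not state which rule the paper uses, so the elementwise product and the nonnegative sum below are illustrative assumptions, and the symbols K_img, K_txt, and K_joint are introduced here only for the example. If both per-modality Gram matrices are positive semidefinite, the Schur product theorem gives validity of the product, and closure of the PSD cone under nonnegative combinations gives validity of the sum:

```latex
K_{\mathrm{img}} \succeq 0,\quad K_{\mathrm{txt}} \succeq 0
\;\Longrightarrow\;
K_{\mathrm{joint}}^{\odot} = K_{\mathrm{img}} \odot K_{\mathrm{txt}} \succeq 0
\quad\text{and}\quad
K_{\mathrm{joint}}^{+} = \alpha K_{\mathrm{img}} + \beta K_{\mathrm{txt}} \succeq 0
\quad (\alpha, \beta \ge 0).
```

Other composition rules would need their own argument, so the axiom is automatic only for rules of this kind.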

pith-pipeline@v0.9.1-grok · 5739 in / 982 out tokens · 22703 ms · 2026-06-28T10:47:25.693562+00:00 · methodology

