KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
Pith reviewed 2026-06-28 10:47 UTC · model grok-4.3
The pith
KODA identifies interpretable discrepancy directions in vision-language representations through constrained kernel optimization over sample subsets and modality interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KODA constructs unified multimodal kernels through modality-wise composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation, yielding interpretable discrepancy directions associated with specific sample subsets and modality interactions.
What carries the argument
KODA, a kernel-based framework that builds joint multimodal kernels and solves a constrained optimization to isolate coherent structures in one representation while suppressing them in another.
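As a rough illustration of the kind of optimization described here, the sketch below uses a Lagrangian relaxation: for two kernel matrices over the same samples, it takes the top eigenvector of K_target - lam * K_ref. The relaxation itself, the function names `gaussian_kernel` and `discrepancy_direction`, the penalty `lam`, the bandwidth, and the toy data are all assumptions made for illustration; the paper's exact objective and constraint are not stated in this summary.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def discrepancy_direction(K_target, K_ref, lam=1.0):
    """Top eigenpair of (K_target - lam * K_ref) / n.

    Assumed relaxation: reward coherence under the target kernel and
    penalize coherence under the reference kernel.
    """
    n = K_target.shape[0]
    M = (K_target - lam * K_ref) / n
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    v = eigvecs[:, -1]
    if v[np.argmax(np.abs(v))] < 0:       # fix the sign for readability
        v = -v
    return eigvals[-1], v

# Toy data (hypothetical): the first k samples form a tight cluster in the
# target representation but are unstructured in the reference one.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 40
X_ref = rng.normal(size=(n, d))
X_target = rng.normal(size=(n, d))
X_target[:k] = 3.0 + 0.1 * rng.normal(size=(k, d))

K_t = gaussian_kernel(X_target, sigma=2.0)
K_r = gaussian_kernel(X_ref, sigma=2.0)
top_val, v = discrepancy_direction(K_t, K_r, lam=1.0)

# Samples with the largest loadings are the candidate discrepancy subset.
top_idx = np.argsort(-np.abs(v))[:k]
```

In this toy setup the largest loadings of `v` fall on the planted cluster, which is the sense in which a direction is "associated with a sample subset". Whether the paper enforces the reference constraint this way is not something this page confirms.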
If this is right
- KODA produces sample subsets that can be used directly for targeted representation alignment between models.
- The method scales to large vision-language datasets via random projections and Random Fourier Features for joint kernels.
- Discrepancy directions remain consistent across different vision-language models such as CLIP and SigLIP.
- The framework supports identification of modality-specific interactions that drive representation differences.
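The scalability bullet can be made concrete with a minimal sketch of Random Fourier Features for a joint image-text Gaussian kernel. It assumes the joint kernel is the product of one Gaussian kernel per modality, which equals a single Gaussian kernel on the concatenated, bandwidth-rescaled features. The function names, bandwidths, and feature count are hypothetical, and the paper's actual composition rule may differ.

```python
import numpy as np

def rff_features(X, num_features, sigma, rng):
    """Random Fourier Features z(x) with z(x) @ z(y) ~ exp(-|x-y|^2 / (2 sigma^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def joint_rff_features(X_img, X_txt, num_features, sigma_img, sigma_txt, rng):
    """RFF for the product of an image Gaussian kernel and a text Gaussian kernel.

    Assumption: the product of the two Gaussian kernels is a unit-bandwidth
    Gaussian kernel on the concatenation of the rescaled features.
    """
    X_joint = np.hstack([X_img / sigma_img, X_txt / sigma_txt])
    return rff_features(X_joint, num_features, sigma=1.0, rng=rng)

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix, used only to check the approximation."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
n = 50
X_img = rng.normal(size=(n, 8))   # stand-in for image embeddings
X_txt = rng.normal(size=(n, 6))   # stand-in for text embeddings

Z = joint_rff_features(X_img, X_txt, 4096, sigma_img=3.0, sigma_txt=3.0, rng=rng)
K_exact = gaussian_kernel(X_img, 3.0) * gaussian_kernel(X_txt, 3.0)
max_err = np.max(np.abs(Z @ Z.T - K_exact))
```

With features of this form, the n x n kernel matrix never has to be built: downstream eigenproblems can work with the D x D matrix `Z.T @ Z / n` instead, which is the usual reason for using such approximations on large datasets.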
Where Pith is reading between the lines
- If the discrepancy directions prove stable under different kernel choices, KODA could serve as a diagnostic tool for auditing alignment in deployed multimodal systems.
- The sample subsets surfaced by KODA might be used to construct targeted fine-tuning datasets that close specific representation gaps without full retraining.
- Extending the joint kernel construction to additional modalities beyond vision and language could reveal cross-modal coherence patterns not captured by pairwise comparisons.
Load-bearing premise
The constrained optimization over joint kernels isolates structurally meaningful coherence differences rather than optimization artifacts or kernel-specific biases.
What would settle it
Running KODA on two representations known to differ only by random noise and finding that the resulting directions do not correspond to any consistent sample subsets or modality patterns.
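A null test of this kind could be sketched as follows. The code is not KODA itself: it reuses an assumed relaxation (top eigenvalue of K_A - lam * K_B, scaled by 1/n) as a stand-in score, and compares a noise-only pair of representations against a pair with a planted cluster. All names, noise levels, and sizes are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def top_discrepancy_eigenvalue(X_a, X_b, sigma, lam=1.0):
    """Largest eigenvalue of (K_a - lam * K_b) / n, an assumed discrepancy score."""
    n = X_a.shape[0]
    M = (gaussian_kernel(X_a, sigma) - lam * gaussian_kernel(X_b, sigma)) / n
    return np.linalg.eigvalsh(M)[-1]

rng = np.random.default_rng(0)
n, d, sigma = 200, 16, 2.0
base = rng.normal(size=(n, d))

# Null case: the second representation is the first plus small isotropic noise.
null_vals = [
    top_discrepancy_eigenvalue(base, base + 0.05 * rng.normal(size=(n, d)), sigma)
    for _ in range(5)
]

# Planted case: 40 samples are tightly clustered in one representation only.
planted = base.copy()
planted[:40] = 3.0 + 0.1 * rng.normal(size=(40, d))
planted_val = top_discrepancy_eigenvalue(planted, base, sigma)
```

Comparing `planted_val` against the spread of `null_vals` gives a crude reference scale for deciding whether a reported direction stands out from what noise alone produces.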
Original abstract
Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under another. We propose Kernel Optimization for Discrepancy Analysis (KODA), a kernel-based framework for contrastive representation comparison and alignment. KODA constructs unified multimodal kernels through modality-wise kernel composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation. This yields interpretable discrepancy directions associated with specific sample subsets and modality interactions. To scale KODA to large vision-language datasets, we develop randomized low-dimensional approximations of joint kernels using random projections, including Random Fourier Features for shift-invariant kernels. Empirically, KODA identifies consistent and interpretable discrepancy structures across vision-language representations and provides sample subsets for representation alignment. The code is available at https://github.com/yokiwuuu/KODA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KODA, a kernel-based framework for contrastive representation comparison between vision-language models (e.g., CLIP, SigLIP). It constructs joint multimodal kernels via modality-wise composition, then casts discrepancy discovery as a constrained optimization that maximizes coherence in one embedding while suppressing it in a reference embedding. Randomized low-rank approximations (including Random Fourier Features) are used for scalability. The central claim is that this procedure yields interpretable discrepancy directions tied to specific sample subsets and modality interactions, with empirical results showing consistent structures and utility for representation alignment. Code is released.
Significance. If the recovered directions prove robust to kernel choice and projection artifacts, KODA would supply a useful interpretability tool that goes beyond downstream-task comparisons for multimodal representations. The explicit release of code supports reproducibility, which strengthens the contribution if the empirical claims hold under scrutiny.
major comments (2)
- [Abstract / Method] Abstract and method description: the claim that the constrained optimization 'yields interpretable discrepancy directions associated with specific sample subsets' is load-bearing, yet the manuscript provides no identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement. Without such guarantees, it remains possible that the directions reflect kernel composition or randomized approximation artifacts rather than intrinsic representation differences.
- [Abstract] Abstract: the statement 'Empirically, KODA identifies consistent and interpretable discrepancy structures' is presented without any reported quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence directly undermines assessment of whether the optimization isolates structurally meaningful coherence differences.
minor comments (1)
- [Abstract] The GitHub link is provided, which is helpful for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the theoretical grounding and empirical presentation of KODA. We respond to each major comment below.
Point-by-point responses
- Referee: [Abstract / Method] Abstract and method description: the claim that the constrained optimization 'yields interpretable discrepancy directions associated with specific sample subsets' is load-bearing, yet the manuscript provides no identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement. Without such guarantees, it remains possible that the directions reflect kernel composition or randomized approximation artifacts rather than intrinsic representation differences.
  Authors: We agree that formal identifiability, stability bounds, or invariance results with respect to kernel choice, projection dimension, and Lagrange enforcement are absent from the manuscript. The method is motivated by the explicit optimization objective of maximizing coherence differences, and we report empirical consistency across runs and models. In revision we will add a dedicated limitations subsection discussing potential approximation artifacts and include new experiments on stability under varying random-projection dimensions and kernel families.
  Revision: partial
- Referee: [Abstract] Abstract: the statement 'Empirically, KODA identifies consistent and interpretable discrepancy structures' is presented without any reported quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence directly undermines assessment of whether the optimization isolates structurally meaningful coherence differences.
  Authors: The abstract is intentionally concise. The full manuscript contains quantitative consistency metrics, ablation studies on kernel parameters and projection rank, and cross-model comparisons in the experimental section. We will revise the abstract to explicitly reference these quantitative results and ensure error bars are reported for all consistency measures.
  Revision: yes
- Formal identifiability result, stability bound, or invariance analysis with respect to base-kernel choice, random-projection dimension, or Lagrange-multiplier enforcement
Circularity Check
No circularity: KODA presented as independent constrained optimization over kernels
The abstract and method description formulate discrepancy discovery as a constrained optimization over joint kernels constructed via modality-wise composition, with randomized approximations for scaling. No equations, self-citations, or claims are shown that reduce the recovered directions or coherence measures to fitted parameters, prior self-referential results, or definitions that embed the target output. The procedure is described as an external optimization task whose outputs (interpretable directions) are not forced by construction from the inputs, satisfying the criteria for a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: modality-wise kernel composition produces valid unified multimodal kernels.
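For two common composition rules this assumption can be checked numerically. The sketch below assumes the composition is either an elementwise product or a convex combination of per-modality kernel matrices. Both preserve positive semidefiniteness (the product case is the Schur product theorem). The kernel choices, bandwidth, and random embeddings are illustrative only and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def cosine_kernel(X):
    """Cosine-similarity kernel matrix for the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

rng = np.random.default_rng(0)
X_img = rng.normal(size=(60, 10))  # stand-in for image embeddings
X_txt = rng.normal(size=(60, 7))   # stand-in for text embeddings

K_img = gaussian_kernel(X_img, sigma=3.0)
K_txt = cosine_kernel(X_txt)

K_prod = K_img * K_txt             # elementwise (Schur) product
K_sum = 0.5 * (K_img + K_txt)      # convex combination

# A valid kernel matrix has no eigenvalue below zero (up to round-off).
min_eig_prod = np.linalg.eigvalsh(K_prod)[0]
min_eig_sum = np.linalg.eigvalsh(K_sum)[0]
```

Other composition rules, such as a difference of kernels, carry no such guarantee, which is why the validity of the composed kernel is listed as an assumption.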