What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

Hyejin Go; Hyesong Choi; Semi Lee

arxiv: 2605.22651 · v1 · pith:UBZLRB77new · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language pretraining · compositional generalization · data curation · counterfactual intervention · CLIP · phrase sensitivity · image-text alignment · relation understanding

The pith

Counterfactual nonce substitutions on caption phrases produce sensitivity scores that select data subsets improving compositional performance over global alignment filtering in vision-language pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CLIP-style pretraining filters image-text pairs by overall alignment, but this signal stops improving once coarse errors are removed because it cannot tell whether individual phrases inside a caption actually support the match. The paper introduces Counterfactual Phrase Intervention to replace specific words with nonce tokens and measure how much each change affects the image-text score. Ranking surviving pairs by these phrase-sensitivity scores lets a 50-percent subset outperform the full dataset on relation and compositionality benchmarks while preserving transfer performance. The approach works orthogonally to existing losses and yields further gains when applied to NegCLIP.

Core claim

Global pair-level alignment conflates broad plausibility with whether specific object, attribute, and relation phrases materially drive the image-text match. CPI converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores, then ranks the post-filter pool by this first-order signal. At CC3M scale the top 50 percent of pairs improve VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, raise SugarCrepe scores, and leave general transfer intact; the same ranking applied to NegCLIP adds another +3.84 on the relation metric.
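The scoring step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `toy_score` is a stand-in for a CLIP-style image-text similarity s(I, T), the nonce token `blick` and the phrase spans are hypothetical, and the paper's actual substitution protocol is not given in the text reviewed here.

```python
from typing import Callable, List, Sequence, Tuple

NONCE = "blick"  # hypothetical nonce token; the paper's choice is not stated here


def substitute(tokens: Sequence[str], span: Tuple[int, int], nonce: str = NONCE) -> List[str]:
    """Replace tokens[start:end] with a single nonce token."""
    start, end = span
    return list(tokens[:start]) + [nonce] + list(tokens[end:])


def phrase_sensitivities(
    image: frozenset,
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
    score: Callable[[frozenset, Sequence[str]], float],
) -> List[float]:
    """Return Delta_j = s(I, T) - s(I, T~_j) for each phrase span j."""
    base = score(image, tokens)
    return [base - score(image, substitute(tokens, span)) for span in spans]


def toy_score(image: frozenset, tokens: Sequence[str]) -> float:
    """Stand-in scorer (assumption): fraction of caption tokens naming something in the image."""
    if not tokens:
        return 0.0
    return sum(tok in image for tok in tokens) / len(tokens)


# Toy "image" represented as the set of concepts visible in it.
image = frozenset({"dog", "ball", "grass"})
caption = ["dog", "chasing", "ball", "in", "city"]
spans = [(0, 1), (2, 3), (4, 5)]  # the phrases "dog", "ball", "city"
deltas = phrase_sensitivities(image, caption, spans, toy_score)
```

With this toy scorer, replacing "dog" or "ball" lowers the score by 0.2, while replacing "city" (which the image does not support) changes nothing. That is the sense in which a phrase can sit in a plausible caption without contributing to the match.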

What carries the argument

Counterfactual Phrase Intervention (CPI), which turns controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores used to rank captions by their measurable compositional contribution to alignment.

If this is right

A 50-percent subset ranked by CPI raises VL-CheckList-VG Relation by +1.91 over the full-data baseline.
The same subset improves SugarCrepe overall while preserving general transfer performance.
At matched data budget CPI outperforms alignment-only filtering by +1.00 on the relation metric.
CPI is loss-orthogonal and adds +3.84 on VL-CheckList-VG Relation when applied unchanged to NegCLIP.
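
Read together, the first and third points imply how much of the gain alignment-only filtering already delivers at the same budget. The step below assumes both deltas are measured on the same metric (VL-CheckList-VG Relation) and against the same training setup; the labels Full, Align, and CPI are shorthand introduced here.

```latex
\begin{align*}
\text{CPI} - \text{Full}   &= +1.91, \\
\text{CPI} - \text{Align}  &= +1.00, \\
\text{Align} - \text{Full} &= 1.91 - 1.00 = +0.91.
\end{align*}
```

Under that reading, alignment-only filtering accounts for about 0.91 of the 1.91 improvement, and ranking by CPI contributes the remaining 1.00.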

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar nonce-substitution ranking could be tested on video or audio-text datasets to check whether phrase sensitivity generalizes beyond still images.
Iterative application of CPI during training might further amplify compositional gains by repeatedly pruning low-sensitivity pairs.
If the selected subsets reduce caption hallucinations in downstream generation tasks, the method would offer a practical filter for production VL pipelines.

Load-bearing premise

Nonce-token substitutions create reliable image-conditioned phrase-sensitivity scores that isolate compositional contributions without substitution artifacts or conflation with global alignment effects.

What would settle it

Training a model on the CPI-ranked 50 percent subset and observing no improvement (or reversal) on VL-CheckList-VG Relation and SugarCrepe relative to the full-data or alignment-only baselines would falsify the ranking signal.

Figures

Figures reproduced from arXiv: 2605.22651 by Hyejin Go, Hyesong Choi, Semi Lee.

**Figure 2.** Stage 2 CPI scoring: illustration and empirical score distributions.

Original abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural - a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
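
The two-stage selection the abstract describes (coarse mismatch removal by global alignment, then ranking by phrase sensitivity) can be sketched as follows. Several details are assumptions made for illustration: the `Pair` record, the `min_alignment` threshold, the idea that each pair carries one aggregated sensitivity value, and the choice to size the budget as a fraction of the original pool. The toy numbers are invented and carry no empirical meaning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    pair_id: str
    alignment: float    # global image-text score s(I, T)
    sensitivity: float  # pair-level CPI score (aggregation rule is an assumption)


def select(pool: List[Pair], min_alignment: float, keep_fraction: float) -> List[Pair]:
    """Stage 1: drop coarse mismatches. Stage 2: rank survivors by sensitivity."""
    survivors = [p for p in pool if p.alignment >= min_alignment]
    ranked = sorted(survivors, key=lambda p: p.sensitivity, reverse=True)
    budget = int(len(pool) * keep_fraction)  # budget relative to the full pool (assumption)
    return ranked[:budget]


# Toy pool with invented scores.
pool = [
    Pair("a", alignment=0.35, sensitivity=0.30),
    Pair("b", alignment=0.10, sensitivity=0.90),  # removed in stage 1 despite high sensitivity
    Pair("c", alignment=0.32, sensitivity=0.05),
    Pair("d", alignment=0.28, sensitivity=0.20),
    Pair("e", alignment=0.40, sensitivity=0.01),  # best alignment, least sensitive
    Pair("f", alignment=0.25, sensitivity=0.15),
]
selected = select(pool, min_alignment=0.2, keep_fraction=0.5)
selected_ids = [p.pair_id for p in selected]
```

Here the selected pairs are a, d, and f. Pair e has the highest global alignment but is not kept, which is the behaviour that distinguishes this ranking from stricter alignment filtering.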

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CPI gives a practical phrase-sensitivity filter that beats global alignment on compositional benchmarks with half the data, but the nonce substitutions risk measuring caption plausibility instead of isolated phrase contributions.

Hi colleague, the main thing here is that once global alignment has removed obvious bad pairs, further tightening it stops tracking the compositional content in the captions. They address this with Counterfactual Phrase Intervention, which swaps individual phrases for nonce tokens and ranks pairs by how much that changes the image-text alignment score. This produces a 50% subset that lifts VL-CheckList-VG Relation by 1.91 over the full CC3M baseline and 1.00 over alignment-only filtering at the same budget, with gains on SugarCrepe and no loss on general transfer. The same signal stacks on top of NegCLIP for an extra 3.84 on the relation benchmark. That is the concrete empirical result worth noting.

What is new is the controlled substitution step that turns the global score into a phrase-level sensitivity signal rather than another grounding or identification trick. The evaluation keeps the comparison fair by matching data budgets and using external compositional tests, which is better than many filtering papers.

The soft spot is exactly the one in the stress-test note. Nonce substitutions can easily make the rest of the caption less grammatical or semantically odd, so the delta may partly reflect how much the model dislikes weird text overall instead of how well the original phrase grounds to the image. The abstract does not spell out how the nonce tokens are chosen or whether they ran checks for substitution artifacts, so the +1.91 gain could contain some extra plausibility filtering on top of the intended compositional signal. If the full paper has examples, ablations, or controls that keep substitutions minimal and local, that concern shrinks; otherwise it stays load-bearing.

The rest of the paper looks standard: external benchmarks, no obvious circularity in the claims, and a straightforward empirical setup. This is for people who curate or filter large vision-language datasets and care about compositionality without adding more data. A reader working on similar selection methods would get usable numbers and a clear baseline to beat. It is worth sending to peer review because the gains are specific, the method is simple to reimplement, and the open question about substitution artifacts is fixable with targeted revisions rather than a fundamental flaw.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that global alignment scores in CLIP-style contrastive pretraining saturate and fail to capture compositional supervision from individual phrases in captions. It introduces Counterfactual Phrase Intervention (CPI), which uses controlled nonce-token substitutions to derive image-conditioned phrase-sensitivity scores for ranking and selecting data after coarse global filtering. At CC3M scale, a 50% subset ranked by this signal improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while also improving SugarCrepe overall, preserving general transfer, and yielding further gains (+3.84 on VL-CheckList-VG Relation) when applied loss-orthogonally to NegCLIP.

Significance. If the phrase-sensitivity scores reliably isolate compositional contributions, the work would offer a practical, loss-orthogonal approach to data curation that improves compositional generalization at reduced data budgets without altering the training objective. The reported gains on relation and composition benchmarks, combined with maintained general performance, suggest value for efficient pretraining pipelines. The loss-orthogonal demonstration on NegCLIP strengthens the case for broad applicability.

major comments (2)

[Abstract] Abstract: The central empirical claim (+1.91 / +1.00 gains on VL-CheckList-VG Relation) is load-bearing on the assumption that nonce-token substitutions produce phrase-sensitivity scores that isolate compositional grounding rather than global coherence or plausibility artifacts. The skeptic concern that substitutions may systematically degrade grammaticality or introduce OOD text, causing the delta to partly measure implausibility penalties instead of phrase-image support, is not addressed by the provided description; explicit controls or analysis are needed to show the signal does not reduce to stricter global filtering.
[Evaluation] Evaluation section: The abstract reports benchmark improvements but provides no implementation details, substitution strategy, or error analysis. To substantiate that CPI captures compositional supervision beyond alignment-only baselines, the manuscript must include ablations verifying that high-sensitivity phrases correlate with relation/attribute understanding independently of overall caption plausibility.

minor comments (1)

The description of how nonce tokens are chosen and whether they preserve local syntactic validity would benefit from additional clarification to support reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about potential confounds in the nonce-substitution signal and to provide the requested implementation details, ablations, and analysis. Our responses to the major comments are below.

Referee: [Abstract] Abstract: The central empirical claim (+1.91 / +1.00 gains on VL-CheckList-VG Relation) is load-bearing on the assumption that nonce-token substitutions produce phrase-sensitivity scores that isolate compositional grounding rather than global coherence or plausibility artifacts. The skeptic concern that substitutions may systematically degrade grammaticality or introduce OOD text, causing the delta to partly measure implausibility penalties instead of phrase-image support, is not addressed by the provided description; explicit controls or analysis are needed to show the signal does not reduce to stricter global filtering.

Authors: We agree that this is a critical point and that the original description did not sufficiently rule out the possibility that the sensitivity signal partly reflects global plausibility or grammatical degradation rather than phrase-specific compositional support. In the revised manuscript we have added a dedicated analysis subsection (Section 4.3) that (i) quantifies grammaticality changes using an external parser before and after nonce substitution, (ii) compares CPI-ranked subsets against a stricter global-alignment baseline at the same data budget, and (iii) reports a control using random (non-phrase-targeted) substitutions. The added results show that the reported gains on VL-CheckList-VG Relation persist after these controls and are not explained by stricter global filtering alone. revision: yes
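
The shape of control (iii) can be sketched as below. This is a hypothetical illustration, not the analysis the rebuttal refers to: the nonce token, the toy scorer, and the choice to draw control positions from tokens outside the targeted phrases are all assumptions. The toy scorer has no notion of grammaticality, so the sketch shows only how the comparison is set up, not whether the artifact concern is resolved.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

NONCE = "blick"  # hypothetical nonce token


def toy_score(image: Set[str], tokens: Sequence[str]) -> float:
    """Stand-in for s(I, T) (assumption): share of tokens that name something in the image."""
    return sum(tok in image for tok in tokens) / len(tokens)


def score_drop(image: Set[str], tokens: Sequence[str], index: int,
               score: Callable[[Set[str], Sequence[str]], float]) -> float:
    """Drop in score when the token at `index` is replaced by the nonce."""
    edited = list(tokens)
    edited[index] = NONCE
    return score(image, tokens) - score(image, edited)


def targeted_vs_control(image: Set[str], tokens: Sequence[str], phrase_indices: List[int],
                        score: Callable[[Set[str], Sequence[str]], float],
                        rng: random.Random) -> Tuple[float, float]:
    """Mean score drop at phrase-targeted positions vs. randomly chosen other positions."""
    targeted = [score_drop(image, tokens, i, score) for i in phrase_indices]
    others = [i for i in range(len(tokens)) if i not in set(phrase_indices)]
    control_positions = rng.sample(others, k=min(len(phrase_indices), len(others)))
    control = [score_drop(image, tokens, i, score) for i in control_positions]

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return mean(targeted), mean(control)


image = {"dog", "ball", "grass"}
caption = ["dog", "chasing", "ball", "in", "city"]
targeted_mean, control_mean = targeted_vs_control(
    image, caption, phrase_indices=[0, 2], score=toy_score, rng=random.Random(0)
)
```

If targeted substitutions produced no larger drop than the random control on a real scorer, that would suggest the signal reflects generic text disruption rather than phrase-specific support.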
Referee: [Evaluation] Evaluation section: The abstract reports benchmark improvements but provides no implementation details, substitution strategy, or error analysis. To substantiate that CPI captures compositional supervision beyond alignment-only baselines, the manuscript must include ablations verifying that high-sensitivity phrases correlate with relation/attribute understanding independently of overall caption plausibility.

Authors: We acknowledge that the original manuscript lacked sufficient implementation details and ablations. The revised Evaluation section now includes: (1) a precise description of the nonce-token substitution procedure, including how target phrases are identified and replaced; (2) all hyperparameters and model choices used to compute the image-conditioned sensitivity scores; (3) a qualitative error analysis on 200 sampled high- and low-sensitivity phrases; and (4) new ablations that measure the correlation between phrase sensitivity and downstream relation/attribute performance while controlling for overall caption plausibility via matched-plausibility random-phrase baselines. These additions directly demonstrate that the CPI signal isolates compositional contributions beyond what global alignment provides. revision: yes

Circularity Check

0 steps flagged

No significant circularity: method and gains are externally benchmarked.

full rationale

The paper proposes CPI as a phrase-level ranking signal derived from nonce substitutions and alignment deltas, then evaluates the resulting 50% subset on independent external benchmarks (VL-CheckList-VG Relation, SugarCrepe) against both full-data and alignment-only baselines. No equation or step reduces the reported improvements to a quantity fitted inside the paper's own scoring procedure; the central claim remains a comparative empirical result on held-out compositional tests rather than a self-referential re-expression of the input alignment scores. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the sensitivity metric.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; CPI is treated as a methodological contribution rather than an introduced physical entity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores... Three-Invariance Replacement Protocol... $\Delta_j := s(I, T) - s(I, \tilde{T}_j)$"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "50%-data subset that improves VL-CheckList-VG Relation by +1.91"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

