What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
Pith reviewed 2026-05-22 06:16 UTC · model grok-4.3
The pith
Counterfactual nonce substitutions on caption phrases produce sensitivity scores that select data subsets improving compositional performance over global alignment filtering in vision-language pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Global pair-level alignment conflates broad plausibility with whether specific object, attribute, and relation phrases materially drive the image-text match. CPI converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores, then ranks the post-filter pool by this first-order signal. At CC3M scale the top 50 percent of pairs improve VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, raise SugarCrepe scores, and leave general transfer intact; the same ranking applied to NegCLIP adds another +3.84 on the relation metric.
What carries the argument
Counterfactual Phrase Intervention (CPI), which turns controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores used to rank captions by their measurable compositional contribution to alignment.
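A minimal sketch of the phrase-sensitivity idea, under loudly stated assumptions: for each phrase span, it measures how much the image-text score drops when the phrase is replaced by a nonce token, i.e. the difference Δ_j = s(I, T) − s(I, T̃_j). The scorer `toy_score` is a stand-in for a CLIP-style similarity, and the nonce string, span format, and function names are hypothetical illustrations rather than the paper's implementation.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # [start, end) token indices of one caption phrase

NONCE = "blicket"  # hypothetical nonce token; the paper's actual choice is not stated here


def toy_score(image_tags: Set[str], tokens: List[str]) -> float:
    """Stand-in for s(I, T): fraction of caption tokens that name something in the image."""
    if not tokens:
        return 0.0
    return sum(tok in image_tags for tok in tokens) / len(tokens)


def substitute(tokens: List[str], span: Span, nonce: str = NONCE) -> List[str]:
    """Replace every token in the span with the nonce, keeping caption length fixed."""
    start, end = span
    return tokens[:start] + [nonce] * (end - start) + tokens[end:]


def phrase_sensitivities(
    image_tags: Set[str],
    tokens: List[str],
    spans: List[Span],
    score: Callable[[Set[str], List[str]], float] = toy_score,
) -> List[float]:
    """Return one delta per span: original score minus score after nonce substitution."""
    base = score(image_tags, tokens)
    return [base - score(image_tags, substitute(tokens, span)) for span in spans]


# Toy usage: the object and attribute phrases are present in the image, the relation is not.
image_tags = {"dog", "red", "ball"}
caption = ["a", "dog", "chasing", "a", "red", "ball"]
deltas = phrase_sensitivities(image_tags, caption, [(1, 2), (2, 3), (4, 6)])
```

With a real model, `score` would be replaced by the image-text similarity of the encoder being used; the toy scorer only makes the bookkeeping visible.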
If this is right
- A 50-percent subset ranked by CPI raises VL-CheckList-VG Relation by +1.91 over the full-data baseline.
- The same subset improves SugarCrepe overall while preserving general transfer performance.
- At matched data budget CPI outperforms alignment-only filtering by +1.00 on the relation metric.
- CPI is loss-orthogonal and adds +3.84 on VL-CheckList-VG Relation when applied unchanged to NegCLIP.
Where Pith is reading between the lines
- Similar nonce-substitution ranking could be tested on video or audio-text datasets to check whether phrase sensitivity generalizes beyond still images.
- Iterative application of CPI during training might further amplify compositional gains by repeatedly pruning low-sensitivity pairs.
- If the selected subsets reduce caption hallucinations in downstream generation tasks, the method would offer a practical filter for production VL pipelines.
Load-bearing premise
Nonce-token substitutions create reliable image-conditioned phrase-sensitivity scores that isolate compositional contributions without substitution artifacts or conflation with global alignment effects.
What would settle it
Training a model on the CPI-ranked 50 percent subset and observing no improvement (or reversal) on VL-CheckList-VG Relation and SugarCrepe relative to the full-data or alignment-only baselines would falsify the ranking signal.
Original abstract
CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
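The two-stage selection the abstract describes (coarse removal by global alignment, then ranking by phrase sensitivity) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the `Pair` fields, the threshold `min_alignment`, the single aggregated `sensitivity` value per pair, and the choice to express the budget as a fraction of the unfiltered pool are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    pair_id: str
    alignment: float    # global image-text score, used only for coarse filtering
    sensitivity: float  # per-pair phrase-sensitivity score (aggregation rule is assumed)


def select_subset(pool: List[Pair], min_alignment: float, keep_fraction: float) -> List[Pair]:
    """Drop coarse mismatches, then keep the most phrase-sensitive pairs within the budget."""
    # Stage 1: global alignment removes only clear mismatches.
    survivors = [p for p in pool if p.alignment >= min_alignment]
    # Stage 2: rank the surviving pool by phrase sensitivity, highest first.
    ranked = sorted(survivors, key=lambda p: p.sensitivity, reverse=True)
    # Assumption: the data budget is a fraction of the original, unfiltered pool.
    budget = int(len(pool) * keep_fraction)
    return ranked[:budget]


pool = [
    Pair("a", alignment=0.9, sensitivity=0.1),
    Pair("b", alignment=0.1, sensitivity=0.9),  # high sensitivity but a coarse mismatch
    Pair("c", alignment=0.8, sensitivity=0.5),
    Pair("d", alignment=0.7, sensitivity=0.3),
]
selected = select_subset(pool, min_alignment=0.5, keep_fraction=0.5)
```

Note that pair "a" has the highest alignment yet is not selected, which is the behaviour that distinguishes this ranking from stricter alignment-only filtering at the same budget.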
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that global alignment scores in CLIP-style contrastive pretraining saturate and fail to capture compositional supervision from individual phrases in captions. It introduces Counterfactual Phrase Intervention (CPI), which uses controlled nonce-token substitutions to derive image-conditioned phrase-sensitivity scores for ranking and selecting data after coarse global filtering. At CC3M scale, a 50% subset ranked by this signal improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while also improving SugarCrepe overall, preserving general transfer, and yielding further gains (+3.84 on VL-CheckList-VG Relation) when applied loss-orthogonally to NegCLIP.
Significance. If the phrase-sensitivity scores reliably isolate compositional contributions, the work would offer a practical, loss-orthogonal approach to data curation that improves compositional generalization at reduced data budgets without altering the training objective. The reported gains on relation and composition benchmarks, combined with maintained general performance, suggest value for efficient pretraining pipelines. The loss-orthogonal demonstration on NegCLIP strengthens the case for broad applicability.
Major comments (2)
- [Abstract] The central empirical claim (+1.91 / +1.00 gains on VL-CheckList-VG Relation) rests on the assumption that nonce-token substitutions produce phrase-sensitivity scores that isolate compositional grounding rather than global coherence or plausibility artifacts. A skeptic's concern is that substitutions may systematically degrade grammaticality or introduce out-of-distribution (OOD) text, so that the delta partly measures implausibility penalties instead of phrase-image support. The provided description does not address this; explicit controls or analysis are needed to show that the signal does not reduce to stricter global filtering.
- [Evaluation] The abstract reports benchmark improvements but provides no implementation details, substitution strategy, or error analysis. To substantiate that CPI captures compositional supervision beyond alignment-only baselines, the manuscript must include ablations verifying that high-sensitivity phrases correlate with relation/attribute understanding independently of overall caption plausibility.
Minor comments (1)
- The manuscript should clarify how nonce tokens are chosen and whether they preserve local syntactic validity, to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about potential confounds in the nonce-substitution signal and to provide the requested implementation details, ablations, and analysis. Our responses to the major comments are below.
- Referee: [Abstract] The central empirical claim (+1.91 / +1.00 gains on VL-CheckList-VG Relation) rests on the assumption that nonce-token substitutions produce phrase-sensitivity scores that isolate compositional grounding rather than global coherence or plausibility artifacts. A skeptic's concern is that substitutions may systematically degrade grammaticality or introduce out-of-distribution (OOD) text, so that the delta partly measures implausibility penalties instead of phrase-image support. The provided description does not address this; explicit controls or analysis are needed to show that the signal does not reduce to stricter global filtering.
  Authors: We agree that this is a critical point and that the original description did not sufficiently rule out the possibility that the sensitivity signal partly reflects global plausibility or grammatical degradation rather than phrase-specific compositional support. In the revised manuscript we have added a dedicated analysis subsection (Section 4.3) that (i) quantifies grammaticality changes using an external parser before and after nonce substitution, (ii) compares CPI-ranked subsets against a stricter global-alignment baseline at the same data budget, and (iii) reports a control using random (non-phrase-targeted) substitutions. The added results show that the reported gains on VL-CheckList-VG Relation persist after these controls and are not explained by stricter global filtering alone.
  Revision: yes
- Referee: [Evaluation] The abstract reports benchmark improvements but provides no implementation details, substitution strategy, or error analysis. To substantiate that CPI captures compositional supervision beyond alignment-only baselines, the manuscript must include ablations verifying that high-sensitivity phrases correlate with relation/attribute understanding independently of overall caption plausibility.
  Authors: We acknowledge that the original manuscript lacked sufficient implementation details and ablations. The revised Evaluation section now includes: (1) a precise description of the nonce-token substitution procedure, including how target phrases are identified and replaced; (2) all hyperparameters and model choices used to compute the image-conditioned sensitivity scores; (3) a qualitative error analysis on 200 sampled high- and low-sensitivity phrases; and (4) new ablations that measure the correlation between phrase sensitivity and downstream relation/attribute performance while controlling for overall caption plausibility via matched-plausibility random-phrase baselines. These additions directly demonstrate that the CPI signal isolates compositional contributions beyond what global alignment provides.
  Revision: yes
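One simple way to probe the referee's worry that the signal might reduce to stricter global filtering is to compare, at a matched budget, the subset chosen by alignment alone with the subset chosen by phrase sensitivity. The sketch below computes their Jaccard overlap; it is a hypothetical diagnostic offered for illustration, not the analysis the rebuttal describes, and the function names and score dictionaries are assumptions.

```python
from typing import Dict, Set


def top_k_ids(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the ids of the k highest-scoring pairs."""
    ranked = sorted(scores, key=lambda pair_id: scores[pair_id], reverse=True)
    return set(ranked[:k])


def subset_overlap(alignment: Dict[str, float], sensitivity: Dict[str, float], k: int) -> float:
    """Jaccard overlap between the alignment-ranked and sensitivity-ranked top-k subsets."""
    by_alignment = top_k_ids(alignment, k)
    by_sensitivity = top_k_ids(sensitivity, k)
    union = by_alignment | by_sensitivity
    # Two empty subsets are treated as identical.
    return len(by_alignment & by_sensitivity) / len(union) if union else 1.0


# Toy scores for four pairs; the values are made up for illustration.
alignment = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
sensitivity = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.2}
overlap = subset_overlap(alignment, sensitivity, k=2)
```

An overlap near 1 would suggest the two rankings select nearly the same data; a low overlap only shows the subsets differ and says nothing by itself about which one carries more compositional supervision.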
Circularity Check
No significant circularity: method and gains are externally benchmarked.
The paper proposes CPI as a phrase-level ranking signal derived from nonce substitutions and alignment deltas, then evaluates the resulting 50% subset on independent external benchmarks (VL-CheckList-VG Relation, SugarCrepe) against both full-data and alignment-only baselines. No equation or step reduces the reported improvements to a quantity fitted inside the paper's own scoring procedure; the central claim remains a comparative empirical result on held-out compositional tests rather than a self-referential re-expression of the input alignment scores. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the sensitivity metric.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores... Three-Invariance Replacement Protocol... $\Delta_j := s(I, T) - s(I, \tilde{T}_j)$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: 50%-data subset that improves VL-CheckList-VG Relation by +1.91
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.