Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective
Pith reviewed 2026-05-07 09:48 UTC · model grok-4.3
The pith
Generating rule programs lets LLMs show their compositional understanding without partitioned test sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The rule-generation perspective asks an LLM to generate a program encoding the rules of a dataset mapping, then estimates the model's compositionality from that program using complexity-based theory. This addresses two limitations of compositional generalization tests: it probes the model's understanding of sample compositionality directly, and it avoids the combination leakage that dataset partitions introduce. Experiments on a string-to-grid task reveal varied compositionality characterizations and specific compositionality deficiencies in existing advanced LLMs.
What carries the argument
The rule-generation process in which an LLM produces an explicit program encoding the input-to-output mapping rules, quantified via complexity-based theory.
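A minimal sketch of that loop, assuming a hypothetical `llm` callable and using compressed program length as a crude stand-in for the paper's complexity-based estimate (the actual prompt and measure are not specified here):

```python
# Minimal sketch of the rule-generation loop. The prompt, the `llm`
# callable, and the zlib-based complexity proxy are illustrative
# assumptions, not the paper's actual implementation.
import zlib

def rule_generation_estimate(llm, dataset):
    """Ask an LLM for a program encoding the dataset's mapping rules,
    then score that program's complexity."""
    examples = "\n".join(f"{x} -> {y}" for x, y in dataset)
    prompt = (
        "Write a Python function apply_rules(x) implementing the rules "
        "that map each input to its output:\n" + examples
    )
    program = llm(prompt)  # hypothetical call returning source code

    # Description-length proxy for Kolmogorov complexity: length of the
    # compressed program text; shorter programs suggest more compact,
    # compositional rule representations.
    complexity = len(zlib.compress(program.encode("utf-8")))
    return program, complexity
```

The key property is that the estimate is computed from the generated program itself, so no held-out partition of unseen combinations is required.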
If this is right
- Compositionality estimates become possible without constructing any test set of unseen input combinations.
- Evaluation can inspect the model's explicit rule representation rather than only the correctness of its final answers.
- Different LLMs display distinguishable patterns of compositional strength and failure on the same mapping task.
- Specific deficiencies, such as failure to capture certain rule interactions, become identifiable through program inspection.
Where Pith is reading between the lines
- The same rule-generation lens could be applied to other structured domains such as arithmetic or logical reasoning to test generality.
- Generated programs might be fed back as training signals to encourage models to internalize explicit compositional rules.
- Program complexity scores could be tracked over model scale or training data volume to chart progress on compositionality.
Load-bearing premise
That the program an LLM writes to describe a mapping genuinely reflects its internal compositional understanding and that program complexity is a valid proxy for that understanding.
What would settle it
Finding that models whose generated programs have lower measured complexity do not outperform higher-complexity models on novel combinations outside the original task distribution.
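A hedged sketch of that settling experiment, with placeholder numbers standing in for real per-model measurements:

```python
# Placeholder data: per-model complexity scores from rule generation and
# per-model accuracy on novel combinations outside the task distribution.
from scipy.stats import spearmanr

complexity_scores = [412, 365, 598, 287]      # illustrative values only
ood_accuracies    = [0.71, 0.78, 0.52, 0.85]  # illustrative values only

rho, p_value = spearmanr(complexity_scores, ood_accuracies)
# The complexity proxy survives only if the correlation is strongly
# negative; rho near zero or positive would settle the question against it.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```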
Original abstract
Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sample compositionality, resulting in explainability defects; (2) they rely on dataset partition to form the test set with combinations unseen in the training set, suffering from combination leakage issues. In this work, we propose a novel rule-generation perspective for compositionality estimation for LLMs. It requires LLMs to generate a program as rules for dataset mapping and provides estimates of the compositionality of LLMs using complexity-based theory. The perspective addresses the limitations of compositional generalization tests and provides a new way to analyze the compositionality characterization of LLMs. We conduct experiments and analysis of existing advanced LLMs based on this perspective on a string-to-grid task, and find various compositionality characterizations and compositionality deficiencies exhibited by LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that compositional generalization tests for LLMs suffer from explainability defects (focusing only on outputs, not internal understanding) and combination leakage from dataset partitions. It proposes a partition-free alternative: prompting LLMs to generate programs encoding dataset mappings, then estimating compositionality via complexity-based theory applied to those programs. Experiments on a string-to-grid task are reported to reveal varied compositionality characterizations and deficiencies across advanced LLMs.
Significance. If the mapping from program complexity to compositionality is shown to be valid and bias-free, the approach could provide a more interpretable, split-independent probe of LLMs' rule comprehension, addressing a genuine gap in current evaluation practices. However, the absence of derivation details, error analysis, or calibration in the provided description limits immediate assessment of its potential impact.
Major comments (2)
- [Method / Experiments] The central claim that LLM-generated program complexity yields a valid estimate of compositionality understanding is load-bearing but unsupported by validation. No controls for program correctness, minimality, or equivalence classes are described, raising the risk that stylistic or fluency artifacts (rather than rule comprehension) drive the complexity scores.
- [Experiments] The string-to-grid experiments report 'various characterizations' without calibration against ground-truth compositional models or standard benchmarks. This leaves open whether the complexity scores distinguish compositional from non-compositional mappings in a controlled way.
Minor comments (1)
- [Abstract] The abstract states the motivation and high-level method but supplies no derivation details, validation of the complexity measure, or quantitative results, making it difficult to assess soundness from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and calibration that we will address to strengthen the work. We respond point-by-point to the major comments below.
Point-by-point responses
Referee: [Method / Experiments] The central claim that LLM-generated program complexity yields a valid estimate of compositionality understanding is load-bearing but unsupported by validation. No controls for program correctness, minimality, or equivalence classes are described, raising the risk that stylistic or fluency artifacts (rather than rule comprehension) drive the complexity scores.
Authors: We agree that explicit validation strengthens the central claim. While the manuscript applies complexity theory directly to generated programs as a proxy for rule understanding, we will revise to include: (1) a post-generation correctness filter that executes programs on held-out input-output pairs to confirm they implement the intended mapping; (2) explicit discussion of how the chosen complexity measure (approximating Kolmogorov complexity via program length) inherently penalizes non-minimal descriptions; and (3) an equivalence-class analysis showing that semantically identical rules expressed in different syntactic styles produce comparable complexity scores in our string-to-grid experiments. These additions mitigate concerns about stylistic artifacts driving the results. revision: yes
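One way the proposed correctness filter could be realized, sketched here as an assumption rather than the authors' actual harness, is to execute each generated program against held-out pairs and discard any that fail:

```python
# Sketch of the post-generation correctness filter described in the
# response; apply_rules is the hypothetical entry point the generated
# program is asked to define.
def passes_correctness_filter(program_src, heldout_pairs):
    namespace = {}
    try:
        # Caution: executes untrusted model output; sandbox in real use.
        exec(program_src, namespace)
        apply_rules = namespace["apply_rules"]
        return all(apply_rules(x) == y for x, y in heldout_pairs)
    except Exception:
        return False  # non-executing or incorrect programs are filtered out
```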
Referee: [Experiments] The string-to-grid experiments report 'various characterizations' without calibration against ground-truth compositional models or standard benchmarks. This leaves open whether the complexity scores distinguish compositional from non-compositional mappings in a controlled way.
Authors: The string-to-grid task was selected for its transparent primitive rules (e.g., independent control of color, shape, and position), allowing direct observation of compositionality. We acknowledge that the current results are primarily descriptive. In revision, we will add controlled calibration experiments: synthetic datasets with explicitly varied compositionality levels (from atomic rules to multi-rule compositions) and direct comparison of our complexity estimates against ground-truth measures. We will also report correlations with performance on established benchmarks such as SCAN to show that lower complexity scores align with stronger compositional generalization. This will provide the requested controlled distinction. revision: yes
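The calibration idea could look something like the sketch below, an illustration under assumed primitives rather than the authors' actual synthetic datasets: mappings are built by composing a known number of primitive string rules, so ground-truth compositional depth is fixed by construction.

```python
# Illustrative calibration data: compose a controlled number of primitive
# string rules so each dataset's compositional depth is known exactly.
# The primitives and inputs are assumptions for the sketch.
PRIMITIVES = [
    lambda s: s.replace("a", "@"),  # primitive rule 1: substitution
    lambda s: s[::-1],              # primitive rule 2: reversal
    lambda s: s.upper(),            # primitive rule 3: case change
]

def synthetic_dataset(depth, inputs=("abc", "bca", "cab")):
    """depth=1 yields an atomic rule; higher depth yields multi-rule
    compositions, so complexity estimates can be checked against it."""
    def mapping(s):
        for rule in PRIMITIVES[:depth]:
            s = rule(s)
        return s
    return [(x, mapping(x)) for x in inputs]
```

Complexity estimates computed on these sets can then be checked for monotonicity against the known depth.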
Circularity Check
No circularity: external complexity theory applied to LLM-generated programs
Full rationale
The derivation chain begins with prompting LLMs to emit programs encoding dataset mappings on a string-to-grid task, then applies an external complexity-based theory (description length or similar) to produce a compositionality estimate. No equation or step reduces the output to a fitted parameter, self-defined quantity, or self-citation chain; the estimate is computed from the generated programs rather than being presupposed by the inputs. The approach is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
[1] Anthropic. 2024. Claude 3.5 Sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
[2] Anthropic. 2025. Claude 3.7 Sonnet system card. https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf
[3] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron C. Courville. 2019. Systematic generalization: What is required and can it be learned? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=HkezXnA9YX
[4] DeepSeek-AI, Daya Guo, Dejian Yang, et al. 2025. arXiv.
[5] DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. arXiv.
[6] Stanislas Dehaene, Fosca Al Roumi, Yair Lakretz, Samuel Planton, and Mathias Sablé-Meyer. 2022. Symbols and mental programs: a hypothesis about human singularity. Trends in Cognitive Sciences, 26(9):751-766. https://doi.org/10.1016/j.tics.2022.06.010
[7] Google. 2024. Introducing Gemini 2.0: our new AI model for the agentic era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#ceo-message
[8] Google. 2025. Gemini 2.5: Our most intelligent AI model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking
[9] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2020. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757-795. https://doi.org/10.1613/JAIR.1.11674
[10] Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, and Zhijing Jin. 2022. https://doi.org/1…
[11] Theo M.V. Janssen and Barbara H. Partee. 1997. Chapter 7 - Compositionality. In Johan van Benthem and Alice ter Meulen, editors, Handbook of Logic and Language, pages 417-473. North-Holland, Amsterdam. https://doi.org/10.1016/B978-044481714-3/50011-4
[12] Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic… https://openreview.net/forum?id=SygcCnNKwr
[13] Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9087-9105. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.731
[14] Andrei N. Kolmogorov. 1965. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1-7. http://alexander.shen.free.fr/library/Kolmogorov65_Three-Approaches-to-Information.pdf
[15] Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research. http://proceedings.mlr.press/v80/lake18a.html
[16] Zhaoyi Li, Gangwei Jiang, Hong Xie, Linqi Song, Defu Lian, and Ying Wei. 2024. Understanding and patching compositional reasoning in LLMs. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 9668-9688. Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.576
[17] Kate McCurdy, Paul Soulos, Paul Smolensky, Roland Fernandez, and Jianfeng Gao. 2024. Toward compositional behavior in neural models: A survey of current views. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages … Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.EMNLP-MAIN.524
[18] OpenAI, Aaron Hurst, Adam Lerer, et al. 2024. arXiv.
[19] OpenAI, Aaron Jaech, Adam Kalai, et al. 2024. arXiv.
[20] OpenAI. 2025. OpenAI o3-mini system card. https://cdn.openai.com/o3-mini-system-card-feb10.pdf
[21] Peter Pagin and Dag Westerståhl. 2010. Compositionality I: Definitions and variants. Philosophy Compass, 5(3):250-264. https://doi.org/10.1111/j.1747-9991.2009.00228.x
[22] Francis Jeffry Pelletier. 1994. The principle of semantic compositionality. Topoi, 13(1):11-24. https://doi.org/10.1007/BF00763644
[23] Qwen Team. 2025. QwQ-32B: Embracing the power of reinforcement learning. https://qwenlm.github.io/blog/qwq-32b/
[24] Zoltán Gendler Szabó. 2004. Compositionality. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University. http://seop.illc.uva.nl/entries/compositionality/
[25] Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. 2023. Compositional generalization from first principles. In Advances in Neural Information Processing Systems, volume 36, pages 6941-6960. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/15f6a10899f557ce53fe39939af6f930-Paper-Conference.pdf
[26] Ziyao Xu and Houfeng Wang. 2024. SPOR: A comprehensive and practical evaluation method for compositional generalization in data-to-text generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024… https://doi.org/10.18653/V1/2024.ACL-LONG.36
[27] An Yang, Baosong Yang, Beichen Zhang, et al. 2024. arXiv.