Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective
Pith reviewed 2026-05-07 09:48 UTC · model grok-4.3
The pith
Generating rule programs lets LLMs show their compositional understanding without partitioned test sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The rule-generation perspective asks an LLM to generate a program encoding the rules of a dataset mapping, then estimates the model's compositionality from that program using complexity-based theory. This addresses two limitations of compositional generalization tests: it probes the model's understanding of sample compositionality directly, and it avoids the combination leakage that dataset partitions introduce. Experiments on a string-to-grid task reveal varied compositionality characterizations and specific compositionality deficiencies in existing advanced LLMs.
What carries the argument
The rule-generation process in which an LLM produces an explicit program encoding the input-to-output mapping rules, quantified via complexity-based theory.
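A minimal sketch of that loop, assuming a hypothetical `llm` callable and using compressed program length as a crude stand-in for the paper's complexity-based estimate (the actual prompt and measure are not specified here):

```python
# Minimal sketch of the rule-generation loop. The prompt, the `llm`
# callable, and the zlib-based complexity proxy are illustrative
# assumptions, not the paper's actual implementation.
import zlib

def rule_generation_estimate(llm, dataset):
    """Ask an LLM for a program encoding the dataset's mapping rules,
    then score that program's complexity."""
    examples = "\n".join(f"{x} -> {y}" for x, y in dataset)
    prompt = (
        "Write a Python function apply_rules(x) implementing the rules "
        "that map each input to its output:\n" + examples
    )
    program = llm(prompt)  # hypothetical call returning source code

    # Description-length proxy for Kolmogorov complexity: length of the
    # compressed program text; shorter programs suggest more compact,
    # compositional rule representations.
    complexity = len(zlib.compress(program.encode("utf-8")))
    return program, complexity
```

The key property is that the estimate is computed from the generated program itself, so no held-out partition of unseen combinations is required.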
If this is right
- Compositionality estimates become possible without constructing any test set of unseen input combinations.
- Evaluation can inspect the model's explicit rule representation rather than only the correctness of its final answers.
- Different LLMs display distinguishable patterns of compositional strength and failure on the same mapping task.
- Specific deficiencies, such as failure to capture certain rule interactions, become identifiable through program inspection.
Where Pith is reading between the lines
- The same rule-generation lens could be applied to other structured domains such as arithmetic or logical reasoning to test generality.
- Generated programs might be fed back as training signals to encourage models to internalize explicit compositional rules.
- Program complexity scores could be tracked over model scale or training data volume to chart progress on compositionality.
Load-bearing premise
That the program an LLM writes to describe a mapping genuinely reflects its internal compositional understanding and that program complexity is a valid proxy for that understanding.
What would settle it
Finding that models whose generated programs have lower measured complexity do not outperform higher-complexity models on novel combinations outside the original task distribution.
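A hedged sketch of that settling experiment, with placeholder numbers standing in for real per-model measurements:

```python
# Placeholder data: per-model complexity scores from rule generation and
# per-model accuracy on novel combinations outside the task distribution.
from scipy.stats import spearmanr

complexity_scores = [412, 365, 598, 287]      # illustrative values only
ood_accuracies    = [0.71, 0.78, 0.52, 0.85]  # illustrative values only

rho, p_value = spearmanr(complexity_scores, ood_accuracies)
# The complexity proxy survives only if the correlation is strongly
# negative; rho near zero or positive would settle the question against it.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```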
Original abstract
Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sample compositionality, resulting in explainability defects; (2) they rely on dataset partition to form the test set with combinations unseen in the training set, suffering from combination leakage issues. In this work, we propose a novel rule-generation perspective for compositionality estimation for LLMs. It requires LLMs to generate a program as rules for dataset mapping and provides estimates of the compositionality of LLMs using complexity-based theory. The perspective addresses the limitations of compositional generalization tests and provides a new way to analyze the compositionality characterization of LLMs. We conduct experiments and analysis of existing advanced LLMs based on this perspective on a string-to-grid task, and find various compositionality characterizations and compositionality deficiencies exhibited by LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that compositional generalization tests for LLMs suffer from explainability defects (focusing only on outputs, not internal understanding) and combination leakage from dataset partitions. It proposes a partition-free alternative: prompting LLMs to generate programs encoding dataset mappings, then estimating compositionality via complexity-based theory applied to those programs. Experiments on a string-to-grid task are reported to reveal varied compositionality characterizations and deficiencies across advanced LLMs.
Significance. If the mapping from program complexity to compositionality is shown to be valid and bias-free, the approach could provide a more interpretable, split-independent probe of LLMs' rule comprehension, addressing a genuine gap in current evaluation practices. However, the absence of derivation details, error analysis, or calibration in the provided description limits immediate assessment of its potential impact.
Major comments (2)
- [Method / Experiments] The central claim that LLM-generated program complexity yields a valid estimate of compositionality understanding is load-bearing but unsupported by validation. No controls for program correctness, minimality, or equivalence classes are described, raising the risk that stylistic or fluency artifacts (rather than rule comprehension) drive the complexity scores.
- [Experiments] The string-to-grid experiments report 'various characterizations' without calibration against ground-truth compositional models or standard benchmarks. This leaves open whether the complexity scores distinguish compositional from non-compositional mappings in a controlled way.
Minor comments (1)
- [Abstract] The abstract states the motivation and high-level method but supplies no derivation details, validation of the complexity measure, or quantitative results, making it difficult to assess soundness from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and calibration that we will address to strengthen the work. We respond point-by-point to the major comments below.
Point-by-point responses
Referee: [Method / Experiments] The central claim that LLM-generated program complexity yields a valid estimate of compositionality understanding is load-bearing but unsupported by validation. No controls for program correctness, minimality, or equivalence classes are described, raising the risk that stylistic or fluency artifacts (rather than rule comprehension) drive the complexity scores.
Authors: We agree that explicit validation strengthens the central claim. While the manuscript applies complexity theory directly to generated programs as a proxy for rule understanding, we will revise to include: (1) a post-generation correctness filter that executes programs on held-out input-output pairs to confirm they implement the intended mapping; (2) explicit discussion of how the chosen complexity measure (approximating Kolmogorov complexity via program length) inherently penalizes non-minimal descriptions; and (3) an equivalence-class analysis showing that semantically identical rules expressed in different syntactic styles produce comparable complexity scores in our string-to-grid experiments. These additions mitigate concerns about stylistic artifacts driving the results. revision: yes
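One way the proposed correctness filter could be realized, sketched here as an assumption rather than the authors' actual harness, is to execute each generated program against held-out pairs and discard any that fail:

```python
# Sketch of the post-generation correctness filter described in the
# response; apply_rules is the hypothetical entry point the generated
# program is asked to define.
def passes_correctness_filter(program_src, heldout_pairs):
    namespace = {}
    try:
        # Caution: executes untrusted model output; sandbox in real use.
        exec(program_src, namespace)
        apply_rules = namespace["apply_rules"]
        return all(apply_rules(x) == y for x, y in heldout_pairs)
    except Exception:
        return False  # non-executing or incorrect programs are filtered out
```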
Referee: [Experiments] The string-to-grid experiments report 'various characterizations' without calibration against ground-truth compositional models or standard benchmarks. This leaves open whether the complexity scores distinguish compositional from non-compositional mappings in a controlled way.
Authors: The string-to-grid task was selected for its transparent primitive rules (e.g., independent control of color, shape, and position), allowing direct observation of compositionality. We acknowledge that the current results are primarily descriptive. In revision, we will add controlled calibration experiments: synthetic datasets with explicitly varied compositionality levels (from atomic rules to multi-rule compositions) and direct comparison of our complexity estimates against ground-truth measures. We will also report correlations with performance on established benchmarks such as SCAN to show that lower complexity scores align with stronger compositional generalization. This will provide the requested controlled distinction. revision: yes
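The calibration idea could look something like the sketch below, an illustration under assumed primitives rather than the authors' actual synthetic datasets: mappings are built by composing a known number of primitive string rules, so ground-truth compositional depth is fixed by construction.

```python
# Illustrative calibration data: compose a controlled number of primitive
# string rules so each dataset's compositional depth is known exactly.
# The primitives and inputs are assumptions for the sketch.
PRIMITIVES = [
    lambda s: s.replace("a", "@"),  # primitive rule 1: substitution
    lambda s: s[::-1],              # primitive rule 2: reversal
    lambda s: s.upper(),            # primitive rule 3: case change
]

def synthetic_dataset(depth, inputs=("abc", "bca", "cab")):
    """depth=1 yields an atomic rule; higher depth yields multi-rule
    compositions, so complexity estimates can be checked against it."""
    def mapping(s):
        for rule in PRIMITIVES[:depth]:
            s = rule(s)
        return s
    return [(x, mapping(x)) for x in inputs]
```

Complexity estimates computed on these sets can then be checked for monotonicity against the known depth.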
Circularity Check
No circularity: external complexity theory applied to LLM-generated programs
Full rationale
The derivation chain begins with prompting LLMs to emit programs encoding dataset mappings on a string-to-grid task, then applies an external complexity-based theory (description length or similar) to produce a compositionality estimate. No equation or step reduces the output to a fitted parameter, self-defined quantity, or self-citation chain; the estimate is computed from the generated programs rather than being presupposed by the inputs. The approach is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
[1] Anthropic. 2024. Claude 3.5 Sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
[2] Anthropic. 2025. Claude 3.7 Sonnet system card. https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf
[3] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron C. Courville. 2019. Systematic generalization: What is required and can it be learned? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=HkezXnA9YX
[4] DeepSeek-AI, Daya Guo, Dejian Yang, et al. 2025. arXiv.
[5] DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. arXiv.
[6] Stanislas Dehaene, Fosca Al Roumi, Yair Lakretz, Samuel Planton, and Mathias Sablé-Meyer. 2022. Symbols and mental programs: a hypothesis about human singularity. Trends in Cognitive Sciences, 26(9):751-766. https://doi.org/10.1016/j.tics.2022.06.010
[7] Google. 2024. Introducing Gemini 2.0: our new AI model for the agentic era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#ceo-message
[8] Google. 2025. Gemini 2.5: Our most intelligent AI model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking
[9] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2020. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757-795. https://doi.org/10.1613/JAIR.1.11674
[10] Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, and Zhijing Jin. 2022. https://doi.org/1…
[11] Theo M.V. Janssen and Barbara H. Partee. 1997. Chapter 7 - Compositionality. In Johan van Benthem and Alice ter Meulen, editors, Handbook of Logic and Language, pages 417-473. North-Holland, Amsterdam. https://doi.org/10.1016/B978-044481714-3/50011-4
[12] Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic… https://openreview.net/forum?id=SygcCnNKwr
[13] Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9087-9105. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.731
[14] Andrei N. Kolmogorov. 1965. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1-7. http://alexander.shen.free.fr/library/Kolmogorov65_Three-Approaches-to-Information.pdf
[15] Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research. http://proceedings.mlr.press/v80/lake18a.html
[16] Zhaoyi Li, Gangwei Jiang, Hong Xie, Linqi Song, Defu Lian, and Ying Wei. 2024. Understanding and patching compositional reasoning in LLMs. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 9668-9688. Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.576
[17] Kate McCurdy, Paul Soulos, Paul Smolensky, Roland Fernandez, and Jianfeng Gao. 2024. Toward compositional behavior in neural models: A survey of current views. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages … Association for Computational Linguistics. https://doi.org/10.18653/V1/2024.EMNLP-MAIN.524
[18] OpenAI, Aaron Hurst, Adam Lerer, et al. 2024. arXiv.
[19] OpenAI, Aaron Jaech, Adam Kalai, et al. 2024. arXiv.
[20] OpenAI. 2025. OpenAI o3-mini system card. https://cdn.openai.com/o3-mini-system-card-feb10.pdf
[21] Peter Pagin and Dag Westerståhl. 2010. Compositionality I: Definitions and variants. Philosophy Compass, 5(3):250-264. https://doi.org/10.1111/j.1747-9991.2009.00228.x
[22] Francis Jeffry Pelletier. 1994. The principle of semantic compositionality. Topoi, 13(1):11-24. https://doi.org/10.1007/BF00763644
[23] Qwen Team. 2025. QwQ-32B: Embracing the power of reinforcement learning. https://qwenlm.github.io/blog/qwq-32b/
[24] Zoltán Gendler Szabó. 2004. Compositionality. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University. http://seop.illc.uva.nl/entries/compositionality/
[25] Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. 2023. Compositional generalization from first principles. In Advances in Neural Information Processing Systems, volume 36, pages 6941-6960. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/15f6a10899f557ce53fe39939af6f930-Paper-Conference.pdf
[26] Ziyao Xu and Houfeng Wang. 2024. SPOR: A comprehensive and practical evaluation method for compositional generalization in data-to-text generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024… https://doi.org/10.18653/V1/2024.ACL-LONG.36
[27] An Yang, Baosong Yang, Beichen Zhang, et al. 2024. arXiv.