Compositional Generalization in Autoregressive Models via Logit Composition
Pith reviewed 2026-06-29 14:04 UTC · model grok-4.3
The pith
Autoregressive models composed via logits are projective under a factorized-conditionals assumption, so each component keeps independent control over its output subspace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a factorized-conditionals assumption, logit composition of autoregressive models produces a projective result in which each component model preserves control over its own designated subspace of the output distribution and avoids interference from the others. The projective property is invariant under smooth reparameterizations of the output space. Composition also preserves length-generalizing behavior whenever the factorization assumptions and component guarantees hold uniformly at the target length.
What carries the argument
Logit composition under the factorized-conditionals assumption, which produces the projective property in the joint output distribution.
If this is right
- Merging succeeds without interference precisely when the factorized-conditionals assumption is satisfied.
- The feature-space theorem extends the same projective guarantee to transformed output representations.
- Length generalization carries over to the merged model when the assumptions hold uniformly at longer sequences.
- Stable interactions between merged models are assured only under the stated factorization conditions.
Where Pith is reading between the lines
- The method could support reliable combination of skills learned on separate tasks without retraining.
- Testing on large language models would reveal how closely real conditional distributions match the factorization needed.
- Design choices that encourage factorized conditionals might make future models easier to compose by default.
Load-bearing premise
The factorized-conditionals assumption must hold for the projective property to be guaranteed.
What would settle it
A direct check whether the composed distribution shows interference between subspaces precisely when the conditionals fail to factorize.
Figures
read the original abstract
Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces logit composition for autoregressive models. Under a factorized-conditionals assumption, it claims the resulting composition is projective (each component preserves control over its designated subspace of the output distribution without interference). This property is preserved under smooth reparameterizations, yielding a feature-space theorem. Composition also preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length.
Significance. If the derivations are rigorous, the results supply a principled account of when and why composition/merging succeeds in autoregressive systems, identifying explicit conditions for non-interfering interactions and stable length generalization. This could inform practical model merging in LLMs.
major comments (2)
- [Abstract / composition analysis] The factorized-conditionals assumption is invoked at the outset of the composition analysis and is required for the projective property; however, the manuscript provides no explicit derivation or proof that the stated logit composition yields projectivity under this assumption, which is load-bearing for all three central claims (projectivity, feature-space theorem, length generalization).
- [feature-space theorem / length generalization section] The feature-space theorem and length-generalization result are both conditioned on the assumption holding uniformly at target length plus uniform component guarantees, but no verification of the assumption's scope for trained models or counter-example analysis is supplied, leaving the practical applicability of the theorems unassessed.
minor comments (2)
- Notation for the composition operator and the precise definition of 'subspace of the output distribution' should be introduced with an equation or diagram for clarity.
- [Introduction] The abstract states results hold 'when the factorization assumptions and component guarantees hold uniformly,' but the introduction could more explicitly contrast this conditional framing with unconditional claims sometimes made in the model-merging literature.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below.
read point-by-point responses
-
Referee: [Abstract / composition analysis] The factorized-conditionals assumption is invoked at the outset of the composition analysis and is required for the projective property; however, the manuscript provides no explicit derivation or proof that the stated logit composition yields projectivity under this assumption, which is load-bearing for all three central claims (projectivity, feature-space theorem, length generalization).
Authors: We agree that an explicit derivation would strengthen the presentation. The manuscript states the result under the assumption, but the step-by-step proof from the logit composition definition to projectivity can be expanded. In the revision, we will insert a dedicated subsection with the full derivation, demonstrating how the factorized conditionals ensure no interference in the designated subspaces. revision: yes
-
Referee: [feature-space theorem / length generalization section] The feature-space theorem and length-generalization result are both conditioned on the assumption holding uniformly at target length plus uniform component guarantees, but no verification of the assumption's scope for trained models or counter-example analysis is supplied, leaving the practical applicability of the theorems unassessed.
Authors: The results are theoretical and conditional on the stated assumptions. To address applicability, we will add a new subsection discussing the scope of the factorized-conditionals assumption in trained autoregressive models, including potential counter-examples where uniformity fails at longer lengths, and how this impacts the theorems' practical use. revision: yes
Circularity Check
No significant circularity; derivation follows from explicit assumption
full rationale
The paper's central results (projective composition, feature-space theorem, length generalization) are all explicitly conditioned on the factorized-conditionals assumption stated at the outset of the analysis. No step reduces a claimed prediction or theorem to a fitted parameter, self-definition, or self-citation chain; the projective property is derived directly from the assumption rather than being presupposed by the result itself. The abstract and claims frame the findings as holding precisely when the assumption and uniform component guarantees hold, with no load-bearing self-citations or ansatzes smuggled in. This is a standard non-circular case where the derivation is self-contained against the stated premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption factorized-conditionals assumption
Reference graph
Works this paper leans on
-
[2]
Program Synthesis with Large Language Models
URLhttps://arxiv.org/abs/2108.07732. Joshua Belofsky. Token-level adaptation of lora adapters for downstream task generalization. In Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference, pages 168–172,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Evaluating Large Language Models Trained on Code
URLhttps://arxiv.org/abs/2107.03374. Alexandra Chronopoulou, Matthew E Peters, Alexander Fraser, and Jesse Dodge. Adaptersoup: Weight averaging to improve generalization of pretrained language models. InFindings of the Association for Computational Linguistics: EACL 2023, pages 2054–2063,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
- [6]
-
[7]
URLhttps://zenodo.org/records/12608602. 10 Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà. Task singular vectors: Reducing task interference in model merging. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 18695–18705. IEEE, June
-
[8]
Showui: One vision-language- action model for GUI visual agent
doi: 10.1109/cvpr52734.2025.01742. URL http://dx. doi.org/10.1109/CVPR52734.2025.01742. Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, and Dacheng Tao. Merging experts into one: Improving computational efficiency of mixture of experts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14685–14691,
-
[10]
Measuring Mathematical Problem Solving With the MATH Dataset
URL https://arxiv.org/abs/ 2103.03874. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
doi: 10.18653/v1/2023.acl-long.687
Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URLhttps://aclanthology.org/2023.acl-long.687/. Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, an...
-
[12]
doi: 10.18653/v1/2021.acl-long.522
Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URLhttps://aclanthology.org/2021.acl-long.522/. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. InThirty-seventh Conference on Neural Information ...
-
[13]
Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,
George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,
-
[14]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clément Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...
2020
-
[15]
emnlp-demos.6/
URL https://aclanthology.org/2020. emnlp-demos.6/. Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational c...
2020
-
[16]
, qk, qb] = (qb)Mb kY i=1 (qi)Mi
Therefore, Theorem 1 applied in the transformed feature space gives C[q1, . . . , qk, qb] = (qb)Mb kY i=1 (qi)Mi . By Lemma 1, C[q1, . . . , qk, qb] =C[D#p 1, . . . ,D#pk,D#p b] =D#C[p 1, . . . , pk, pb]. Sinceˆp=C[p 1, . . . , pk, pb], we obtain D#L ˆp= (D#Lpb)Mb kY i=1 (D#Lpi)Mi . This is exactly the desired feature-space factorization. Finally, since t...
2019
-
[17]
The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping
to produce a merged model ˆp. The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping. During generation it stores separate key-value caches for the base, math, and coding models, computes log ˆpt = logp math,t + logp coding,t −logp base,t, renormalizes with a softmax, and decodes greedily. We use MAX_NEW=512, SAM...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.