Compositional Generalization in Autoregressive Models via Logit Composition

Aakash Kumar; Emanuele Natale; Maria Sofia Bucarelli

arxiv: 2605.28304 · v1 · pith:PFHXQCNEnew · submitted 2026-05-27 · 💻 cs.LG

Compositional Generalization in Autoregressive Models via Logit Composition

Aakash Kumar , Maria Sofia Bucarelli , Emanuele Natale This is my paper

Pith reviewed 2026-06-29 14:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords compositional generalizationautoregressive modelslogit compositionprojective compositionmodel merginglength generalization

0 comments

The pith

Autoregressive models composed via logits are projective under a factorized-conditionals assumption, so each component keeps independent control over its output subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a logit-based composition method for autoregressive models that draws from techniques used in diffusion models. Under the factorized-conditionals assumption, the composition is projective, meaning each model controls only its assigned part of the output distribution and does not interfere with the others. The projective property survives smooth reparameterizations of the output space, which yields a feature-space theorem. The same conditions also let the composed model keep its length-generalization behavior when the assumptions hold at the target length. These results identify when merging succeeds and when interactions stay stable.

Core claim

Under a factorized-conditionals assumption, logit composition of autoregressive models produces a projective result in which each component model preserves control over its own designated subspace of the output distribution and avoids interference from the others. The projective property is invariant under smooth reparameterizations of the output space. Composition also preserves length-generalizing behavior whenever the factorization assumptions and component guarantees hold uniformly at the target length.

What carries the argument

Logit composition under the factorized-conditionals assumption, which produces the projective property in the joint output distribution.

If this is right

Merging succeeds without interference precisely when the factorized-conditionals assumption is satisfied.
The feature-space theorem extends the same projective guarantee to transformed output representations.
Length generalization carries over to the merged model when the assumptions hold uniformly at longer sequences.
Stable interactions between merged models are assured only under the stated factorization conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support reliable combination of skills learned on separate tasks without retraining.
Testing on large language models would reveal how closely real conditional distributions match the factorization needed.
Design choices that encourage factorized conditionals might make future models easier to compose by default.

Load-bearing premise

The factorized-conditionals assumption must hold for the projective property to be guaranteed.

What would settle it

A direct check whether the composed distribution shows interference between subspaces precisely when the conditionals fail to factorize.

Figures

Figures reproduced from arXiv: 2605.28304 by Aakash Kumar, Emanuele Natale, Maria Sofia Bucarelli.

**Figure 2.** Figure 2: Illustration of two specialized models p1 and p2 satisfying factorized conditionals (Definition 1) and their composition pˆ under Equation 2. Each column represents a token coordinate in the generated sequence. Blue cells mark the active mask M1 where p1 differs from the background model pb, green cells mark the active mask M2 where p2 differs from pb, and red cells indicate coordinates on which the corre… view at source ↗

read the original abstract

Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a logit composition operator for autoregressive models that is projective under a factorized-conditionals assumption, plus extensions to features and length generalization.

read the letter

The core contribution is a composition rule that combines autoregressive models by operating on their logits. Under the factorized-conditionals assumption, the result stays projective: each original model keeps exclusive influence over its assigned part of the output distribution. The authors also prove this property survives smooth reparameterizations and that length-generalizing behavior carries over when the assumption and component guarantees hold at the target length.

These pieces are new relative to the diffusion-model literature they cite. The projectivity theorem and the length-generalization preservation result do not appear in the referenced prior work, and the derivations follow directly from the stated assumption without circularity.

The main limitation is the assumption itself. The paper conditions every claim on factorized conditionals plus uniform guarantees at target length, but it does not show how often this holds for models trained on real data or provide counter-examples where the assumption breaks. Without that, the theorems remain conditional and their scope for practical model merging stays unclear.

The work is aimed at researchers studying model composition and merging in language models. Readers who want formal statements about when composition avoids interference will find the theorems useful. The paper is coherent on its own terms and the math is presented cleanly, so it merits sending to referees even if the assumption requires more empirical grounding in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces logit composition for autoregressive models. Under a factorized-conditionals assumption, it claims the resulting composition is projective (each component preserves control over its designated subspace of the output distribution without interference). This property is preserved under smooth reparameterizations, yielding a feature-space theorem. Composition also preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length.

Significance. If the derivations are rigorous, the results supply a principled account of when and why composition/merging succeeds in autoregressive systems, identifying explicit conditions for non-interfering interactions and stable length generalization. This could inform practical model merging in LLMs.

major comments (2)

[Abstract / composition analysis] The factorized-conditionals assumption is invoked at the outset of the composition analysis and is required for the projective property; however, the manuscript provides no explicit derivation or proof that the stated logit composition yields projectivity under this assumption, which is load-bearing for all three central claims (projectivity, feature-space theorem, length generalization).
[feature-space theorem / length generalization section] The feature-space theorem and length-generalization result are both conditioned on the assumption holding uniformly at target length plus uniform component guarantees, but no verification of the assumption's scope for trained models or counter-example analysis is supplied, leaving the practical applicability of the theorems unassessed.

minor comments (2)

Notation for the composition operator and the precise definition of 'subspace of the output distribution' should be introduced with an equation or diagram for clarity.
[Introduction] The abstract states results hold 'when the factorization assumptions and component guarantees hold uniformly,' but the introduction could more explicitly contrast this conditional framing with unconditional claims sometimes made in the model-merging literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract / composition analysis] The factorized-conditionals assumption is invoked at the outset of the composition analysis and is required for the projective property; however, the manuscript provides no explicit derivation or proof that the stated logit composition yields projectivity under this assumption, which is load-bearing for all three central claims (projectivity, feature-space theorem, length generalization).

Authors: We agree that an explicit derivation would strengthen the presentation. The manuscript states the result under the assumption, but the step-by-step proof from the logit composition definition to projectivity can be expanded. In the revision, we will insert a dedicated subsection with the full derivation, demonstrating how the factorized conditionals ensure no interference in the designated subspaces. revision: yes
Referee: [feature-space theorem / length generalization section] The feature-space theorem and length-generalization result are both conditioned on the assumption holding uniformly at target length plus uniform component guarantees, but no verification of the assumption's scope for trained models or counter-example analysis is supplied, leaving the practical applicability of the theorems unassessed.

Authors: The results are theoretical and conditional on the stated assumptions. To address applicability, we will add a new subsection discussing the scope of the factorized-conditionals assumption in trained autoregressive models, including potential counter-examples where uniformity fails at longer lengths, and how this impacts the theorems' practical use. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation follows from explicit assumption

full rationale

The paper's central results (projective composition, feature-space theorem, length generalization) are all explicitly conditioned on the factorized-conditionals assumption stated at the outset of the analysis. No step reduces a claimed prediction or theorem to a fitted parameter, self-definition, or self-citation chain; the projective property is derived directly from the assumption rather than being presupposed by the result itself. The abstract and claims frame the findings as holding precisely when the assumption and uniform component guarantees hold, with no load-bearing self-citations or ansatzes smuggled in. This is a standard non-circular case where the derivation is self-contained against the stated premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on one domain assumption; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption factorized-conditionals assumption
Invoked to establish that the composition is projective and that each model controls its own subspace.

pith-pipeline@v0.9.1-grok · 5659 in / 1118 out tokens · 33747 ms · 2026-06-29T14:04:52.801329+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 10 canonical work pages · 4 internal anchors

[2]

Program Synthesis with Large Language Models

URLhttps://arxiv.org/abs/2108.07732. Joshua Belofsky. Token-level adaptation of lora adapters for downstream task generalization. In Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference, pages 168–172,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Evaluating Large Language Models Trained on Code

URLhttps://arxiv.org/abs/2107.03374. Alexandra Chronopoulou, Matthew E Peters, Alexander Fraser, and Jesse Dodge. Adaptersoup: Weight averaging to improve generalization of pretrained language models. InFindings of the Association for Computational Linguistics: EACL 2023, pages 2054–2063,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

URL https://arxiv.org/abs/2110. 14168. Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, and Emanuele Rodolà. MASS: Merging through adaptive subspace selection.arXiv preprint arXiv:2504.05342,

work page arXiv
[7]

10 Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà

URLhttps://zenodo.org/records/12608602. 10 Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà. Task singular vectors: Reducing task interference in model merging. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 18695–18705. IEEE, June

work page arXiv
[8]

Showui: One vision-language- action model for GUI visual agent

doi: 10.1109/cvpr52734.2025.01742. URL http://dx. doi.org/10.1109/CVPR52734.2025.01742. Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, and Dacheng Tao. Merging experts into one: Improving computational efficiency of mixture of experts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14685–14691,

work page doi:10.1109/cvpr52734.2025.01742 2025
[10]

Measuring Mathematical Problem Solving With the MATH Dataset

URL https://arxiv.org/abs/ 2103.03874. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

doi: 10.18653/v1/2023.acl-long.687

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URLhttps://aclanthology.org/2023.acl-long.687/. Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, an...

work page doi:10.18653/v1/2023.acl-long.687 2023
[12]

doi: 10.18653/v1/2021.acl-long.522

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URLhttps://aclanthology.org/2021.acl-long.522/. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. InThirty-seventh Conference on Neural Information ...

work page doi:10.18653/v1/2021.acl-long.522 2021
[13]

Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,

George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,

work page arXiv
[14]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clément Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

2020
[15]

emnlp-demos.6/

URL https://aclanthology.org/2020. emnlp-demos.6/. Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational c...

2020
[16]

, qk, qb] = (qb)Mb kY i=1 (qi)Mi

Therefore, Theorem 1 applied in the transformed feature space gives C[q1, . . . , qk, qb] = (qb)Mb kY i=1 (qi)Mi . By Lemma 1, C[q1, . . . , qk, qb] =C[D#p 1, . . . ,D#pk,D#p b] =D#C[p 1, . . . , pk, pb]. Sinceˆp=C[p 1, . . . , pk, pb], we obtain D#L ˆp= (D#Lpb)Mb kY i=1 (D#Lpi)Mi . This is exactly the desired feature-space factorization. Finally, since t...

2019
[17]

The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping

to produce a merged model ˆp. The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping. During generation it stores separate key-value caches for the base, math, and coding models, computes log ˆpt = logp math,t + logp coding,t −logp base,t, renormalizes with a softmax, and decodes greedily. We use MAX_NEW=512, SAM...

2024

[1] [2]

Program Synthesis with Large Language Models

URLhttps://arxiv.org/abs/2108.07732. Joshua Belofsky. Token-level adaptation of lora adapters for downstream task generalization. In Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference, pages 168–172,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [4]

Evaluating Large Language Models Trained on Code

URLhttps://arxiv.org/abs/2107.03374. Alexandra Chronopoulou, Matthew E Peters, Alexander Fraser, and Jesse Dodge. Adaptersoup: Weight averaging to improve generalization of pretrained language models. InFindings of the Association for Computational Linguistics: EACL 2023, pages 2054–2063,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [6]

URL https://arxiv.org/abs/2110. 14168. Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, and Emanuele Rodolà. MASS: Merging through adaptive subspace selection.arXiv preprint arXiv:2504.05342,

work page arXiv

[5] [7]

10 Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà

URLhttps://zenodo.org/records/12608602. 10 Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodolà. Task singular vectors: Reducing task interference in model merging. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 18695–18705. IEEE, June

work page arXiv

[6] [8]

Showui: One vision-language- action model for GUI visual agent

doi: 10.1109/cvpr52734.2025.01742. URL http://dx. doi.org/10.1109/CVPR52734.2025.01742. Shwai He, Run-Ze Fan, Liang Ding, Li Shen, Tianyi Zhou, and Dacheng Tao. Merging experts into one: Improving computational efficiency of mixture of experts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14685–14691,

work page doi:10.1109/cvpr52734.2025.01742 2025

[7] [10]

Measuring Mathematical Problem Solving With the MATH Dataset

URL https://arxiv.org/abs/ 2103.03874. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [11]

doi: 10.18653/v1/2023.acl-long.687

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URLhttps://aclanthology.org/2023.acl-long.687/. Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, an...

work page doi:10.18653/v1/2023.acl-long.687 2023

[9] [12]

doi: 10.18653/v1/2021.acl-long.522

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URLhttps://aclanthology.org/2021.acl-long.522/. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. InThirty-seventh Conference on Neural Information ...

work page doi:10.18653/v1/2021.acl-long.522 2021

[10] [13]

Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,

George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735,

work page arXiv

[11] [14]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clément Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

2020

[12] [15]

emnlp-demos.6/

URL https://aclanthology.org/2020. emnlp-demos.6/. Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational c...

2020

[13] [16]

, qk, qb] = (qb)Mb kY i=1 (qi)Mi

Therefore, Theorem 1 applied in the transformed feature space gives C[q1, . . . , qk, qb] = (qb)Mb kY i=1 (qi)Mi . By Lemma 1, C[q1, . . . , qk, qb] =C[D#p 1, . . . ,D#pk,D#p b] =D#C[p 1, . . . , pk, pb]. Sinceˆp=C[p 1, . . . , pk, pb], we obtain D#L ˆp= (D#Lpb)Mb kY i=1 (D#Lpi)Mi . This is exactly the desired feature-space factorization. Finally, since t...

2019

[14] [17]

The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping

to produce a merged model ˆp. The implementation asserts that all tokenizers have the same vocabulary size and token-id mapping. During generation it stores separate key-value caches for the base, math, and coding models, computes log ˆpt = logp math,t + logp coding,t −logp base,t, renormalizes with a softmax, and decodes greedily. We use MAX_NEW=512, SAM...

2024