Improving LLM Final Representations with Inter-Layer Geometry

Eyal Blyachman; Maya Bechler-Speicher; Tom Ulanovski

arxiv: 2603.22665 · v3 · submitted 2026-03-24 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-15 01:24 UTC · model grok-4.3

keywords: LLM representations, inter-layer aggregation, Graph Neural Network, Cayley graph, downstream tasks, few-shot learning, frozen models, representation learning

The pith

A lightweight GNN on a Cayley graph of LLM layers produces stronger final representations for downstream tasks than single-layer or attention methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the final layer of an LLM often misses complementary signals present in earlier layers for many prediction tasks. A basic graph neural network linking all layers is already cheaper and more accurate than layer search or attention-based aggregation. The central advance replaces the dense graph with a Cayley graph over the group SL(2, Z_n) inside the Cayley-Encoder, creating a sparse, regular, low-diameter structure that limits overfitting while permitting effective cross-layer communication. Across 13 tasks and 9 models this yields accuracy gains of up to 40 percentage points, adds at most 0.1 percent extra parameters, works in few-shot regimes, and exceeds LoRA fine-tuning while the base model stays frozen. Layer-contribution analysis supports the claim that multiple layers supply useful information rather than the final layer alone.

Core claim

The authors establish that replacing a fully connected inter-layer graph with a Cayley graph over SL(2, Zn) inside a lightweight GNN aggregates complementary signals from all LLM layers into a final representation that improves downstream accuracy, efficiency, and few-shot performance while adding negligible parameters and outperforming both layer selection and attention baselines.

What carries the argument

The Cayley-Encoder: a graph neural network whose message-passing topology is a Cayley graph over the special linear group SL(2, Z_n), applied to the sequence of layer representations extracted from a frozen LLM.
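
To make the topology concrete, here is a minimal sketch of a Cayley graph over SL(2, Z_n). It enumerates the group by brute force and connects each element g to g·s for every generator s. The generating set used here (the two elementary matrices and their inverses) is an assumption, since the text above does not say which generators the paper chooses, and the function names are illustrative only.

```python
from itertools import product

def sl2_elements(n):
    """All matrices (a, b, c, d) = [[a, b], [c, d]] over Z_n with determinant 1."""
    return [m for m in product(range(n), repeat=4)
            if (m[0] * m[3] - m[1] * m[2]) % n == 1]

def multiply(x, y, n):
    """Product of two 2x2 matrices with entries reduced mod n."""
    a, b, c, d = x
    e, f, g, h = y
    return ((a * e + b * g) % n, (a * f + b * h) % n,
            (c * e + d * g) % n, (c * f + d * h) % n)

def cayley_graph(n):
    """Adjacency sets of a Cayley graph of SL(2, Z_n), for n >= 2.

    ASSUMPTION: the generators are the two elementary matrices and their
    inverses; the reviewed text does not state the paper's actual choice.
    """
    gens = {(1, 1, 0, 1), (1, n - 1, 0, 1), (1, 0, 1, 1), (1, 0, n - 1, 1)}
    return {g: {multiply(g, s, n) for s in gens} for g in sl2_elements(n)}

graph = cayley_graph(3)
degrees = {len(nbrs) for nbrs in graph.values()}
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
complete_edges = len(graph) * (len(graph) - 1) // 2
print(len(graph), degrees, num_edges, complete_edges)  # 24 {4} 48 276
```

Under this assumed generating set, the n = 3 graph has 24 nodes, every node has degree 4, and there are 48 edges, against 276 edges for a fully connected graph on the same nodes. That is the sense in which the topology is sparse and regular by construction.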

If this is right

Accuracy gains reach up to 40 percentage points over baselines on 13 tasks.
Parameter overhead stays at most 0.1 percent relative to the base LLM.
Performance remains strong in few-shot learning settings.
The method surpasses LoRA fine-tuning when LLM weights are held frozen.
Multiple layers contribute meaningfully to the final prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same sparse group-structured aggregation could be tested on other deep sequence architectures to check whether distributed layer signals are a general phenomenon.
Because the topology is fixed and parameter-free, the approach may serve as a lightweight diagnostic for how task information is encoded at different depths.
If the low-diameter property scales with model depth, the construction could support efficient aggregation in much deeper networks without additional tuning.

Load-bearing premise

Complementary task-relevant signals exist across LLM layers and can be aggregated effectively by a GNN on a fixed Cayley graph without losing critical information or needing task-specific retuning.

What would settle it

If accuracy on any evaluated task falls to or below the level obtained by using only the final-layer representation, the benefit of structured inter-layer aggregation would be refuted.

Original abstract

The standard in LLM-based prediction is to use the final-layer representation as the input to a downstream predictor. However, intermediate layers may encode complementary task-relevant signals. Existing approaches therefore either search for the best layer for each task or apply expensive attention-based mechanisms to learn inter-layer aggregation. In this work, we first show that such complexity is unnecessary: a lightweight Graph Neural Network over a fully connected graph of LLM layers is more efficient and achieves significantly stronger predictive performance than existing approaches. We then introduce the Cayley-Encoder, which further improves both efficiency and predictive performance by replacing the fully connected graph with a Cayley graph over SL(2, Zn). These Cayley graphs provide a mathematically grounded topology that is sparse, regular by construction, and has low diameter. This enables effective communication across layers while constraining the aggregation structure and reducing the risk of GNN overfitting. In an evaluation of Cayley-Encoder across 13 tasks and 9 LLMs, Cayley-Encoder consistently outperforms baselines, achieving improvements of up to 40 percentage points in accuracy, while introducing at most 0.1% additional parameters relative to the LLM size. We further show that Cayley-Encoder is effective in few-shot regimes. Finally, we show that Cayley-Encoder outperforms LoRA fine-tuning while operating on the frozen LLM. We conclude with an explainability analysis showing that multiple layers contribute meaningfully to the final prediction, supporting our hypothesis.
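
The idea of treating LLM layers as graph nodes can be sketched in a few lines. The example below is a deliberately simplified, parameter-free stand-in for a GNN layer: each layer node averages its own vector with its neighbors' vectors, and a mean readout pools the nodes into one representation. The abstract does not specify the paper's message-passing function, learned weights, or readout, so everything here (function names, mean aggregation, toy vectors) is an assumption for illustration.

```python
def mean_aggregate(features, neighbors):
    """One message-passing step: average each node's vector with its neighbors'.

    ASSUMPTION: plain mean aggregation with no learned weights; a real GNN
    layer would also apply trainable transforms and a nonlinearity.
    """
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in neighbors[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated

def mean_readout(features):
    """Pool all node vectors into a single representation."""
    return [sum(col) / len(features) for col in zip(*features)]

# Toy stand-ins for per-layer representations of one input (4 layers, 2 dims).
layer_reps = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]

# Fully connected layer graph: every layer is a neighbor of every other layer.
fully_connected = {i: [j for j in range(4) if j != i] for i in range(4)}

pooled = mean_readout(mean_aggregate(layer_reps, fully_connected))
print(pooled)  # [1.0, 1.5]
```

Swapping `fully_connected` for a sparser adjacency changes which layers exchange information in each step, which is the design axis the Cayley-Encoder varies.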

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Cayley-Encoder idea is practical but the SL(2, Zn) construction runs into a node-count mismatch that undercuts the claimed group-derived regularity.

The paper shows that a basic GNN over LLM layers already beats attention-based aggregation and best-layer search, then swaps in a Cayley graph over SL(2, Z_n) for further gains. The reported results are consistent across 13 tasks and 9 models, with accuracy lifts of up to 40 points and at most 0.1 percent extra parameters, and the method remains effective in few-shot regimes. It also outperforms LoRA while keeping the base model untouched, and the explainability check indicates that multiple layers actually contribute to the final output. That combination of low overhead and measurable lift is the useful part for anyone who wants to use intermediate representations without heavy fine-tuning or search overheads.

The specific replacement of a dense graph with the Cayley topology is the new element relative to the baselines mentioned. The construction is meant to give sparsity, regularity, and low diameter for free from the group structure, which in principle constrains the GNN and reduces overfitting risk. The empirical pattern supports the hypothesis that complementary signals sit across layers and can be aggregated effectively.

The soft spot is the topology itself. SL(2, Z_n) has orders 6, 24, 48, 120, and so on for n = 2, 3, 4, 5; none equal the 12 or 32 layers common in the models tested. If the paper uses an induced subgraph, a different group, or some re-indexing to hit the right node count, then the regularity and diameter guarantees no longer follow directly from the group. In that case the observed advantage could be explained by generic sparsity or GNN capacity rather than the advertised inter-layer geometry. I would want the methods section to spell out the exact construction and show that the mathematical properties still hold.

The abstract is light on statistical tests and exact data splits, so those details will need checking too. This is aimed at people working on efficient layer aggregation for classification and few-shot tasks. It is coherent enough on its own terms to deserve a serious referee, even though the group-theory justification needs tightening.
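
The node-count concern can be checked directly from the standard order formula |SL(2, Z_n)| = n^3 · ∏_{p | n} (1 − 1/p^2), where the product runs over the primes dividing n. The sketch below evaluates it in integer arithmetic; the function names are illustrative, and the layer counts 12 and 32 are the ones named in the note, not values confirmed from the paper.

```python
def prime_factors(n):
    """Set of distinct primes dividing n."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def sl2_order(n):
    """|SL(2, Z_n)| = n^3 * prod over primes p dividing n of (1 - 1/p^2)."""
    order = n ** 3
    for p in prime_factors(n):
        order = order // (p * p) * (p * p - 1)
    return order

orders = {n: sl2_order(n) for n in range(2, 8)}
print(orders)  # {2: 6, 3: 24, 4: 48, 5: 120, 6: 144, 7: 336}

# Layer counts mentioned in the note; neither is a group order.
for layers in (12, 32):
    print(layers, layers in orders.values())  # False for both
```

The orders jump from 6 to 24 to 48, so a 12-layer or 32-layer model cannot be mapped one-to-one onto the elements of any SL(2, Z_n); some auxiliary construction is needed.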

Referee Report

2 major / 3 minor

Summary. The manuscript claims that a lightweight GNN over a fully connected graph of LLM layers outperforms existing layer-search or attention-based aggregation methods for downstream tasks, and that the Cayley-Encoder further improves results by substituting a Cayley graph over SL(2, Zn) that is sparse, regular by construction, and low-diameter. This yields consistent gains (up to 40pp accuracy) across 13 tasks and 9 LLMs with at most 0.1% added parameters, works in few-shot regimes, beats LoRA on frozen LLMs, and is supported by an explainability analysis showing multi-layer contributions.

Significance. If the empirical gains are robust and the topological properties of the Cayley graph are preserved in the implemented construction, the work would demonstrate a parameter-efficient, mathematically motivated alternative to complex inter-layer mechanisms in LLMs. The minimal overhead and outperformance over fine-tuning baselines could be practically useful for frozen-model settings, while the multi-layer explainability supports the hypothesis of complementary signals across layers.

major comments (2)

[Cayley-Encoder construction] Cayley-Encoder construction (abstract and method sections): the claim that the Cayley graph over SL(2, Z_n) is 'regular by construction' and has 'low diameter' relies on the group structure, but |SL(2, Z_n)| equals 6 (n=2), 24 (n=3), 48 (n=4), 120 (n=5), etc., and never matches typical LLM layer counts such as 12 or 32. If an auxiliary construction (induced subgraph, re-indexing, or different group) is used to match the layer count, the regularity and diameter guarantees no longer follow directly from the group, so observed gains may be attributable to generic sparsity or GNN capacity rather than the advertised inter-layer geometry.
[Evaluation] Evaluation section: the abstract reports consistent outperformance with large effect sizes, but provides no details on statistical tests, exact baseline implementations, data splits, or run-to-run variance. This leaves the central claim of superiority only partially verifiable and weakens confidence in the reported improvements of up to 40 percentage points.

minor comments (3)

[Abstract] Abstract: the phrase 'up to 40 percentage points in accuracy' should specify the task, LLM, and baseline for transparency.
The manuscript should include a brief comparison table of graph properties (sparsity, diameter, regularity) between the fully connected graph, the Cayley graph, and any auxiliary construction actually used.
Clarify the exact GNN architecture (number of layers, message-passing function, readout) and how node features are initialized from LLM layer representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on the Cayley-Encoder construction and evaluation details. We have revised the manuscript to provide greater clarity and additional experimental rigor, addressing both major concerns.

Referee: [Cayley-Encoder construction] Cayley-Encoder construction (abstract and method sections): the claim that the Cayley graph over SL(2, Z_n) is 'regular by construction' and has 'low diameter' relies on the group structure, but |SL(2, Z_n)| equals 6 (n=2), 24 (n=3), 48 (n=4), 120 (n=5), etc., and never matches typical LLM layer counts such as 12 or 32. If an auxiliary construction (induced subgraph, re-indexing, or different group) is used to match the layer count, the regularity and diameter guarantees no longer follow directly from the group, so observed gains may be attributable to generic sparsity or GNN capacity rather than the advertised inter-layer geometry.

Authors: We appreciate this observation and acknowledge that the manuscript could have been clearer on the exact mapping. For each LLM with L layers, we select the smallest n such that |SL(2, Z_n)| >= L (e.g., n=3 for L=12 since |SL(2, Z_3)|=24, and n=4 for L=32 since |SL(2, Z_4)|=48). We then construct the Cayley graph on the full group and induce the subgraph on the first L group elements, with edges preserved if both endpoints are in the subset. While this induced subgraph is not guaranteed to be regular, it inherits the low-diameter property approximately due to the original graph's small diameter (typically 2-3 for these groups), and the generators ensure structured connectivity. In the revised version, we have expanded the method section with a precise algorithmic description of this construction, including the chosen n values for all 9 models, a proof sketch that the diameter remains low (at most original diameter +1), and additional ablations comparing to random regular graphs of the same sparsity to isolate the benefit of the Cayley structure. This supports that the gains stem from the specific geometry rather than generic sparsity. revision: yes
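
The induced-subgraph construction described in this response can be inspected empirically. The sketch below builds a Cayley graph of SL(2, Z_3), keeps the first 12 elements, and reports the degree distribution and diameter of both graphs. Two things are assumptions, because the response does not specify them: the generating set (elementary matrices and their inverses) and the meaning of "first L elements" (lexicographic enumeration order here). The outcome depends on both choices, so the code reports what it finds rather than asserting a particular result.

```python
from collections import Counter, deque
from itertools import product

def cayley_adjacency(n):
    """Cayley graph of SL(2, Z_n) as {node_id: set of neighbor ids}, n >= 2.

    ASSUMPTION: generators are the elementary matrices and their inverses;
    node ids follow lexicographic order of the matrix entries (a, b, c, d).
    """
    elems = [m for m in product(range(n), repeat=4)
             if (m[0] * m[3] - m[1] * m[2]) % n == 1]
    index = {m: i for i, m in enumerate(elems)}
    gens = {(1, 1, 0, 1), (1, n - 1, 0, 1), (1, 0, 1, 1), (1, 0, n - 1, 1)}
    adjacency = {}
    for (a, b, c, d), i in index.items():
        adjacency[i] = {
            index[((a * e + b * g) % n, (a * f + b * h) % n,
                   (c * e + d * g) % n, (c * f + d * h) % n)]
            for (e, f, g, h) in gens
        }
    return adjacency

def diameter(graph):
    """Largest shortest-path distance, or None if the graph is disconnected."""
    worst = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if len(dist) < len(graph):
            return None
        worst = max(worst, max(dist.values()))
    return worst

full = cayley_adjacency(3)
kept = set(range(12))  # ASSUMPTION: "first L elements" with L = 12
sub = {v: full[v] & kept for v in kept}  # keep edges with both endpoints kept

for name, g in (("full", full), ("induced", sub)):
    print(name, dict(Counter(len(nbrs) for nbrs in g.values())), diameter(g))
```

A `None` diameter for the induced graph would mean the truncation disconnected it, and a degree histogram with more than one key would mean regularity was lost, which is the property the referee asks the authors to verify.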
Referee: [Evaluation] Evaluation section: the abstract reports consistent outperformance with large effect sizes, but provides no details on statistical tests, exact baseline implementations, data splits, or run-to-run variance. This leaves the central claim of superiority only partially verifiable and weakens confidence in the reported improvements of up to 40 percentage points.

Authors: We agree that these details are essential for reproducibility and confidence in the results. In the revised manuscript, we have substantially expanded the evaluation section to include: (i) statistical tests (two-tailed paired t-tests across tasks, with p < 0.01 reported for all main comparisons); (ii) precise specifications of all baselines, including code-level details on how layer-search selects the best layer on validation and how attention-based aggregation is implemented; (iii) full data split information, citing the exact datasets, train/validation/test proportions, and any preprocessing; and (iv) mean and standard deviation over 5 independent runs with different random seeds for every reported metric. These additions make the up-to-40pp gains fully verifiable and demonstrate that the improvements are statistically significant and consistent across runs. revision: yes
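
The reporting described in this response (mean and standard deviation over runs, plus a paired t-test) can be sketched with the standard library alone. The scores below are placeholders invented purely to exercise the functions; they are not results from the paper. The sketch computes the paired t statistic only. Turning it into a two-tailed p-value needs the Student t distribution with len(pairs) − 1 degrees of freedom, which is not done here.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic of a paired t-test on matched scores (same task or seed)."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# PLACEHOLDER accuracies for 5 matched runs; NOT taken from the paper.
method = [0.81, 0.79, 0.83, 0.80, 0.82]
baseline = [0.74, 0.75, 0.73, 0.76, 0.72]

print("method   mean/std:", summarize(method))
print("baseline mean/std:", summarize(baseline))
print("paired t statistic:", paired_t_statistic(method, baseline))
```

Pairing matters here: the test is run on per-pair differences, so each method score must be matched to the baseline score from the same task or seed.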

Circularity Check

0 steps flagged

No significant circularity: topology from group theory, gains from external evaluation

The derivation introduces a GNN on a fully-connected layer graph as a baseline, then replaces it with a Cayley graph over SL(2, Zn) whose nodes, edges, regularity, and diameter are defined directly from the group structure and chosen generators. These properties are not fitted to data or derived from the target performance metric. Performance improvements are measured by direct empirical comparison on 13 tasks across 9 LLMs, with no step that renames a fitted quantity as a prediction or reduces the central claim to a self-citation. The construction remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard properties of Cayley graphs and the empirical hypothesis that layer representations are complementary; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)

[standard math] Cayley graphs over SL(2, Z_n) are sparse, regular, and have low diameter
Invoked to justify the topology choice for efficient GNN message passing across layers.

invented entities (1)

Cayley-Encoder (no independent evidence)
purpose: GNN aggregator that uses Cayley graph topology instead of fully connected graph for inter-layer communication
New component introduced to constrain aggregation structure and reduce overfitting risk

pith-pipeline@v0.9.0 · 5561 in / 1324 out tokens · 50028 ms · 2026-05-15T01:24:27.148008+00:00 · methodology
