Grokking Finite-Dimensional Algebra

Guillaume Dumas; Guillaume Rabusseau; Pascal Jr Tikeng Notsawo

arxiv: 2602.19533 · v2 · pith:T4HD7RGAnew · submitted 2026-02-23 · cs.LG · cs.AI · math.RA

Pith reviewed 2026-05-15 20:19 UTC · model grok-4.3

classification: cs.LG · cs.AI · math.RA

keywords: grokking, finite-dimensional algebras, structure tensor, bilinear product, neural network generalization, algebraic structures, finite fields

The pith

Neural networks grok algebra multiplication once they recover the bilinear product from the structure tensor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training a neural network to multiply elements of a finite-dimensional algebra reduces to learning the bilinear map encoded by the algebra's structure tensor. This setup extends earlier grokking results on groups to non-associative, non-commutative, and non-unital algebras. When the algebra is defined over a finite field, the model must discover discrete representations of the elements, producing the sudden shift from memorization to generalization. Properties of the algebra, such as commutativity and associativity, and properties of the tensor, such as sparsity and rank, control both whether and when this transition occurs. Successful generalization coincides with the emergence of latent embeddings that align with the algebra's own representation.
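The reduction to a bilinear map can be made concrete with a small sketch. It assumes a basis e_0, ..., e_{n-1} and the index convention e_i e_j = Σ_k C[i, j, k] e_k. The helper name `multiply` and the choice of the complex numbers as a two-dimensional real algebra are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def multiply(C, a, b):
    """Bilinear product: (a * b)_k = sum_{i,j} a_i b_j C[i, j, k]."""
    return np.einsum("i,j,ijk->k", a, b, C)

# Illustrative example (an assumption, not from the paper): the complex
# numbers as a 2-dimensional real algebra with basis e_0 = 1, e_1 = i.
C = np.zeros((2, 2, 2))
C[0, 0, 0] = 1.0   # 1 * 1 = 1
C[0, 1, 1] = 1.0   # 1 * i = i
C[1, 0, 1] = 1.0   # i * 1 = i
C[1, 1, 0] = -1.0  # i * i = -1

a = np.array([1.0, 2.0])  # 1 + 2i
b = np.array([3.0, 4.0])  # 3 + 4i
product = multiply(C, a, b)  # coefficients of (1 + 2i)(3 + 4i) = -5 + 10i
```

Once the tensor `C` is fixed, every product in the algebra is determined by it, which is why the learning target can be described entirely by `C`.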

Core claim

Learning multiplication in a finite-dimensional algebra amounts to learning the bilinear product specified by the algebra's structure tensor. Grokking emerges naturally as models learn discrete representations for algebras over finite fields, and learning group operations is recovered as a special case.
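For the finite-field case, the learning task can be pictured as a finite multiplication table over discrete tokens. The sketch below is a hypothetical setup, not the paper's: it takes p = 3 and the two-dimensional algebra F_3[x]/(x² + 1), enumerates its 9 elements, and builds the full table of products. The function name `multiplication_table` and the base-p token encoding are assumptions made for illustration.

```python
import itertools
import numpy as np

def multiplication_table(C, p):
    """Enumerate all elements of an n-dimensional algebra over F_p and
    return (elements, table), where table[a, b] is the token id of the
    product of elements a and b."""
    n = C.shape[0]
    # All coefficient vectors in {0, ..., p-1}^n, in lexicographic order.
    elements = np.array(list(itertools.product(range(p), repeat=n)))
    # Bilinear product of every pair, with coefficients reduced mod p.
    products = np.einsum("ai,bj,ijk->abk", elements, elements, C) % p
    # Read each coefficient vector as base-p digits to get its token id.
    weights = p ** np.arange(n - 1, -1, -1)
    return elements, products @ weights

p = 3
C = np.zeros((2, 2, 2), dtype=int)
C[0, 0, 0] = 1      # 1 * 1 = 1
C[0, 1, 1] = 1      # 1 * x = x
C[1, 0, 1] = 1      # x * 1 = x
C[1, 1, 0] = p - 1  # x * x = -1 = p - 1 (mod p)

elements, table = multiplication_table(C, p)
```

A model trained on a subset of the entries of `table` sees only token ids, so any coefficient structure has to be recovered in its internal representations.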

What carries the argument

The structure tensor of the finite-dimensional algebra, which encodes the bilinear multiplication map and controls both the learning dynamics and the emergence of generalization.

If this is right

Commutativity, associativity, and unitality alter both the timing and reliability of the grokking transition.
Higher rank or denser structure tensors slow generalization and raise the final error floor.
Latent embeddings that match the algebra's representation predict generalization success.
Matrix-factorization bias explains grokking behavior for real algebras.
Group multiplication emerges as one instance of the same bilinear learning problem.
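The algebraic properties named in this list can all be read off the structure tensor itself. The following sketch, under the convention e_i e_j = Σ_k C[i, j, k] e_k over the reals, checks commutativity, associativity, and unitality, and builds two illustrative tensors: the group algebra of the cyclic group Z_5 and R³ with the cross product. All function names are hypothetical, and the examples are chosen here for illustration rather than taken from the paper.

```python
import numpy as np

def is_commutative(C):
    # e_i e_j == e_j e_i for all basis pairs.
    return np.allclose(C, C.transpose(1, 0, 2))

def is_associative(C):
    # (e_i e_j) e_l versus e_i (e_j e_l), compared coefficient-wise.
    left = np.einsum("ijk,klm->ijlm", C, C)
    right = np.einsum("jlk,ikm->ijlm", C, C)
    return np.allclose(left, right)

def find_unit(C):
    # Least-squares candidate u with u * e_j = e_j, then verify both sides.
    n = C.shape[0]
    A = C.reshape(n, n * n).T
    u = np.linalg.lstsq(A, np.eye(n).ravel(), rcond=None)[0]
    left_ok = np.allclose(np.einsum("i,ijk->jk", u, C), np.eye(n))
    right_ok = np.allclose(np.einsum("j,ijk->ik", u, C), np.eye(n))
    return u if (left_ok and right_ok) else None

def cyclic_group_tensor(n):
    # Group algebra of Z_n: e_i e_j = e_{(i + j) mod n}.
    C = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            C[i, j, (i + j) % n] = 1.0
    return C

def cross_product_tensor():
    # R^3 with the cross product: e_i x e_j = sum_k eps_{ijk} e_k.
    C = np.zeros((3, 3, 3))
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        C[i, j, k] = 1.0
        C[j, i, k] = -1.0
    return C

group = cyclic_group_tensor(5)
cross = cross_product_tensor()
```

With one-hot vectors as group elements, the bilinear product of the Z_5 tensor reproduces the group operation, which is the sense in which group multiplication is one instance of the same problem. The cross-product tensor fails all three checks, so it is a compact example of a non-commutative, non-associative, non-unital algebra.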

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same tensor-learning lens may clarify grokking in other structured tasks such as polynomial arithmetic or Lie-algebra operations.
Architectures that explicitly parameterize bilinear maps could reduce the memorization phase by injecting the expected tensor structure.
The finite-field case suggests that discretization pressure is a generic driver of sudden generalization whenever the target function is defined over a discrete domain.

Load-bearing premise

The models are recovering the algebra multiplication by identifying the structure tensor rather than exploiting some other shortcut that happens to match the target operation.

What would settle it

A trained model that reaches perfect test accuracy on the multiplication task while its internal activations remain uncorrelated with the algebra's structure tensor or its natural element embeddings.

Original abstract

This paper investigates the grokking phenomenon, which refers to the sudden transition from a long memorization to generalization observed during neural networks training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
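One way to see why matrix language applies in the real case is that the bilinear product is linear in the outer product of the two coefficient vectors. The sketch below unfolds the tensor into an n² × n matrix M so that a * b = (a ⊗ b) M, and reads off the rank of that unfolding. This is only one possible matricization, chosen for illustration; the abstract does not say which factorization or notion of rank the paper uses, and the names `unfold` and `M` are assumptions.

```python
import numpy as np

def unfold(C):
    """Flatten C[i, j, k] into a matrix M of shape (n*n, n) with
    M[i*n + j, k] = C[i, j, k]."""
    n = C.shape[0]
    return C.reshape(n * n, n)

# Complex numbers as a 2-dimensional real algebra (illustrative choice).
C = np.zeros((2, 2, 2))
C[0, 0, 0], C[0, 1, 1], C[1, 0, 1], C[1, 1, 0] = 1.0, 1.0, 1.0, -1.0

M = unfold(C)
a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])

via_tensor = np.einsum("i,j,ijk->k", a, b, C)
via_matrix = np.kron(a, b) @ M  # same product, written as a linear map
unfolding_rank = np.linalg.matrix_rank(M)
```

Written this way, learning the product amounts to recovering a single matrix from input-output pairs, which is the kind of problem where low-rank biases are usually discussed.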

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This extends grokking analysis from groups to finite-dimensional algebras by reducing multiplication to structure-tensor learning and mapping properties like associativity and tensor rank to generalization timing, but the experiments do not isolate the tensor mechanism from possible shortcuts.

The core contribution is a clean reframing: learning multiplication in any finite-dimensional algebra reduces to learning the bilinear map given by its structure tensor, with groups as the special case. For algebras over finite fields this leads to the claim that grokking appears because the network must discover discrete algebraic representations. The experiments then vary commutativity, associativity, unitality, tensor sparsity, and rank, and report correlations with when grokking occurs plus alignment of learned embeddings with the algebra's representation theory. That produces new data points beyond the group-only literature, which is useful for the subfield even though the results can only be judged provisionally from the abstract alone. The connection to implicit low-rank matrix factorization over the reals is also a straightforward and helpful link.

The main limitation is that the reported correlations do not come with controls that preserve the multiplication table while breaking the tensor structure (or the reverse). Without those, it remains possible that the networks are latching onto some other function that agrees with the training examples rather than the bilinear product itself. The abstract poses the right questions but does not describe such isolation experiments, so the mechanistic interpretation stays suggestive rather than locked down.

This is still worth referee time for anyone working on grokking or algebraic generalization in networks. The framing is coherent, the experimental axes are well-chosen, and the work supplies a broader task family that others can build on. I would send it to review with a request for the missing controls and clearer statistical reporting on the timing measurements.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates grokking during training of neural networks to learn multiplication in finite-dimensional algebras (FDAs). It frames this task as equivalent to learning the bilinear product encoded by the algebra's structure tensor, positioning group operations as a special case. The authors connect the real-field case to implicit low-rank matrix factorization and the finite-field case to acquisition of discrete representations. They experimentally examine three questions: the effects of algebraic properties (commutativity, associativity, unitality) on grokking emergence and timing; the influence of structure-tensor properties (sparsity, rank) on generalization; and the correlation between generalization and latent embeddings aligned with the algebra's representation.

Significance. If the central claims are substantiated, the work supplies a unified framework that extends grokking analysis from groups to a wider class of algebras and ties generalization dynamics to concrete algebraic invariants. This could clarify how mathematical structure shapes neural-network behavior beyond the specific setting of modular arithmetic.

major comments (2)

[Experimental investigation of core questions (i)–(iii)] The experimental design for questions (i)–(iii) lacks control tasks that preserve the input–output multiplication table on the training set while destroying the algebraic relations encoded in the structure tensor (or vice versa). Without such isolation, it is impossible to rule out that observed grokking and embedding correlations arise from any function agreeing with the table rather than from structure-tensor learning, which is load-bearing for the abstract’s central claim.
[Finite-field experiments and discussion of discrete representations] The assertion that grokking “emerges naturally” for algebras over finite fields because models must learn discrete representations is presented without quantitative evidence (e.g., embedding alignment metrics, ablation on representation discreteness, or comparison to continuous relaxations). This leaves the finite-field mechanism under-supported relative to the paper’s framing.

minor comments (1)

The abstract states that learning FDA multiplication “amounts to” learning the structure tensor, but the precise reduction (including any implicit assumptions on basis choice or field) is not restated in the experimental sections, making it difficult to map results back to the claimed equivalence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important ways to strengthen the isolation of our central claims. We address each major comment below and will incorporate the suggested controls and quantitative analyses in the revised manuscript.

Referee: [Experimental investigation of core questions (i)–(iii)] The experimental design for questions (i)–(iii) lacks control tasks that preserve the input–output multiplication table on the training set while destroying the algebraic relations encoded in the structure tensor (or vice versa). Without such isolation, it is impossible to rule out that observed grokking and embedding correlations arise from any function agreeing with the table rather than from structure-tensor learning, which is load-bearing for the abstract’s central claim.

Authors: We agree that additional control tasks are necessary to isolate structure-tensor learning from mere table memorization. In the revision we will introduce experiments that preserve the exact input–output multiplication table on the training set while replacing the underlying structure tensor with one that agrees on those points but encodes different algebraic relations (for example, by randomizing the tensor entries outside the training support while keeping the observed products fixed). These controls will directly test whether grokking and embedding alignment depend on the specific algebraic structure rather than on any function consistent with the table.

Revision: yes

Referee: [Finite-field experiments and discussion of discrete representations] The assertion that grokking “emerges naturally” for algebras over finite fields because models must learn discrete representations is presented without quantitative evidence (e.g., embedding alignment metrics, ablation on representation discreteness, or comparison to continuous relaxations). This leaves the finite-field mechanism under-supported relative to the paper’s framing.

Authors: We acknowledge that the finite-field mechanism requires stronger quantitative backing. In the revised version we will add (i) explicit embedding alignment metrics that measure the distance of learned representations to the nearest discrete algebraic elements, (ii) ablations that vary the degree of discreteness enforced during training, and (iii) direct comparisons against continuous relaxations of the same algebras. These additions will provide measurable evidence that grokking timing correlates with the acquisition of discrete representations.

Revision: yes
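One possible reading of the "distance to the nearest discrete algebraic elements" metric is sketched below. It is a hypothetical illustration, not the authors' definition: it assumes the learned representations have already been mapped (for example by a linear probe) into the coefficient space of an n-dimensional algebra over F_p, and it measures how far each vector sits from the grid {0, ..., p-1}^n. The function name `discreteness_gap` and the example vectors are made up for this sketch.

```python
import numpy as np

def discreteness_gap(Z, p):
    """Mean Euclidean distance from each row of Z to the nearest point of
    the grid {0, ..., p-1}^n (a hypothetical metric, not the paper's)."""
    # Round each coordinate to the nearest integer, then keep it in range.
    nearest = np.clip(np.rint(Z), 0, p - 1)
    return float(np.linalg.norm(Z - nearest, axis=1).mean())

# Made-up vectors standing in for decoded representations over F_3.
Z_discrete = np.array([[0.0, 2.0], [1.0, 1.0]])
Z_diffuse = np.array([[0.5, 1.5], [2.5, 0.5]])

gap_discrete = discreteness_gap(Z_discrete, p=3)
gap_diffuse = discreteness_gap(Z_diffuse, p=3)
```

Tracking a quantity like this over training steps, alongside test accuracy, would be one way to test whether the two curves move together.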

Circularity Check

0 steps flagged

No significant circularity: the central framing is a standard algebraic definition, and the experimental questions are posed independently of it.

The paper's core statement that learning FDA multiplication amounts to learning the bilinear product from the structure tensor is a direct restatement of the definition of the structure tensor in finite-dimensional algebra, not a derived prediction or fitted claim. The experimental questions on how commutativity, associativity, sparsity, rank, and latent embeddings influence grokking are posed independently, without relying on self-citations, uniqueness theorems from the authors, or fitted inputs relabeled as predictions. No load-bearing step in the provided abstract or framing equates a result to its own inputs by construction; the derivation rests on standard definitions from finite-dimensional algebra that can be checked externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the modeling assumption that multiplication in an FDA is exactly captured by a bilinear map given by the structure tensor, plus standard neural-network training assumptions about optimization and representation learning. No new mathematical axioms are introduced beyond those of finite-dimensional algebra over R or finite fields.

axioms (1)

domain assumption: Multiplication in a finite-dimensional algebra is a bilinear operation fully specified by its structure tensor.
Invoked in the abstract when stating that learning multiplication amounts to learning the bilinear product.
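
Spelled out, the assumption says that the products of basis elements determine every other product by bilinearity. The block below writes this in the C_{ijk} index convention; the basis symbols e_i, the coefficient names a_i and b_j, and the group-element labels g_i are notation chosen here for illustration.

```latex
% Basis e_1, ..., e_n of the algebra over a field F, with C_{ijk} in F.
e_i e_j = \sum_{k=1}^{n} C_{ijk}\, e_k
\quad\Longrightarrow\quad
\Bigl(\sum_{i=1}^{n} a_i e_i\Bigr)\Bigl(\sum_{j=1}^{n} b_j e_j\Bigr)
  = \sum_{k=1}^{n} \Bigl(\sum_{i,j=1}^{n} a_i\, b_j\, C_{ijk}\Bigr) e_k .

% Group algebra of a finite group G = {g_1, ..., g_n}, with e_i = g_i:
C_{ijk} =
\begin{cases}
  1 & \text{if } g_i g_j = g_k, \\
  0 & \text{otherwise.}
\end{cases}
```

In the group case each slice C_{ij\,\cdot} is one-hot, so multiplying two basis elements returns a single basis element, which is the ordinary group operation.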

pith-pipeline@v0.9.0 · 5548 in / 1368 out tokens · 19429 ms · 2026-05-15T20:19:51.927536+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor... $R_i R_j = \sum_k C_{ijk} R_k$"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_equiv_Nat · unclear
Paper passage: "grokking emerges naturally as models must learn discrete representations of algebraic elements"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.