Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

Aaron Thomas; Laurits Fredsgaard; Mahito Sugiyama; Michael Riis Andersen; Mikkel N. Schmidt

arxiv: 2604.05613 · v1 · submitted 2026-04-07 · 💻 cs.LG

Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

Laurits Fredsgaard , Aaron Thomas , Michael Riis Andersen , Mikkel N. Schmidt , Mahito Sugiyama This is my paper

Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords autoregressive graph generationlinearization uncertaintypermutation equivalencegraph likelihood calibrationmolecular graph generationSENT encodingsexpected calibration error

0 comments

The pith

Autoregressive graph generators assign inconsistent likelihoods to the same graph under different valid orderings, revealing they learn training sequences rather than graph structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive graph generators can only produce meaningful likelihoods if those likelihoods remain consistent across all equivalent ways to order the same graph as a sequence. It defines Linearization Uncertainty as the coefficient of variation in negative log-likelihood across multiple SENT linearizations of one graph. Experiments across linearization strategies show that models trained on fixed biased orderings achieve low error only on their training order but suffer expected calibration error two orders of magnitude higher when the order is randomized, indicating the models have memorized the linearization instead of the graph. On the QM9 molecular dataset, raw NLL correlates poorly with actual stability while Linearization Uncertainty correlates strongly, suggesting permutation-based metrics are more reliable for evaluating generated graphs.

Core claim

Autoregressive graph generators assign varying negative log-likelihoods to different linearizations of the identical graph when trained on biased orderings. This variation shows the models have learned the specific training linearization rather than the underlying graph. The coefficient of variation across equivalent SENT linearizations, called Linearization Uncertainty, serves as a more reliable evaluation metric than raw likelihood, achieving an AUC of 0.85 versus 0.43 for predicting molecular stability on QM9.

What carries the argument

Linearization Uncertainty (LU), the coefficient of variation of assigned negative log-likelihood across multiple equivalent SENT linearizations of the same graph.

If this is right

Biased-order training produces models with expected calibration error two orders of magnitude higher under random permutation than under their native ordering.
Raw NLL from such models is negatively correlated with molecular stability on QM9, while Linearization Uncertainty is positively correlated and yields higher AUC.
Permutation-based evaluation supplies a more reliable quality check for generated graphs than standard likelihood.
Consistent likelihood across linearizations requires training or evaluation strategies that expose the model to multiple orderings of each graph.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training autoregressive graph generators with randomly sampled linearizations at each step could reduce Linearization Uncertainty and improve generalization to new orderings.
The same inconsistency may appear in other autoregressive models over objects with multiple sequential encodings, such as trees or sequences with equivalent reorderings.
Linearization Uncertainty could be added as a regularization term to encourage the model to produce stable likelihoods across permutations.
Inherently permutation-invariant generators might be developed by replacing sequential autoregressive construction with order-agnostic mechanisms.

Load-bearing premise

The multiple SENT linearizations are truly equivalent representations of the identical graph, so variation in assigned NLL reflects the model's failure to learn the graph structure rather than differences in the linearization process.

What would settle it

Train a model by sampling random SENT linearizations for each graph at every training step, then check whether expected calibration error under random permutations falls by at least one order of magnitude compared to biased-order training.

Figures

Figures reproduced from arXiv: 2604.05613 by Aaron Thomas, Laurits Fredsgaard, Mahito Sugiyama, Michael Riis Andersen, Mikkel N. Schmidt.

**Figure 1.** Figure 1: LU AUC vs. number of permutations K (QM9). ROC-AUC for predicting molecular stability from LU, per linearization strategy (mean ± std across 3 seeds). Even at K=2 the signal far exceeds the Generation NLL baseline (AUC = 0.43). 4 Conclusion and Discussion Because SENT encodings are permutation-equivalent, any non-zero CV of NLL across orderings is a violation of graph-consistency, not a consequence of enc… view at source ↗

**Figure 2.** Figure 2: Overfitting vs. Generative Quality in Data-Scarce Regimes. Dual-axis plot showing validation NLL/token (solid, left axis) and VUN (dashed, right axis) for Planar (N=128, seed 0). For all biased strategies, NLL rises after ∼5k steps (memorization) while VUN continues to climb, driven by Validity alone. Only Random Order achieves sustained improvement in both axes. See Appendix D.2 for the decomposition of V… view at source ↗

**Figure 3.** Figure 3: QM9 subset-size sweep (mean ± std, 3 seeds). Generative quality across Ntrain ∈ {128, 1000, 10000}. Uniqueness collapses for biased strategies at small N, recovering only at N=10,000. Novelty remains consistently lower for biased strategies at all sizes. D.2 VUN Component Curves for Planar [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Decomposition of VUN into Validity, Uniqueness, and Novelty (Planar, N=128, seed 0). Each panel shares the NLL curve (faded solid lines, left axis) with the main figure. Validity (left): Min-Degree First is the only strategy where Validity visibly suffers, indicating incomplete grammar acquisition. Uniqueness (center): all biased strategies generate repeated outputs over time, with Anchor Expansion degradi… view at source ↗

**Figure 5.** Figure 5: Full Diversity Saturation Grid. Diversity metrics across Planar (top), QM9 Subset (middle), and QM9 Full (bottom). Random Order maintains higher sequence diversity across all scales; biased strategies saturate early [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Full cross-evaluation breakdown (QM9, K=32, 3-seed mean ± std). Rows = training strategy; columns = evaluation strategy. Diagonal cells (native evaluation) are well-calibrated and low-NLL for all strategies. Off-diagonal cells expose the “specialist” brittleness of biased strategies: NLL/tok, linearization uncertainty, and ECE all increase substantially when the evaluation ordering differs from the train… view at source ↗

**Figure 7.** Figure 7: reports the predictive power of per-molecule permutation-based metrics for chemical stability on QM9, averaged across four strategies and three seeds. To compute the ROC-AUC without training a classifier, we use molecular stability (valency compliance) as the positive binary label and the raw metric as the continuous prediction score. We negate the metric values prior to computation so that lower scores pr… view at source ↗

**Figure 8.** Figure 8: QM9: Sequence Length Distribution (Test Set vs. Generated). Distribution of sequence lengths for algorithmically-linearized QM9 test graphs and model-generated sequences. Generated sequences are systematically longer across all strategies [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Mechanistic Analysis of Sequence Length. Top: Distribution of chordal back-references per molecule for test graphs and model-generated sequences (across 3 seeds). Despite similar node and edge counts, model-generated sequences incur ≈ 2× more back-references, explaining the length gap in [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines LU as variation in NLL across SENT linearizations and shows models can learn order artifacts instead of graphs, but the key claim that this variation proves failure to capture graph structure lacks justification.

read the letter

The core observation is that autoregressive graph generators trained on fixed linearizations assign lower NLL to their training order but show much higher calibration error on random permutations of the same graphs. LU, defined as the coefficient of variation of those NLL values, then correlates better with molecular stability on QM9 than raw NLL does (AUC 0.85 vs 0.43). The experiments compare four linearization strategies and report concrete ECE gaps of two orders of magnitude, with code released for inspection.

Referee Report

2 major / 1 minor

Summary. The paper introduces Linearization Uncertainty (LU) as the coefficient of variation in negative log-likelihood (NLL) assigned by autoregressive graph generators across multiple equivalent Segmented Eulerian Neighborhood Trails (SENT) linearizations of the same graph. Experiments train transformers on four linearization strategies across two datasets and report that biased orderings yield lower native-order NLL but two-orders-of-magnitude higher expected calibration error (ECE) under random permutation; on QM9, NLL correlates poorly with molecular stability (AUC=0.43) while LU achieves AUC=0.85, with code released at the cited GitHub repository.

Significance. If the interpretation holds, the work identifies a previously under-appreciated failure mode in likelihood-based evaluation of autoregressive graph models and supplies a practical, permutation-based diagnostic that better tracks downstream properties such as molecular stability. The open-source code and concrete AUC/ECE numbers constitute reproducible strengths that could affect evaluation protocols in graph generation.

major comments (2)

[Abstract] Abstract and the central claim: the assertion that biased models 'have learned their training linearization rather than the underlying graph' requires that distinct valid SENT linearizations are probabilistically equivalent (i.e., a graph-aware model must produce identical NLL values). No derivation is supplied showing that the autoregressive product ∏ P(token_t | tokens_<t) is invariant across different Eulerian trails that decode to the identical graph; different conditionals can arise even when the final graph is the same.
[Results] Experimental claims on QM9 and the second dataset: the reported AUC values (0.43 for NLL, 0.85 for LU) and ECE differences rest on the assumption that the multiple SENT linearizations are strictly equivalent representations. Without explicit verification that the linearization process itself does not introduce systematic differences in the conditional sequence, the variation in NLL cannot be unambiguously attributed to linearization learning versus graph structure.

minor comments (1)

[Abstract] The abstract refers to 'two datasets' but names only QM9; the identity and statistics of the second dataset should be stated explicitly in the abstract or early methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We provide point-by-point responses to the major comments and outline the revisions we will make to address the concerns raised.

read point-by-point responses

Referee: [Abstract] Abstract and the central claim: the assertion that biased models 'have learned their training linearization rather than the underlying graph' requires that distinct valid SENT linearizations are probabilistically equivalent (i.e., a graph-aware model must produce identical NLL values). No derivation is supplied showing that the autoregressive product ∏ P(token_t | tokens_<t) is invariant across different Eulerian trails that decode to the identical graph; different conditionals can arise even when the final graph is the same.

Authors: We agree that the manuscript would benefit from a clearer justification. While we do not assert that NLL must be exactly identical (as different sequences can have different probabilities even for the same graph), our use of the coefficient of variation in LU measures the relative consistency. A graph-aware model should not exhibit large discrepancies in assigned likelihoods for equivalent representations. We will add a new subsection in the methods or discussion providing a conceptual argument for why low LU is expected for structure-learning models, including why the autoregressive conditionals should lead to similar NLL when the underlying graph is the same. This addresses the lack of derivation by offering an intuitive and empirical grounding. revision: yes
Referee: [Results] Experimental claims on QM9 and the second dataset: the reported AUC values (0.43 for NLL, 0.85 for LU) and ECE differences rest on the assumption that the multiple SENT linearizations are strictly equivalent representations. Without explicit verification that the linearization process itself does not introduce systematic differences in the conditional sequence, the variation in NLL cannot be unambiguously attributed to linearization learning versus graph structure.

Authors: We acknowledge the need for explicit verification. In the revised version, we will include an analysis of the SENT linearization process, demonstrating that all linearizations decode to the identical graph and that sequence-level statistics (such as average length and token frequencies) are comparable across strategies. Additionally, we will report the variance of NLL within the same linearization type to isolate the effect of different linearizations. These additions will ensure that the observed differences in LU and the superior correlation of LU with molecular stability can be confidently attributed to the model's sensitivity to linearization rather than artifacts of the encoding process. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical definitions and measurements are self-contained

full rationale

The paper defines Linearization Uncertainty (LU) directly as the coefficient of variation of observed NLL values across multiple SENT linearizations of the same graph and reports all key results (NLL comparisons, ECE under permutation, AUC correlations on QM9) as direct experimental outcomes from training and evaluating autoregressive models. No equation or derivation reduces these quantities to fitted parameters or prior results by construction. The central claims rest on measurable data variation rather than any self-referential loop, self-citation chain, or ansatz smuggled through prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the domain assumption that SENT produces multiple truly equivalent linearizations of any graph and that invariance of likelihood across them is required for meaningful autoregressive modeling.

axioms (1)

domain assumption Multiple SENT linearizations of the same graph are equivalent for the purpose of likelihood assignment
Invoked to justify that variation in NLL indicates model failure to capture graph structure rather than linearization differences.

invented entities (1)

Linearization Uncertainty (LU) no independent evidence
purpose: Quantify inconsistency of negative log-likelihood across permutation-equivalent linearizations
Newly introduced metric computed as coefficient of variation; no independent evidence provided beyond the reported correlations.

pith-pipeline@v0.9.0 · 5527 in / 1432 out tokens · 50921 ms · 2026-05-10T19:46:01.534837+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Autoregressive models on permutation-invariant objects such as graphs and molecules need to linearize them into sequences before applying the chain rule of probability... any distribution over graphs induces equal likelihood across all linearizations of G.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

S., and Karpatne, A

Bu, J., Mehrab, K. S., and Karpatne, A. (2023). Let there be order: Rethinking ordering in autoregres- sive graph generation

work page 2023
[2]

Chen, D., Krimmel, M., and Borgwardt, K. (2025). Flatten graphs as sequences: Transformers are scal- able graph generators. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems

work page 2025
[3]

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1321–1330. JMLR.org

work page 2017
[4]

Welling, M. (2022). Equivariant diffusion for molecule generation in 3D. In Chaudhuri, K.,

work page 2022
[5]

Krimmel, M., Hartout, P., Borgwardt, K., and Chen, D. (2025). Polygraph discrepancy: a classifier-based metric for graph generation

work page 2025
[6]

Martinkus, K., Loukas, A., Perraudin, N., and Wat- tenhofer, R. (2022). Spectre: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. InInternational Conference on Machine Learning, pages 15159–15179. PMLR

work page 2022
[7]

Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. (2018). Fr´ echet ChemNet dis- tance: A metric for generative models for molecules in drug discovery.Journal of Chemical Information and Modeling, 58(9):1736–1741

work page 2018
[8]

Lilienfeld, O. A. (2014). Quantum chemistry struc- tures and properties of 134 kilo molecules.Scientific Data, 1:140022

work page 2014
[9]

and Frossard, P

Vignac, C. and Frossard, P. (2022). Top-n: Equivari- ant set and graph generation without exchangeabil- ity. InInternational Conference on Learning Repre- sentations

work page 2022
[10]

Cevher, V., and Frossard, P. (2023). Digress: Dis- crete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations

work page 2023
[11]

Vinyals, O., Bengio, S., and Kudlur, M. (2016). Order matters: Sequence to sequence for sets. In Bengio, Y. and LeCun, Y., editors,4th International Con- ference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings

work page 2016
[12]

Wang, Y., Wang, J., Cao, Z., and Barati Farimani, A. (2022). Molecular contrastive learning of represen- tations via graph neural networks.Nature Machine Intelligence, 4:279–287

work page 2022
[13]

Leskovec, J. (2018). Graphrnn: Generating real- istic graphs with deep auto-regressive models. In International conference on machine learning, pages 5708–5717. PMLR. Same Graph, Different Likelihoods: Supplementary Materials A Metric Definitions All percentage metrics (Validity, Uniqueness, Novelty, Atm. Stable, Mol. Stable) are reported as fractions of ...

work page 2018
[14]

Note that while biased strategies exhibit lower Novelty compared to Random Order, we do not consider this a degradation in generative quality

Fredsgaard, Thomas, Andersen, Schmidt, and Sugiyama B Generation Quality on QM9 Table 3 compares all strategies on the full QM9 dataset on standard molecular generation metrics. Note that while biased strategies exhibit lower Novelty compared to Random Order, we do not consider this a degradation in generative quality. As noted by prior work (Vignac et al...

work page 2023

[1] [1]

S., and Karpatne, A

Bu, J., Mehrab, K. S., and Karpatne, A. (2023). Let there be order: Rethinking ordering in autoregres- sive graph generation

work page 2023

[2] [2]

Chen, D., Krimmel, M., and Borgwardt, K. (2025). Flatten graphs as sequences: Transformers are scal- able graph generators. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems

work page 2025

[3] [3]

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1321–1330. JMLR.org

work page 2017

[4] [4]

Welling, M. (2022). Equivariant diffusion for molecule generation in 3D. In Chaudhuri, K.,

work page 2022

[5] [5]

Krimmel, M., Hartout, P., Borgwardt, K., and Chen, D. (2025). Polygraph discrepancy: a classifier-based metric for graph generation

work page 2025

[6] [6]

Martinkus, K., Loukas, A., Perraudin, N., and Wat- tenhofer, R. (2022). Spectre: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. InInternational Conference on Machine Learning, pages 15159–15179. PMLR

work page 2022

[7] [7]

Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. (2018). Fr´ echet ChemNet dis- tance: A metric for generative models for molecules in drug discovery.Journal of Chemical Information and Modeling, 58(9):1736–1741

work page 2018

[8] [8]

Lilienfeld, O. A. (2014). Quantum chemistry struc- tures and properties of 134 kilo molecules.Scientific Data, 1:140022

work page 2014

[9] [9]

and Frossard, P

Vignac, C. and Frossard, P. (2022). Top-n: Equivari- ant set and graph generation without exchangeabil- ity. InInternational Conference on Learning Repre- sentations

work page 2022

[10] [10]

Cevher, V., and Frossard, P. (2023). Digress: Dis- crete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations

work page 2023

[11] [11]

Vinyals, O., Bengio, S., and Kudlur, M. (2016). Order matters: Sequence to sequence for sets. In Bengio, Y. and LeCun, Y., editors,4th International Con- ference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings

work page 2016

[12] [12]

Wang, Y., Wang, J., Cao, Z., and Barati Farimani, A. (2022). Molecular contrastive learning of represen- tations via graph neural networks.Nature Machine Intelligence, 4:279–287

work page 2022

[13] [13]

Leskovec, J. (2018). Graphrnn: Generating real- istic graphs with deep auto-regressive models. In International conference on machine learning, pages 5708–5717. PMLR. Same Graph, Different Likelihoods: Supplementary Materials A Metric Definitions All percentage metrics (Validity, Uniqueness, Novelty, Atm. Stable, Mol. Stable) are reported as fractions of ...

work page 2018

[14] [14]

Note that while biased strategies exhibit lower Novelty compared to Random Order, we do not consider this a degradation in generative quality

Fredsgaard, Thomas, Andersen, Schmidt, and Sugiyama B Generation Quality on QM9 Table 3 compares all strategies on the full QM9 dataset on standard molecular generation metrics. Note that while biased strategies exhibit lower Novelty compared to Random Order, we do not consider this a degradation in generative quality. As noted by prior work (Vignac et al...

work page 2023