Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings
Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3
The pith
Autoregressive graph generators assign inconsistent likelihoods to the same graph under different valid orderings, revealing they learn training sequences rather than graph structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autoregressive graph generators assign varying negative log-likelihoods to different linearizations of the identical graph when trained on biased orderings. This variation shows the models have learned the specific training linearization rather than the underlying graph. The coefficient of variation across equivalent SENT linearizations, called Linearization Uncertainty, serves as a more reliable evaluation metric than raw likelihood, achieving an AUC of 0.85 versus 0.43 for predicting molecular stability on QM9.
What carries the argument
Linearization Uncertainty (LU), the coefficient of variation of assigned negative log-likelihood across multiple equivalent SENT linearizations of the same graph.
If this is right
- Biased-order training produces models with expected calibration error two orders of magnitude higher under random permutation than under their native ordering.
- Raw NLL from such models is negatively correlated with molecular stability on QM9, while Linearization Uncertainty is positively correlated and yields higher AUC.
- Permutation-based evaluation supplies a more reliable quality check for generated graphs than standard likelihood.
- Consistent likelihood across linearizations requires training or evaluation strategies that expose the model to multiple orderings of each graph.
Where Pith is reading between the lines
- Training autoregressive graph generators with randomly sampled linearizations at each step could reduce Linearization Uncertainty and improve generalization to new orderings.
- The same inconsistency may appear in other autoregressive models over objects with multiple sequential encodings, such as trees or sequences with equivalent reorderings.
- Linearization Uncertainty could be added as a regularization term to encourage the model to produce stable likelihoods across permutations.
- Inherently permutation-invariant generators might be developed by replacing sequential autoregressive construction with order-agnostic mechanisms.
Load-bearing premise
The multiple SENT linearizations are truly equivalent representations of the identical graph, so variation in assigned NLL reflects the model's failure to learn the graph structure rather than differences in the linearization process.
What would settle it
Train a model by sampling random SENT linearizations for each graph at every training step, then check whether expected calibration error under random permutations falls by at least one order of magnitude compared to biased-order training.
Figures
read the original abstract
Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Linearization Uncertainty (LU) as the coefficient of variation in negative log-likelihood (NLL) assigned by autoregressive graph generators across multiple equivalent Segmented Eulerian Neighborhood Trails (SENT) linearizations of the same graph. Experiments train transformers on four linearization strategies across two datasets and report that biased orderings yield lower native-order NLL but two-orders-of-magnitude higher expected calibration error (ECE) under random permutation; on QM9, NLL correlates poorly with molecular stability (AUC=0.43) while LU achieves AUC=0.85, with code released at the cited GitHub repository.
Significance. If the interpretation holds, the work identifies a previously under-appreciated failure mode in likelihood-based evaluation of autoregressive graph models and supplies a practical, permutation-based diagnostic that better tracks downstream properties such as molecular stability. The open-source code and concrete AUC/ECE numbers constitute reproducible strengths that could affect evaluation protocols in graph generation.
major comments (2)
- [Abstract] Abstract and the central claim: the assertion that biased models 'have learned their training linearization rather than the underlying graph' requires that distinct valid SENT linearizations are probabilistically equivalent (i.e., a graph-aware model must produce identical NLL values). No derivation is supplied showing that the autoregressive product ∏ P(token_t | tokens_<t) is invariant across different Eulerian trails that decode to the identical graph; different conditionals can arise even when the final graph is the same.
- [Results] Experimental claims on QM9 and the second dataset: the reported AUC values (0.43 for NLL, 0.85 for LU) and ECE differences rest on the assumption that the multiple SENT linearizations are strictly equivalent representations. Without explicit verification that the linearization process itself does not introduce systematic differences in the conditional sequence, the variation in NLL cannot be unambiguously attributed to linearization learning versus graph structure.
minor comments (1)
- [Abstract] The abstract refers to 'two datasets' but names only QM9; the identity and statistics of the second dataset should be stated explicitly in the abstract or early methods section.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We provide point-by-point responses to the major comments and outline the revisions we will make to address the concerns raised.
read point-by-point responses
-
Referee: [Abstract] Abstract and the central claim: the assertion that biased models 'have learned their training linearization rather than the underlying graph' requires that distinct valid SENT linearizations are probabilistically equivalent (i.e., a graph-aware model must produce identical NLL values). No derivation is supplied showing that the autoregressive product ∏ P(token_t | tokens_<t) is invariant across different Eulerian trails that decode to the identical graph; different conditionals can arise even when the final graph is the same.
Authors: We agree that the manuscript would benefit from a clearer justification. While we do not assert that NLL must be exactly identical (as different sequences can have different probabilities even for the same graph), our use of the coefficient of variation in LU measures the relative consistency. A graph-aware model should not exhibit large discrepancies in assigned likelihoods for equivalent representations. We will add a new subsection in the methods or discussion providing a conceptual argument for why low LU is expected for structure-learning models, including why the autoregressive conditionals should lead to similar NLL when the underlying graph is the same. This addresses the lack of derivation by offering an intuitive and empirical grounding. revision: yes
-
Referee: [Results] Experimental claims on QM9 and the second dataset: the reported AUC values (0.43 for NLL, 0.85 for LU) and ECE differences rest on the assumption that the multiple SENT linearizations are strictly equivalent representations. Without explicit verification that the linearization process itself does not introduce systematic differences in the conditional sequence, the variation in NLL cannot be unambiguously attributed to linearization learning versus graph structure.
Authors: We acknowledge the need for explicit verification. In the revised version, we will include an analysis of the SENT linearization process, demonstrating that all linearizations decode to the identical graph and that sequence-level statistics (such as average length and token frequencies) are comparable across strategies. Additionally, we will report the variance of NLL within the same linearization type to isolate the effect of different linearizations. These additions will ensure that the observed differences in LU and the superior correlation of LU with molecular stability can be confidently attributed to the model's sensitivity to linearization rather than artifacts of the encoding process. revision: yes
Circularity Check
No significant circularity; empirical definitions and measurements are self-contained
full rationale
The paper defines Linearization Uncertainty (LU) directly as the coefficient of variation of observed NLL values across multiple SENT linearizations of the same graph and reports all key results (NLL comparisons, ECE under permutation, AUC correlations on QM9) as direct experimental outcomes from training and evaluating autoregressive models. No equation or derivation reduces these quantities to fitted parameters or prior results by construction. The central claims rest on measurable data variation rather than any self-referential loop, self-citation chain, or ansatz smuggled through prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Multiple SENT linearizations of the same graph are equivalent for the purpose of likelihood assignment
invented entities (1)
-
Linearization Uncertainty (LU)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Autoregressive models on permutation-invariant objects such as graphs and molecules need to linearize them into sequences before applying the chain rule of probability... any distribution over graphs induces equal likelihood across all linearizations of G.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bu, J., Mehrab, K. S., and Karpatne, A. (2023). Let there be order: Rethinking ordering in autoregres- sive graph generation
work page 2023
-
[2]
Chen, D., Krimmel, M., and Borgwardt, K. (2025). Flatten graphs as sequences: Transformers are scal- able graph generators. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems
work page 2025
-
[3]
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1321–1330. JMLR.org
work page 2017
-
[4]
Welling, M. (2022). Equivariant diffusion for molecule generation in 3D. In Chaudhuri, K.,
work page 2022
-
[5]
Krimmel, M., Hartout, P., Borgwardt, K., and Chen, D. (2025). Polygraph discrepancy: a classifier-based metric for graph generation
work page 2025
-
[6]
Martinkus, K., Loukas, A., Perraudin, N., and Wat- tenhofer, R. (2022). Spectre: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. InInternational Conference on Machine Learning, pages 15159–15179. PMLR
work page 2022
-
[7]
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. (2018). Fr´ echet ChemNet dis- tance: A metric for generative models for molecules in drug discovery.Journal of Chemical Information and Modeling, 58(9):1736–1741
work page 2018
-
[8]
Lilienfeld, O. A. (2014). Quantum chemistry struc- tures and properties of 134 kilo molecules.Scientific Data, 1:140022
work page 2014
-
[9]
Vignac, C. and Frossard, P. (2022). Top-n: Equivari- ant set and graph generation without exchangeabil- ity. InInternational Conference on Learning Repre- sentations
work page 2022
-
[10]
Cevher, V., and Frossard, P. (2023). Digress: Dis- crete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations
work page 2023
-
[11]
Vinyals, O., Bengio, S., and Kudlur, M. (2016). Order matters: Sequence to sequence for sets. In Bengio, Y. and LeCun, Y., editors,4th International Con- ference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings
work page 2016
-
[12]
Wang, Y., Wang, J., Cao, Z., and Barati Farimani, A. (2022). Molecular contrastive learning of represen- tations via graph neural networks.Nature Machine Intelligence, 4:279–287
work page 2022
-
[13]
Leskovec, J. (2018). Graphrnn: Generating real- istic graphs with deep auto-regressive models. In International conference on machine learning, pages 5708–5717. PMLR. Same Graph, Different Likelihoods: Supplementary Materials A Metric Definitions All percentage metrics (Validity, Uniqueness, Novelty, Atm. Stable, Mol. Stable) are reported as fractions of ...
work page 2018
-
[14]
Fredsgaard, Thomas, Andersen, Schmidt, and Sugiyama B Generation Quality on QM9 Table 3 compares all strategies on the full QM9 dataset on standard molecular generation metrics. Note that while biased strategies exhibit lower Novelty compared to Random Order, we do not consider this a degradation in generative quality. As noted by prior work (Vignac et al...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.