
arxiv: 2605.04834 · v1 · submitted 2026-05-06 · 💻 cs.LG

Bridging Input Feature Spaces Towards Graph Foundation Models

Pith reviewed 2026-05-08 17:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph transfer learning · input feature invariance · random projections · covariance operators · graph foundation models · node representations · orthogonal invariance

The pith

Projecting node features into a shared random space and computing covariance statistics produces representations that are invariant in distribution to permutations of the input features, with an expected operator that is also invariant to orthogonal transformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Graph learning lacks a shared input space because node features vary in semantics, ranges, and dimensions across datasets, which blocks generalization and foundation-model use. ALL-IN projects features randomly into a common space then builds representations from covariance operators, removing dependence on the original feature space. The resulting operators and representations stay distributionally invariant under feature permutations, while their expectation is invariant under orthogonal transformations. Models using these representations deliver strong performance on entirely new datasets with unseen features, without architecture changes or retraining. This supplies a concrete route to input-agnostic, transferable graph models.

Core claim

The ALL-IN method projects node features into a shared random space and constructs representations via covariance-based statistics, eliminating dependence on the original feature space. The computed node-covariance operators and resulting node representations are invariant in distribution to permutations of the input features. The expected operator further exhibits invariance to general orthogonal transformations of the input features. Empirically this yields strong transfer across node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining.

What carries the argument

Random projection of node features into a shared space followed by covariance-based statistics to form node representations
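
A minimal sketch of this construction may help make it concrete. It is not the authors' implementation: the Gaussian projection matrix `C`, the uncentered form `K = R Rᵀ / h`, and the names `node_covariance`, `X_a`, and `X_b` are all assumptions made for illustration, and the paper's operator may include centering or normalization.

```python
import numpy as np

def node_covariance(X, h=256, seed=0):
    """Project n x d features to n x h, then build an n x n operator.

    Assumptions (not taken from the paper): C has i.i.d. standard normal
    entries, and the operator is the uncentered second moment R R^T / h.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    C = rng.standard_normal((d, h))  # random projection into the shared space
    R = X @ C                        # projected node features, shape (n, h)
    K = (R @ R.T) / h                # node-by-node operator, shape (n, n)
    return R, K

rng = np.random.default_rng(1)
X_a = rng.standard_normal((6, 5))    # hypothetical dataset: 6 nodes, 5 features
X_b = rng.standard_normal((6, 12))   # hypothetical dataset: 6 nodes, 12 features

R_a, K_a = node_covariance(X_a)
R_b, K_b = node_covariance(X_b)
print(R_a.shape, R_b.shape)  # both projected matrices have h columns
print(K_a.shape, K_b.shape)  # both operators are n x n, whatever d was
```

The point of the sketch is only the shapes: once the features are projected and the node-by-node operator is formed, nothing downstream needs to know the original feature dimension.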

If this is right

  • A single model can be trained once and applied directly to new graph datasets whose features differ in dimension, range, or semantics.
  • Transfer succeeds on both node classification and graph classification tasks without task-specific fine-tuning.
  • Node representations stay consistent in distribution under any permutation of the input features, and the expected covariance operator is unchanged under orthogonal transformations of them.
  • No dataset-specific feature alignment or preprocessing is required for cross-dataset use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection-plus-covariance construction might extend to other domains where input spaces are unaligned, such as heterogeneous time-series or multi-modal graphs.
  • If the invariance holds in practice, foundation-model training could shift from semantic feature matching to learning over these statistical invariants.
  • Scalability tests on very large graphs would clarify whether the random projection dimension needs to grow with graph size or remains constant.
  • Combining ALL-IN with existing graph pre-training objectives could produce models that transfer across both features and tasks simultaneously.

Load-bearing premise

Random projection into a shared space followed by covariance statistics preserves enough task-relevant signal across arbitrary real-world feature distributions so that invariance produces useful transfer performance without additional adaptation.

What would settle it

Measure whether the computed node representations remain statistically similar when the original input features are randomly permuted or subjected to a random orthogonal transformation; if transfer accuracy on held-out datasets with remapped features falls to near-chance levels, the practical utility of the invariance claim is refuted.
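
The first half of this check can be sketched numerically. The snippet below is an illustration under assumptions, not the paper's protocol: it uses a Gaussian projection and the uncentered operator `R Rᵀ / h`, and the helper names `operator` and `mean_operator` are hypothetical. It tests the operator only; whether transfer accuracy survives remapped features still needs the trained model and held-out datasets.

```python
import numpy as np

def operator(X, C):
    """Uncentered node-covariance operator for a given projection matrix C."""
    R = X @ C
    return (R @ R.T) / C.shape[1]

def mean_operator(X, h, draws, seed):
    """Monte Carlo estimate of the expected operator over fresh projections."""
    rng = np.random.default_rng(seed)
    acc = np.zeros((X.shape[0], X.shape[0]))
    for _ in range(draws):
        acc += operator(X, rng.standard_normal((X.shape[1], h)))
    return acc / draws

rng = np.random.default_rng(0)
n, d, h = 8, 5, 64
X = rng.standard_normal((n, d))

# Feature permutation: X[:, perm] @ C equals X @ C_shuffled, where C_shuffled
# is C with its rows reordered. Rows of C are i.i.d., so C_shuffled has the
# same distribution as C, and so do the two operators.
perm = rng.permutation(d)
C = rng.standard_normal((d, h))
C_shuffled = np.empty_like(C)
C_shuffled[perm] = C
K_perm = operator(X[:, perm], C)
K_shuffled = operator(X, C_shuffled)

# Orthogonal transformation: the averaged operator should approach X X^T
# whether or not the features are rotated by Q first.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
gram = X @ X.T
scale = np.abs(gram).max()
err_orig = np.abs(mean_operator(X, h, 2000, seed=1) - gram).max() / scale
err_rot = np.abs(mean_operator(X @ Q, h, 2000, seed=2) - gram).max() / scale
print(np.allclose(K_perm, K_shuffled), err_orig, err_rot)
```

Both relative errors should be small and of similar size, since under these assumptions the rotated and unrotated features share the same expected operator.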

Figures

Figures reproduced from arXiv: 2605.04834 by Beatrice Bevilacqua, Bruno Ribeiro, Carola-Bibiane Schönlieb, Krishna Sri Ipsit Mantri, Moshe Eliasof.

Figure 1: Addressing feature heterogeneity with ALL-IN's node-covariance operators. (a) When a GNN is trained on graph data with node features $X$ of dimension $d$, it cannot be directly applied on graphs with features of a different dimensionality $d'$. (b) ALL-IN computes $n \times n$ node-covariance operators, capturing node similarities, providing a common space that is independent of the original, heterogeneous, feature s…

Figure 2: The ALL-IN Architecture. Input node features $X$ are first randomly projected into $R^{(0)}$. This $R^{(0)}$ serves as initial node representations $H^{(0)}$. Concurrently, $R^{(0)}$ and its propagated versions (e.g., $R^{(p)} = A^{p} R^{(0)}$) are used to compute a set of node-covariance matrices $\{K^{(p)}\}_{p=0}^{k}$ capturing diverse orders of feature-based node similarities. These matrices are used as operators within different GNN (sub-)layer…
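
The propagated operators described in the Figure 2 caption can be sketched as follows. The caption gives only the propagation rule for the projected features, so several things here are assumptions: `A` is taken to be a symmetrically normalized adjacency with self-loops, the operator is the uncentered `R Rᵀ / h`, and the function names and the 4-node path graph are illustrative.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops (an assumed choice of A)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagated_operators(A, X, h=128, k=2, seed=0):
    """Return [K(0), ..., K(k)], where K(p) is built from R(p) = A^p R(0)."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((X.shape[1], h))  # assumed Gaussian projection
    R = X @ C                                 # R(0)
    ops = []
    for _ in range(k + 1):
        ops.append((R @ R.T) / h)             # operator at the current order
        R = A @ R                             # advance to the next order
    return ops

# Hypothetical 4-node path graph 0-1-2-3 with 3 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = normalized_adjacency(A)
X = np.random.default_rng(1).standard_normal((4, 3))
ops = propagated_operators(A_norm, X)
print(len(ops), ops[0].shape)
```

With this uncentered form, each higher-order operator is the previous one conjugated by `A`, which is one way to see how graph structure enters the feature-based similarities.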
Original abstract

Unlike vision and language domains, graph learning lacks a shared input space, as input features differ across graph datasets not only in semantics, but also in value ranges and dimensionality. This misalignment prevents graph models from generalizing across datasets, limiting their use as foundation models. In this work, we propose ALL-IN, a simple and theoretically grounded method that enables transferability across datasets with different input features. Our approach projects node features into a shared random space and constructs representations via covariance-based statistics, thus eliminating dependence on the original feature space. We show that the computed node-covariance operators and the resulting node representations are invariant in distribution to permutations of the input features. We further demonstrate that the expected operator exhibits invariance to general orthogonal transformations of the input features. Empirically, ALL-IN achieves strong performance across diverse node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining. These results point to a promising direction for input-agnostic, transferable graph models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ALL-IN, which projects arbitrary node features into a fixed random space and derives node representations from covariance operators. It claims these representations are invariant in distribution to input feature permutations and that the expected operator is invariant to orthogonal transformations of the features. The method is positioned as enabling zero-shot transfer of a fixed GNN across graph datasets with mismatched feature spaces, dimensions, and semantics, with empirical results reported on diverse node- and graph-level tasks without retraining or architecture changes.

Significance. If the invariance properties translate into reliable preservation of task-relevant signal, the approach would address a core obstacle to graph foundation models by removing dependence on dataset-specific input features. The theoretical component (distributional invariance under permutation and expected-operator invariance under orthogonal maps) is a clear strength, as is the attempt at parameter-free transfer; however, the practical utility hinges on whether random projections retain discriminative information across semantically unrelated feature distributions.

major comments (2)
  1. Abstract and §3: The proofs establish distributional invariance of the covariance operator to column permutations of X and invariance of the expected operator under X' = XQ for orthogonal Q. These properties follow directly from the construction but address only basis changes within a single feature space; they do not establish that the projected second-order statistics remain close in distribution or retain task signal when the original features X come from unrelated semantic domains (e.g., degree sequences versus pretrained embeddings), which is required for the zero-shot transfer claim.
  2. §4: The reported cross-dataset performance is presented as evidence that invariance enables transfer, yet no controls are described that isolate whether performance degrades when the random projection dimension is varied or when the projection matrix is re-sampled, nor are there comparisons against simple baselines that also use fixed random projections without the covariance step. This leaves open whether the invariance itself, rather than incidental properties of the projection, drives the results.
minor comments (2)
  1. Notation: The distinction between the random projection matrix and the resulting covariance operator should be made explicit in the first paragraph of §3 to avoid ambiguity when reading the invariance statements.
  2. Experiments: Table 1 (or equivalent) would benefit from an additional column reporting the original feature dimensionality of each dataset to make the degree of input misalignment concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, providing clarifications on the theoretical results and agreeing to strengthen the empirical analysis with additional controls.

  1. Referee: Abstract and §3: The proofs establish distributional invariance of the covariance operator to column permutations of X and invariance of the expected operator under X' = XQ for orthogonal Q. These properties follow directly from the construction but address only basis changes within a single feature space; they do not establish that the projected second-order statistics remain close in distribution or retain task signal when the original features X come from unrelated semantic domains (e.g., degree sequences versus pretrained embeddings), which is required for the zero-shot transfer claim.

    Authors: We agree that the proven invariances apply to transformations (permutations of columns and orthogonal maps) within a given feature matrix X. These results establish that the covariance operator—and thus the node representations—are independent of the specific ordering or basis of the original features, which directly supports applying a fixed GNN across datasets whose input features differ in dimension and semantics. The theory does not claim that second-order statistics from unrelated semantic domains (such as degrees versus embeddings) will necessarily be close in distribution; signal preservation in such cases is an empirical question. Our cross-dataset experiments provide evidence that the random projection plus covariance construction retains sufficient task-relevant information for the evaluated tasks. We will revise the abstract and §3 to more clearly separate the scope of the formal invariances from the empirical transfer results. revision: partial

  2. Referee: §4: The reported cross-dataset performance is presented as evidence that invariance enables transfer, yet no controls are described that isolate whether performance degrades when the random projection dimension is varied or when the projection matrix is re-sampled, nor are there comparisons against simple baselines that also use fixed random projections without the covariance step. This leaves open whether the invariance itself, rather than incidental properties of the projection, drives the results.

    Authors: We acknowledge that the current experiments would benefit from these controls. In the revised manuscript we will add (i) performance curves for varying projection dimensions, (ii) results across multiple independent draws of the random projection matrix to demonstrate stability, and (iii) direct comparisons against baselines that feed randomly projected node features into the same GNN without the covariance operator. These additions will help isolate the contribution of the covariance-based representations. revision: yes
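
A rough sketch of what controls (i) and (ii) could look like at the level of the operator alone is given below. It measures how far a single random draw of the operator lands from its expectation as the projection dimension `h` grows; it does not measure downstream accuracy, which needs the trained model. The Gaussian projection, the uncentered operator, and the names `operator` and `mean_rel_dev` are assumptions for illustration.

```python
import numpy as np

def operator(X, h, rng):
    """One draw of the uncentered operator with a fresh Gaussian projection."""
    C = rng.standard_normal((X.shape[1], h))
    R = X @ C
    return (R @ R.T) / h

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 7))   # hypothetical features: 10 nodes, 7 dims
gram = X @ X.T                     # expected operator under these assumptions

def mean_rel_dev(h, draws=200):
    """Average relative Frobenius distance to the expectation over re-samples."""
    devs = [
        np.linalg.norm(operator(X, h, rng) - gram) / np.linalg.norm(gram)
        for _ in range(draws)
    ]
    return float(np.mean(devs))

deviation = {h: mean_rel_dev(h) for h in (16, 64, 256, 1024)}
for h, dev in deviation.items():
    print(h, round(dev, 3))
```

If the deviation shrinks steadily with `h`, re-sampling the projection matrix should matter less at larger projection dimensions, which is the stability property the promised experiments would need to confirm on real tasks.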

Circularity Check

0 steps flagged

No circularity: invariance properties are direct mathematical consequences of the random projection and covariance construction


The paper defines ALL-IN explicitly as random projection of node features into a shared space followed by covariance-based statistics. The abstract states that the resulting operators and representations 'are invariant in distribution to permutations of the input features' and that 'the expected operator exhibits invariance to general orthogonal transformations.' These are standard algebraic consequences of the chosen operators (a column permutation of the features can be absorbed into the random projection matrix without changing its distribution; the covariance operator is invariant in expectation under an orthogonal change of basis), not derived predictions or fitted quantities. No parameters are tuned on data subsets and then re-used as 'predictions,' no self-citations are invoked to justify uniqueness or ansatzes, and the empirical transfer results are presented separately as experimental outcomes rather than logical entailments. The derivation chain is therefore self-contained and externally verifiable by direct substitution into the defining equations.
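
The substitution can be written out in a few lines. This is a sketch under assumptions that this page does not confirm: the projection matrix $C \in \mathbb{R}^{d \times h}$ has i.i.d. zero-mean, unit-variance entries, the projected features are $R = XC$, and the operator is the uncentered form $K = \tfrac{1}{h} R R^\top$. The paper's exact definition may add centering or normalization. Here $P$ is a $d \times d$ permutation matrix, $Q$ is a $d \times d$ orthogonal matrix, and $\bar{K}$ denotes the operator computed from the transformed features.

```latex
\begin{align*}
\text{Permutation } P:\quad
\bar{K} &= \tfrac{1}{h}\,(XPC)(XPC)^\top
         = \tfrac{1}{h}\,X\,(PC)(PC)^\top X^\top, \\
PC &\overset{d}{=} C
   \quad\text{(the rows of $C$ are i.i.d., so reordering them leaves its law unchanged)}, \\
\Rightarrow\quad \bar{K} &\overset{d}{=} \tfrac{1}{h}\,X C C^\top X^\top = K. \\[6pt]
\text{Orthogonal } Q:\quad
\mathbb{E}\!\left[C C^\top\right] &= h\, I_d
\quad\Rightarrow\quad
\mathbb{E}[K] = X X^\top, \\
\mathbb{E}[\bar{K}] &= \tfrac{1}{h}\, X Q\, \mathbb{E}\!\left[C C^\top\right] Q^\top X^\top
                    = X Q Q^\top X^\top = X X^\top = \mathbb{E}[K].
\end{align*}
```

Under these assumptions the permutation result holds for the whole distribution, while the orthogonal result is shown only for the mean, which matches the scope of the two claims in the abstract.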

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear-algebraic properties of random projections and covariance but introduces domain assumptions about information preservation across feature spaces.

axioms (2)
  • domain assumption: Random projection into a fixed shared space preserves the distributional properties needed for covariance invariance under feature permutation.
    Invoked to claim that node representations become independent of the original feature space.
  • domain assumption: Covariance operators computed after projection remain statistically unchanged under orthogonal transformations of the input.
    Used to establish the expected-operator invariance result.

pith-pipeline@v0.9.0 · 5485 in / 1336 out tokens · 75421 ms · 2026-05-08T17:12:40.200601+00:00 · methodology

