
arxiv: 2606.03681 · v1 · pith:PV3UVV2Inew · submitted 2026-06-02 · 💻 cs.LG

Speedrunning Tabular Foundation Model Pretraining

Pith reviewed 2026-06-28 11:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords: tabular foundation models · pretraining speedups · speedrun · nanoTabPFN · TabArena · ROC AUC target · community leaderboard

The pith

A community speedrun protocol reaches tabular foundation model pretraining targets in 0.92 minutes, an 81x improvement over the baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a speedrun format for nanoTabPFN pretraining in which contributors edit a single-file script to reach a fixed downstream ROC AUC target faster on subsampled TabArena with one L40S GPU. This setup creates an open leaderboard where speedups can be added, verified, and stacked by the community. The current record uses 22x fewer synthetic datasets while cutting time from 74.32 minutes to 0.92 minutes. A sympathetic reader would care because pretraining cost currently limits how often researchers can try new architectures, priors, or optimizers. The protocol aims to shorten that iteration cycle through simple, comparable benchmarks.

Core claim

The authors define a speedrun challenge with a fixed ROC AUC target on subsampled TabArena and a single NVIDIA L40S GPU. This gives participants a standardized protocol: modify the training script and compete directly on pretraining time. The best entry currently reaches the target in 0.92 minutes versus the 74.32-minute baseline while requiring 22x fewer synthetic datasets.
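The headline ratio can be checked directly from the two reported wall-clock times. The short Python sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Wall-clock times reported for the speedrun, in minutes.
baseline_minutes = 74.32
record_minutes = 0.92

speedup = baseline_minutes / record_minutes
record_seconds = record_minutes * 60

print(f"speedup: {speedup:.1f}x")         # speedup: 80.8x, reported as 81x
print(f"record: {record_seconds:.1f} s")  # record: 55.2 s
```

The exact ratio is about 80.8, which rounds to the 81x quoted in the abstract.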

What carries the argument

The speedrun protocol: a fixed downstream performance target, single-file training script, and public leaderboard that enables verification and stacking of pretraining modifications.
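A time-to-target measurement of this kind might be structured roughly as follows. This is a minimal sketch, not the repository's harness: `time_to_target`, `ToyRun`, the evaluation interval, and the choice to count evaluation time inside the clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


def time_to_target(
    train_step: Callable[[], None],
    evaluate: Callable[[], float],
    target: float,
    eval_every: int = 10,
    max_steps: int = 10_000,
) -> Optional[Tuple[float, int]]:
    """Train until evaluate() reaches target.

    Returns (elapsed_seconds, steps_taken), or None if max_steps is exhausted.
    Assumption: evaluation time is included in the measured wall-clock time.
    """
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()
        # Evaluate only every eval_every steps to limit overhead.
        if step % eval_every == 0 and evaluate() >= target:
            return time.perf_counter() - start, step
    return None


class ToyRun:
    """Hypothetical stand-in for a training run; the score rises 0.01 per step."""

    def __init__(self) -> None:
        self.score = 0.5

    def step(self) -> None:
        self.score += 0.01

    def evaluate(self) -> float:
        return self.score


run = ToyRun()
result = time_to_target(run.step, run.evaluate, target=0.795, eval_every=5)
if result is not None:
    elapsed, steps = result
    print(f"reached target after {steps} steps in {elapsed:.6f} s")
```

Fixing the target and varying only the script makes entries comparable on a single number, the elapsed time, which is what allows a leaderboard to rank them.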

If this is right

  • New ideas for architectures, priors, or optimization can be tested by submitting modified scripts and measuring time to target.
  • Successful modifications can be combined over successive leaderboard updates.
  • Pretraining experiments become feasible on modest hardware, shortening research cycles.
  • The community gains a shared, auditable record of efficiency gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The format could be adapted to other foundation-model domains where pretraining cost is the main bottleneck.
  • The winning entry's use of far less data points to data efficiency as a major route to speedups.
  • Widespread adoption might shift emphasis from scaling compute to measuring and improving training efficiency.

Load-bearing premise

That reaching the fixed ROC AUC target on subsampled TabArena reliably indicates overall pretraining quality regardless of how the training script is altered.
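For readers unfamiliar with the metric behind the target, ROC AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by pairwise comparison; averaging per-task scores in `meets_target` is an assumption about how a benchmark-level score could be formed, not a description of the paper's aggregation.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary ROC AUC via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def meets_target(per_task_auc: Sequence[float], target: float) -> bool:
    """Assumed aggregation: the mean AUC over tasks must reach the target."""
    return sum(per_task_auc) / len(per_task_auc) >= target


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The quadratic loop is chosen for clarity; a rank-based implementation gives the same value faster on large test sets.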

What would settle it

A script that hits the target faster yet produces models with lower performance on a wider collection of tabular tasks or real datasets outside the speedrun benchmark.
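Such a check could be expressed as a simple comparison between benchmark and held-out scores. The function below is a hypothetical sketch: `overfits_benchmark`, the tolerance parameter, and all numbers in the usage example are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Sequence


def overfits_benchmark(
    record_benchmark_auc: Sequence[float],
    record_heldout_auc: Sequence[float],
    baseline_heldout_auc: Sequence[float],
    target: float,
    tolerance: float = 0.0,
) -> bool:
    """Flag a record that hits the benchmark target but trails the baseline
    on held-out tasks by more than `tolerance` in mean ROC AUC."""
    hits_target = mean(record_benchmark_auc) >= target
    trails_baseline = (
        mean(record_heldout_auc) < mean(baseline_heldout_auc) - tolerance
    )
    return hits_target and trails_baseline


# Illustrative, made-up scores only.
flagged = overfits_benchmark(
    record_benchmark_auc=[0.80, 0.82],
    record_heldout_auc=[0.70, 0.72],
    baseline_heldout_auc=[0.75, 0.77],
    target=0.80,
)
print(flagged)  # True
```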

Figures

Figures reproduced from arXiv: 2606.03681 by Alexander Pfefferle, Frank Hutter, Salih Bora Ozturk.

Figure 1: Two parallel speedrun lineages. The language side (top) is established. modded-nanoTabPFN (red) is the contribution of this paper.

Figure 2: Records of …

Figure 3: High-level architecture of the baseline. Input features x are encoded with a feature encoder, target labels y with a target encoder and extended with the train-row mean to match the data shape. The two streams are merged and passed through a stack of 6 transformer blocks, each applying feature-axis attention, norm, sample-axis attention, norm, MLP, norm. The test-row outputs are sliced out and fed into the…

Figure 4: High-level architecture of the current best record. Changes from the baseline are marked in red. Input features are grouped at the feature encoder, 24 learnable thinking rows are appended along the data axis, the residual stream entering each block is decayed by 0.95^i, the norms move ahead of attention and MLP, and the stack is reduced to 5 blocks. The mean over feature embeddings at the test rows is fed …

Figure 5: Long-run training trajectory of modded-nanoTabPFN. The x-axis is on a log scale and starts at 2,560 synthetic datasets for visual clarity. Markers indicate the best and last values reached. Dotted lines show the Random Forest target where the speedrun normally stops, and TabPFN v2.5 with no preprocessing as an upper reference.

Figure 6: Row attention map of the trained model on the TabArena task jm1, subsampled to 100 datapoints for visual clarity, with all encoder layers and heads shown. The first 16 positions on each axis are the prepended thinking rows. In each panel, rows are queries and columns are keys. Vertical bands at the thinking-token columns show data rows placing attention on the thinking tokens, with layer 1 head 3 in partic…
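The captions for Figures 3 and 4 describe a change in norm placement (after each sublayer in the baseline, ahead of it in the record) plus a decay on the residual stream. The toy Python sketch below contrasts the two orderings on a plain list of floats. It is one possible reading of the captions, not the authors' code: the use of an RMS-style norm, the interpretation of i as the block index, and the stand-in sublayer are all assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale x to roughly unit root-mean-square (assumed norm type)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def post_norm_step(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Baseline-style ordering: sublayer, residual add, then norm."""
    return rms_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_step(
    x: Vector, sublayer: Callable[[Vector], Vector], decay: float = 1.0
) -> Vector:
    """Record-style ordering: decay the residual, norm ahead of the sublayer,
    then residual add. The decay placement is an assumption."""
    x = [decay * v for v in x]
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]


def toy_sublayer(v: Vector) -> Vector:
    """Hypothetical stand-in for attention or an MLP."""
    return [0.1 * a for a in v]


x = [1.0, -2.0, 3.0]
for i in range(5):  # the record uses 5 blocks
    x = pre_norm_step(x, toy_sublayer, decay=0.95 ** i)
print(x)
```

In the pre-norm form the residual path is never normalized, so a decay factor is one way to keep its magnitude from growing with depth; whether that is the authors' motivation is not stated in the captions.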
Original abstract

Pretraining cost is a major bottleneck for research on tabular foundation models, slowing the iteration cycle for new architectures, priors, and optimization ideas. Yet the community lacks a simple way to compare and accumulate pretraining speedups. We introduce a community speedrun for nanoTabPFN: contributors modify a single-file training script and compete to reach a fixed downstream ROC AUC target on subsampled TabArena using one NVIDIA L40S GPU. The current best record reaches the target in 0.92 minutes, an 81x speedup over the 74.32 minute baseline while using 22x fewer synthetic datasets. The speedrun format provides a simple protocol for the community to add, verify, and stack pretraining improvements, with the leaderboard open to contributions. Code and records are available at https://github.com/borawhocodess/modded-nanotabpfn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a community 'speedrun' benchmark for pretraining nanoTabPFN tabular foundation models. Participants modify a single-file training script to reach a fixed downstream ROC AUC target on a subsampled TabArena dataset using one NVIDIA L40S GPU. The current record achieves the target in 0.92 minutes (81x speedup over the 74.32-minute baseline) while using 22x fewer synthetic datasets. The format is intended to enable the community to add, verify, and stack pretraining improvements via an open leaderboard, with code available at the provided GitHub link.

Significance. If the evaluation protocol proves robust, the speedrun format could meaningfully lower iteration costs for tabular foundation model research by providing a simple, low-resource, community-verifiable benchmark that accumulates incremental gains. The open code and explicit empirical record (wall-clock time and dataset count) are strengths that support reproducibility and stacking of improvements.

major comments (1)
  1. [Abstract] The headline claim of an 81x speedup and a useful community protocol rests on the assumption that a fixed ROC AUC target on the subsampled TabArena remains a stable, comparable proxy for pretraining quality under arbitrary modifications to data generation, optimization, architecture, or regularization. No analysis, ablation, or verification is provided that this specific target and subsample do not admit exploits of idiosyncrasies (e.g., the exact threshold, GPU timing, or distribution shift) without producing generally better foundation models; this is load-bearing for the central empirical contribution.
minor comments (1)
  1. The manuscript would benefit from an explicit section detailing the precise measurement protocol, baseline implementation details, dataset subsampling procedure, and any exclusion rules for the speedrun to allow independent reproduction and extension.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of validating the evaluation protocol. The concern about the fixed ROC AUC target serving as a robust proxy is substantive and directly relevant to the benchmark's long-term value. We address it point-by-point below and commit to revisions.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of an 81x speedup and a useful community protocol rests on the assumption that a fixed ROC AUC target on the subsampled TabArena remains a stable, comparable proxy for pretraining quality under arbitrary modifications to data generation, optimization, architecture, or regularization. No analysis, ablation, or verification is provided that this specific target and subsample do not admit exploits of idiosyncrasies (e.g., the exact threshold, GPU timing, or distribution shift) without producing generally better foundation models; this is load-bearing for the central empirical contribution.

    Authors: We agree that the stability of the chosen target and subsample is load-bearing and that the manuscript provides no explicit ablations or verification against potential exploits. The target was selected in preliminary runs to be reachable by the baseline yet require non-trivial improvements; the subsample size was chosen for computational feasibility on a single L40S. However, we did not test sensitivity to the precise AUC threshold, timing variance across GPU runs, or correlation with performance on held-out datasets or shifted distributions. In revision we will add a dedicated subsection under 'Evaluation Protocol' that (1) reports the exact target selection procedure, (2) includes a small set of sanity checks (re-evaluating the current record on two additional TabArena splits and on a different downstream metric), and (3) explicitly discusses known limitations and the role of the open leaderboard in surfacing future exploits. We view these additions as necessary to support the central claim.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark with no derivation chain

full rationale

The paper introduces a community speedrun protocol for tabular foundation model pretraining and reports an empirical wall-clock record (0.92 min vs baseline). No mathematical derivations, equations, parameter fittings, predictions, or self-citation chains are present in the provided text. The contribution is a benchmark setup and leaderboard, not a claimed derivation that reduces to its inputs. This is self-contained against external benchmarks by design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an empirical benchmark protocol rather than a mathematical model or derivation; no free parameters, axioms, or invented entities are identifiable from the abstract.


