pith. machine review for the scientific record.

arxiv: 2603.03805 · v4 · submitted 2026-03-04 · 💻 cs.LG · cs.AI · cs.DB

Recognition: unknown

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DB
keywords relational databases · in-context learning · synthetic pre-training · foundation models · structural causal models · few-shot learning · relational prediction

The pith

RDB-PFN learns relational in-context adaptation by pre-training a transformer solely on millions of synthetic databases generated from structural causal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Relational databases hold most business data yet have no foundation models like those for text or images, mainly because real RDBs are private and structurally varied. This paper introduces RDB-PFN as the first model trained entirely on synthetic relational data. A Relational Prior Generator draws infinite streams of single-table and multi-table databases from structural causal models, allowing pre-training on more than two million tasks. Once trained, the lightweight model adapts to any unseen real relational database through in-context learning and delivers strong few-shot performance across 19 real-world prediction benchmarks, beating graph and single-table baselines on the same linearized inputs.

Core claim

The central claim is that pre-training a transformer on over two million synthetic relational tasks produced by a Relational Prior Generator from Structural Causal Models equips the model with genuine in-context learning for relational prediction, so that it can be applied to any new real-world database instantly and still outperform graph-based and single-table baselines on 19 held-out relational tasks while remaining lightweight and fast at inference.

What carries the argument

The Relational Prior Generator, which creates diverse synthetic single-table and relational databases from Structural Causal Models to supply the scale and structural variety needed for pre-training.
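The abstract does not specify the generator's internals. As a hedged illustration only, one minimal version samples each table's columns from a random linear SCM and couples tables through sampled foreign keys; every name, size, and functional form below is an assumption, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm_table(n_rows, n_cols, rng):
    """Sample one table from a random linear SCM: each column is a noisy
    linear function of randomly chosen earlier columns (its parents)."""
    cols = []
    for j in range(n_cols):
        noise = rng.normal(size=n_rows)
        if j == 0 or rng.random() < 0.3:
            col = noise  # root node: pure exogenous noise
        else:
            parents = rng.choice(j, size=rng.integers(1, j + 1), replace=False)
            weights = rng.normal(size=len(parents))
            col = sum(w * cols[p] for w, p in zip(weights, parents)) + noise
        cols.append(col)
    return np.column_stack(cols)

def sample_relational_db(rng):
    """Sample a tiny two-table database: a parent table plus a child table
    whose foreign keys point into the parent, coupling the two SCMs."""
    parent = sample_scm_table(n_rows=20, n_cols=3, rng=rng)
    n_child = 50
    fk = rng.integers(0, len(parent), size=n_child)  # sampled foreign keys
    child_own = sample_scm_table(n_rows=n_child, n_cols=2, rng=rng)
    # cross-table SCM edge: a child feature depends on its joined parent row
    child = np.column_stack([child_own, parent[fk, 0] + rng.normal(size=n_child)])
    # prediction target: linear in both child-local and join-propagated features
    y = child[:, -1] + 0.5 * child_own[:, 0] + rng.normal(scale=0.1, size=n_child)
    return {"parent": parent, "child": child, "fk": fk, "y": y}

db = sample_relational_db(rng)
```

Drawing millions of such databases with resampled schemas, widths, and link structures is what would give the pre-training stream its structural variety.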

Load-bearing premise

Synthetic relational databases generated from structural causal models capture enough of the join patterns, heterogeneity, and statistical properties of real-world RDBs for the trained model to generalize.

What would settle it

Run RDB-PFN on a fresh collection of real-world relational tasks whose join structures and distributions differ markedly from those produced by the Relational Prior Generator; if performance drops below the graph and single-table baselines, the generalization claim is falsified.
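That falsification test reduces to a simple comparison loop. A sketch, with all model and task interfaces hypothetical (each model is any callable mapping context examples and query rows to predictions):

```python
def evaluate(model, tasks, metric):
    """Mean metric of a model's few-shot predictions over a task suite."""
    scores = []
    for task in tasks:
        preds = model(task["context"], task["queries"])
        scores.append(metric(task["labels"], preds))
    return sum(scores) / len(scores)

def claim_falsified(rdb_pfn, baselines, ood_tasks, metric):
    """The generalization claim fails if RDB-PFN's mean score on the
    out-of-distribution task suite drops below any baseline's."""
    target = evaluate(rdb_pfn, ood_tasks, metric)
    return any(evaluate(b, ood_tasks, metric) > target for b in baselines)
```

The substantive work is in assembling `ood_tasks` whose join structures genuinely differ from the prior's support; the loop itself is trivial.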

read the original abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce $\textbf{RDB-PFN}$, the first relational foundation model trained purely via $\textbf{synthetic data}$. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a $\textbf{Relational Prior Generator}$ to create an infinite stream of diverse RDBs from scratch. Pre-training on $\textbf{over 2 million}$ synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine $\textbf{in-context learning}$. Experiments verify RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RDB-PFN, the first relational foundation model trained purely via synthetic data. It uses a Relational Prior Generator based on Structural Causal Models to create an infinite stream of diverse RDBs, pre-training on over 2 million synthetic single-table and relational tasks. The model is claimed to enable genuine in-context learning for instant adaptation to new databases, achieving strong few-shot performance on 19 real-world relational prediction tasks while outperforming graph-based and single-table foundation-model baselines (on the same DFS-linearized inputs) with a lightweight architecture and fast inference. Code is released at the provided GitHub link.

Significance. If the central claims hold, this would represent a meaningful advance in relational data modeling by showing that synthetic pre-training from SCM-based priors can address data scarcity and structural heterogeneity in RDBs, extending the PFN paradigm beyond single tables. The lightweight design and fast inference, combined with code availability for reproducibility, would position RDB-PFN as a practical foundation model for relational tasks where real data is private or limited.

major comments (2)
  1. [Abstract] The claim that RDB-PFN 'achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines' is presented without any details on the specific tasks, baseline implementations, performance metrics, statistical tests, or ablation studies. This absence makes it impossible to verify the empirical support for the headline result.
  2. [Abstract] The central generalization argument rests on the Relational Prior Generator producing synthetic RDBs that capture real-world structural heterogeneity, join patterns, and statistical properties, yet no quantitative distributional comparisons (such as table-count histograms, foreign-key degree distributions, or correlation structures) are provided between the >2M synthetic corpus and the 19 evaluation RDBs. This assumption is load-bearing for the transfer from synthetic pre-training to real tasks.
minor comments (1)
  1. The abstract references 'DFS-linearized inputs' without defining the linearization procedure or citing its origin, which reduces clarity for readers.
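For readers hitting the same gap: DFS linearization plausibly means a depth-first traversal over foreign-key links that flattens a target row and its relational neighborhood into one flat sequence a transformer can consume. A minimal sketch under that assumption (the `db` layout and all names here are hypothetical, not the paper's actual format):

```python
def dfs_linearize(db, table, row_id, visited=None):
    """Flatten one row and its foreign-key neighborhood into a single list
    by depth-first traversal over the schema's foreign-key links,
    skipping rows already visited so cycles terminate."""
    if visited is None:
        visited = set()
    if (table, row_id) in visited:
        return []
    visited.add((table, row_id))
    row = db["rows"][table][row_id]
    out = list(row["features"])  # emit this row's features first
    for child_table, child_id in row.get("links", []):
        out.extend(dfs_linearize(db, child_table, child_id, visited))
    return out
```

Whatever the paper's exact procedure, the referee's point stands: the abstract should cite or define it.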

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The claim that RDB-PFN 'achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines' is presented without any details on the specific tasks, baseline implementations, performance metrics, statistical tests, or ablation studies. This absence makes it impossible to verify the empirical support for the headline result.

    Authors: We appreciate the referee's concern about the level of detail in the abstract. The manuscript provides comprehensive details on the 19 real-world tasks, baseline implementations (including graph-based models and single-table foundation models using identical DFS-linearized inputs), performance metrics, statistical tests, and ablation studies in the dedicated Experiments section. To make the abstract more informative, we have revised it to briefly reference these aspects, such as the evaluation on standard relational benchmarks and the use of metrics like AUC. This change has been implemented in the revised version. revision: yes

  2. Referee: [Abstract] The central generalization argument rests on the Relational Prior Generator producing synthetic RDBs that capture real-world structural heterogeneity, join patterns, and statistical properties, yet no quantitative distributional comparisons (such as table-count histograms, foreign-key degree distributions, or correlation structures) are provided between the >2M synthetic corpus and the 19 evaluation RDBs. This assumption is load-bearing for the transfer from synthetic pre-training to real tasks.

    Authors: We acknowledge that explicit quantitative comparisons would strengthen the presentation of the generalization argument. The Relational Prior Generator is designed based on Structural Causal Models to generate diverse RDB structures that encompass a wide range of heterogeneity, join patterns, and statistical properties, as described in the method section. The empirical success on real tasks serves as validation. In the revised manuscript, we have added quantitative distributional comparisons, including table-count histograms, foreign-key degree distributions, and correlation structure analyses between the synthetic pre-training corpus and the 19 evaluation RDBs. This addition directly addresses the concern. revision: yes
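The comparisons the referee asks for are cheap to compute. For instance, foreign-key in-degree distributions from a synthetic and a real database can be contrasted via total variation distance (a hedged sketch; the paper's revision may use different statistics):

```python
from collections import Counter

def fk_degree_distribution(fks):
    """Empirical distribution of foreign-key in-degrees: the fraction of
    parent rows referenced by exactly k child rows, for each k."""
    counts = Counter(Counter(fks).values())
    total = sum(counts.values())
    return {deg: n / total for deg, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    given as {outcome: probability} dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Reporting such distances between the synthetic prior and each evaluation RDB would make the load-bearing premise auditable rather than asserted.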

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via external synthetic generation

full rationale

The paper's core pipeline generates an infinite stream of synthetic RDBs from scratch using a Relational Prior Generator based on Structural Causal Models, pre-trains RDB-PFN on over 2 million such tasks, and evaluates in-context adaptation on 19 separate real-world tasks. No equations or claims reduce the reported performance to a fit on the evaluation data, a self-referential definition, or a load-bearing self-citation chain; the synthetic corpus is produced independently of the test RDBs, and the architecture is a standard transformer without ansatz smuggling or renaming of known results. The empirical claims therefore rest on external data generation rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on the assumption that structural causal models can produce sufficiently realistic synthetic RDBs; no explicit free parameters or invented entities are described beyond the model architecture itself.

axioms (1)
  • domain assumption: Structural Causal Models can generate diverse and representative relational database schemas and data distributions
    Invoked to justify the Relational Prior Generator creating an infinite stream of training RDBs

pith-pipeline@v0.9.0 · 5515 in / 1249 out tokens · 51793 ms · 2026-05-15T16:54:03.121535+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KumoRFM-2: Scaling Foundation Models for Relational Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    KumoRFM-2 pre-trains on synthetic and real relational data across row, column, foreign-key and cross-sample axes, injects task information early, and achieves up to 8% gains over supervised baselines on 41 benchmarks ...