FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

Bingbing Fang; Junbo Zhao; Lu Chen; Sheng Zhang; Tang Qian; Tianyi Li; Yumeng Song; Yushuai Li; Zhenghang Song; Zhengke Hu

arxiv: 2603.16513 · v3 · pith:R7HS7UTEnew · submitted 2026-03-17 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-21 10:22 UTC · model grok-4.3


keywords structured data foundation models · linear complexity · permutation invariance · state-space models · database benchmarks · efficient attention mechanisms · robust pre-training


The pith

FEAT replaces quadratic attention with linear dual-axis encoding for structured data while preserving permutation invariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Structured data foundation models face a choice between quadratic slowdowns from full attention and order bias from simpler linear replacements that treat tuples as ordered sequences. The paper introduces FEAT to resolve this tension through a multi-layer dual-axis encoding architecture. It combines an adaptive-fusion bidirectional state-space model with convolutional gated linear attention to deliver cross-tuple context in linear time and maintain invariance to tuple reordering. A hybrid structural causal pre-training pipeline with robust reconstruction further targets the heavy-tailed distributions typical of real enterprise databases. If the approach holds, analysts could jointly model far larger collections of records than current methods permit without sacrificing representation quality.

Core claim

FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture integrating an adaptive-fusion bidirectional state-space model with convolutional gated linear attention, enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. It further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective to improve robustness under real-world data skewness, consistently outperforming representative SFMs on zero-shot tasks across 12 real-world database benchmarks and scaling linearly with sample length for up to 50x faster inference.

What carries the argument

The multi-layer dual-axis encoding architecture that fuses an adaptive-fusion bidirectional state-space model (AFBM) and convolutional gated linear attention (Conv-GLA) to perform linear-time cross-tuple contextualization while enforcing permutation invariance.
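
To make the O(N) claim concrete, it may help to look at generic kernelized linear attention. The sketch below is not the paper's Conv-GLA (it has no gating and no convolution); the feature map `elu(x) + 1` and all function names are assumptions chosen for illustration. It shows that summarizing keys and values once, then reusing that summary for every query, gives the same result as the quadratic form, and that the summary is a plain sum over rows, so it does not depend on row order.

```python
import math


def feature_map(vec):
    """Positive feature map elu(x) + 1 (an illustrative choice, not from the paper)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def linear_attention(queries, keys, values):
    """Non-causal linear attention: one pass over N rows builds a shared summary."""
    d_k, d_v = len(keys[0]), len(values[0])
    summary = [[0.0] * d_v for _ in range(d_k)]  # sum over rows of phi(k) v^T
    norm_vec = [0.0] * d_k                       # sum over rows of phi(k)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            norm_vec[a] += fk[a]
            for b in range(d_v):
                summary[a][b] += fk[a] * v[b]
    out = []
    for q in queries:  # each query costs O(d_k * d_v), independent of N
        fq = feature_map(q)
        norm = sum(fq[a] * norm_vec[a] for a in range(d_k))
        out.append([sum(fq[a] * summary[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out


def quadratic_attention(queries, keys, values):
    """Same kernel, computed pairwise in O(N^2) for comparison."""
    d_v = len(values[0])
    mapped_keys = [feature_map(k) for k in keys]
    out = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, fk)) for fk in mapped_keys]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                    for b in range(d_v)])
    return out
```

Reordering the key/value rows leaves each query's output unchanged up to floating-point rounding, which is the kind of order independence a sequential scan does not give for free.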

If this is right

Joint processing of substantially more tuples from enterprise databases becomes computationally feasible.
Representation quality on heterogeneous real-world data improves through the hybrid pre-training objective.
Inference latency drops by up to 50 times relative to prior structured-data foundation models.
Zero-shot performance on database analysis tasks exceeds that of existing quadratic and linear baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-axis pattern could transfer to other unordered data types such as sets or graphs in different application domains.
Linear scaling may enable training directly on full-scale collections rather than on sampled subsets of records.
Similar fusion strategies might combine with additional efficient sequence layers to extend coverage to even broader data regimes.

Load-bearing premise

That the specific combination of adaptive-fusion bidirectional state-space modeling and convolutional gated linear attention truly preserves permutation invariance and avoids introducing any artificial order bias when applied to structured data tuples.

What would settle it

Direct measurement showing that random permutation of input tuple order changes the model's embeddings or downstream task accuracy, or that measured runtime grows faster than linearly with increasing numbers of tuples.
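
Both measurements can be sketched with a small harness. The code below is a minimal illustration, not the paper's evaluation protocol: `encode` stands for any function that maps a list of rows to a pooled embedding, and `mean_pool` and `position_weighted` are hypothetical stand-ins used only to show what a pass and a failure look like. The scaling check fits the slope of log(runtime) against log(N), where a slope near 1 is consistent with linear growth.

```python
import math
import random


def permutation_gap(encode, rows, trials=5, seed=0):
    """Largest absolute change in the embedding when the rows are shuffled."""
    rng = random.Random(seed)
    base = encode(rows)
    worst = 0.0
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        out = encode(shuffled)
        worst = max(worst, max(abs(a - b) for a, b in zip(base, out)))
    return worst


def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) versus log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def mean_pool(rows):
    """Hypothetical order-independent encoder: column-wise mean."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def position_weighted(rows):
    """Hypothetical order-dependent encoder: rows weighted by their position."""
    return [sum((i + 1) * r[j] for i, r in enumerate(rows))
            for j in range(len(rows[0]))]
```

In practice `encode` would wrap the model under test, and the runtimes passed to `scaling_exponent` would come from timed inference at several tuple counts.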

Figures

Figures reproduced from arXiv: 2603.16513 by Bingbing Fang, Junbo Zhao, Lu Chen, Sheng Zhang, Tang Qian, Tianyi Li, Yumeng Song, Yushuai Li, Zhenghang Song, Zhengke Hu.

**Figure 1.** The overview of FEAT. Sample-axis modeling. The feature-axis modeling stage models intra-sample feature dependencies by applying self-attention across the D feature columns of each sample, efficiently capturing local semantic correlations. The sample-axis modeling stage then models inter-sample dependencies along the sample dimension using a linear-complexity modeling mechanism, enabling efficient long-ran…

**Figure 2.** Pre-training methods of FEAT. M latent prototypes $P \in \mathbb{R}^{M \times D_{\text{root}}}$ representing underlying population clusters. For each sample $i \in \{1, \dots, N\}$, we draw an individual mixture weight vector from a Dirichlet distribution, $w_i \sim \mathrm{Dir}(\alpha)$. The root features are then synthesized as $x_{i,\text{root}} = w_i^\top P \in \mathbb{R}^{D_{\text{root}}}$ (Eq. 20). This explicitly injects sample-wise clustering structures commonly observed in structured patient/user …
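
The root-feature synthesis quoted in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions: the caption does not say how the prototypes P are drawn, so standard-normal prototypes are assumed here, a symmetric concentration `alpha` is assumed for the Dirichlet, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def synthesize_root_features(n, m, d_root, alpha=0.5, seed=0):
    """Return prototypes P (m x d_root), weights w_i, and features x_i = w_i^T P."""
    rng = random.Random(seed)
    # Assumption: prototypes are standard normal; the caption does not specify this.
    prototypes = [[rng.gauss(0.0, 1.0) for _ in range(d_root)] for _ in range(m)]
    weights, features = [], []
    for _ in range(n):
        w = sample_dirichlet([alpha] * m, rng)  # w_i ~ Dir(alpha), sums to 1
        x = [sum(w[k] * prototypes[k][j] for k in range(m)) for j in range(d_root)]
        weights.append(w)
        features.append(x)
    return prototypes, weights, features
```

Because each `w_i` lies on the simplex, every synthesized row is a convex combination of the prototypes. A small `alpha` tends to concentrate each weight vector on a few prototypes, which is one way to obtain the clustered structure the caption describes.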

Original abstract

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FEAT claims a linear-complexity dual-axis model with AFBM and Conv-GLA for structured data, but the abstract leaves the invariance mechanism and the experimental controls under-specified.


The main thing to know is that this paper puts forward FEAT as a foundation model that swaps quadratic attention for a multi-layer dual-axis encoder combining an adaptive-fusion bidirectional state-space model with convolutional gated linear attention. The goal is O(N) cross-tuple context while staying permutation-invariant and handling real skewed data through hybrid causal pre-training. They report better zero-shot results than prior SFMs on 12 real database benchmarks plus up to 50x faster inference and linear scaling with sample length.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FEAT, a linear-complexity foundation model for extremely large structured data. It replaces quadratic self-attention with a multi-layer dual-axis encoding architecture integrating an adaptive-fusion bidirectional state-space model (AFBM) and convolutional gated linear attention (Conv-GLA) to achieve O(N) cross-tuple contextualization while supporting permutation-invariant representation learning. A hybrid structural causal pre-training pipeline with robust reconstruction objective is used to handle real-world data skewness. Experiments on 12 real-world database benchmarks reportedly show consistent outperformance over representative SFMs on zero-shot tasks, linear scaling with sample length, and up to 50x faster inference.

Significance. If the architectural claims on exact permutation invariance and the empirical results with proper controls hold, the work would represent a meaningful advance in scalable structured data foundation models by mitigating the O(N^2) bottleneck of attention while addressing generalization to heterogeneous real-world distributions. The hybrid pre-training approach for skewness robustness could be a useful contribution if validated through ablations.

major comments (2)

[Abstract / Architecture description] Abstract and architecture description: the central claim that the AFBM + Conv-GLA integration in the dual-axis encoding delivers exact permutation invariance for unordered tuples lacks an explicit symmetrization mechanism (e.g., order-agnostic pooling or averaging over permutations). Bidirectional SSMs are sequential and convolutional operators impose local ordering; without a concrete construction that cancels these dependencies while preserving O(N) scaling, residual order bias remains a risk to the invariance assertion.
[Experiments] Experimental evaluation: the reported consistent outperformance on 12 benchmarks and linear scaling are presented without details on data splits, error bars, statistical significance, ablation studies isolating AFBM vs. Conv-GLA contributions, or controls for baseline implementations. This absence leaves the empirical support for the central claims unverifiable from the given text.
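
The first major comment can be illustrated with a toy scalar recurrence. The sketch below is not AFBM: it is a hypothetical bidirectional scan with a fixed decay, used only to show that adding a backward pass makes the pooled output symmetric under reversal but not under general permutations, whereas an order-agnostic pool is unaffected by either.

```python
def bidirectional_scan(xs, decay=0.5):
    """Forward plus backward linear recurrence over a sequence of scalars."""
    n = len(xs)
    fwd, bwd = [0.0] * n, [0.0] * n
    state = 0.0
    for i in range(n):                # left-to-right pass
        state = decay * state + xs[i]
        fwd[i] = state
    state = 0.0
    for i in reversed(range(n)):      # right-to-left pass
        state = decay * state + xs[i]
        bwd[i] = state
    return [f + b for f, b in zip(fwd, bwd)]


def pooled_scan(xs, decay=0.5):
    """Sum-pool the bidirectional scan outputs into a single value."""
    return sum(bidirectional_scan(xs, decay))


def symmetric_pool(xs):
    """Order-agnostic baseline: plain mean of the inputs."""
    return sum(xs) / len(xs)


xs = [1.0, 2.0, 4.0, 8.0]
reversed_xs = xs[::-1]
swapped_xs = [2.0, 1.0, 4.0, 8.0]     # first two rows exchanged

print(pooled_scan(xs), pooled_scan(reversed_xs), pooled_scan(swapped_xs))
print(symmetric_pool(xs), symmetric_pool(swapped_xs))
```

In this toy, each input's contribution to the pooled value depends on how far it sits from the sequence ends, so moving a row between an edge and the interior changes the result even after pooling. Whether the paper's adaptive fusion removes this dependence is exactly what the referee asks the authors to show.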

minor comments (2)

[Method] Clarify the precise definition and fusion rule for AFBM and Conv-GLA in the dual-axis layers to allow reproduction.
[Experiments] Add a table or figure summarizing the 12 benchmarks with key statistics (e.g., tuple counts, feature heterogeneity) to contextualize the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We address each of the major comments in detail below and have made revisions to the manuscript to incorporate the suggested improvements where applicable.


Referee: [Abstract / Architecture description] Abstract and architecture description: the central claim that the AFBM + Conv-GLA integration in the dual-axis encoding delivers exact permutation invariance for unordered tuples lacks an explicit symmetrization mechanism (e.g., order-agnostic pooling or averaging over permutations). Bidirectional SSMs are sequential and convolutional operators impose local ordering; without a concrete construction that cancels these dependencies while preserving O(N) scaling, residual order bias remains a risk to the invariance assertion.

Authors: We thank the referee for this comment. To strengthen the presentation of our permutation invariance claim, we will revise the abstract and architecture section to explicitly describe the symmetrization mechanism employed. Specifically, we will detail how the dual-axis encoding uses order-independent operations, including symmetric fusion in AFBM and translation-equivariant but globally pooled Conv-GLA, ensuring no residual order bias. We will also include a formal proof in the supplementary material that the architecture is exactly permutation-invariant while maintaining linear complexity.
revision: yes

Referee: [Experiments] Experimental evaluation: the reported consistent outperformance on 12 benchmarks and linear scaling are presented without details on data splits, error bars, statistical significance, ablation studies isolating AFBM vs. Conv-GLA contributions, or controls for baseline implementations. This absence leaves the empirical support for the central claims unverifiable from the given text.

Authors: We agree that the experimental section requires more rigorous documentation to allow for full verification of the results. In the revised version of the manuscript, we will expand Section 4 to include: detailed descriptions of the data splits used for each of the 12 real-world benchmarks; error bars computed as standard deviations over 5 independent runs with different random seeds; results of statistical significance tests (e.g., Wilcoxon signed-rank tests) against the baseline models; comprehensive ablation studies that isolate the impact of the AFBM component versus the Conv-GLA component; and explicit controls and implementation details for all baseline SFMs to ensure fair comparison. These additions will significantly enhance the reproducibility and credibility of our empirical findings.
revision: yes
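
The reporting promised in this response could look roughly like the following. It is a hedged sketch, not the authors' analysis code: the helper names are hypothetical, the p-value is the exact two-sided Wilcoxon signed-rank value obtained by enumerating sign assignments (feasible for 12 benchmarks, where there are 4096 of them), and any scores passed in would be the per-benchmark results of the two models being compared.

```python
import statistics
from itertools import product


def mean_and_std(runs):
    """Mean and sample standard deviation over independent seeds."""
    return statistics.mean(runs), statistics.stdev(runs)


def wilcoxon_signed_rank_exact(scores_a, scores_b):
    """Return (W+, exact two-sided p-value) for paired scores; zero differences are dropped."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                      # assign average ranks to tied |differences|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2           # expected W+ under the null hypothesis
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((0, 1), repeat=n):   # every way to assign signs to ranks
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n
```

With only a handful of paired benchmarks the smallest attainable p-value is bounded below by 2 / 2^n, which is one reason the number of benchmarks matters for the significance claims.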

Circularity Check

0 steps flagged

No circularity detected in architecture proposal


The paper proposes FEAT as a novel multi-layer dual-axis encoding architecture that integrates AFBM and Conv-GLA to achieve O(N) cross-tuple contextualization while claiming permutation invariance. No equations, derivations, or self-citations are exhibited in the abstract or described claims that reduce the performance assertions or invariance property to fitted parameters, prior self-referential results, or inputs by construction. The central contribution is presented as an original construction with external benchmark validation on 12 real-world databases, making the derivation chain self-contained rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the unproven assumption that the new components maintain key properties of structured data without external verification or proofs in the provided abstract.

axioms (1)

domain assumption: The proposed AFBM and Conv-GLA integration supports permutation-invariant representation learning for structured data.
Invoked directly in the abstract as an enabling property of the architecture.

invented entities (2)

AFBM (adaptive-fusion bidirectional state-space model) · no independent evidence
purpose: To enable linear-time cross-tuple contextualization while preserving invariance.
New component introduced in the paper to replace quadratic attention.

Conv-GLA (convolutional gated linear attention) · no independent evidence
purpose: To provide gated linear attention with convolutional elements for structured data.
New component introduced alongside AFBM.



Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

[1] Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342.

[2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

[4] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[5] Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, et al. Limix: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025a. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polos...

[6] Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. arXiv preprint arXiv:2502.05564.

[7] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. arXiv preprint arXiv:2410.18164.

[8] Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, et al. TabPFN-2.5: Advancing the state of the art in tabular foundation models. arXiv preprint arXiv:2511.08667.

[9] Felix Koch, Marcel Wever, Fabian Raisch, and Benjamin Tischler. State-space models for tabular prior-data fitted networks. arXiv preprint arXiv:2510.14573.

[10] Xiyuan Zhang, Danielle C Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W Mahoney, et al. Mitra: Mixed synthetic priors for enhancing tabular foundation models. arXiv preprint arXiv:2510.21204, 2025b. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea G...

[11] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

[12] Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, and Frank Hutter. Real-tabpfn: Improving tabular foundation models via continued pre-training with real-world data. arXiv preprint arXiv:2507.03971.

[13] Nick Erickson, Jonas Mueller, Alexander Shirkov, Alexander Smola, and Hang Wang. Autogluon-tabular: Robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505.
