MOCA: A Transformer-based Modular Causal Inference Framework with One-way Cross-attention and Cutting Feedback

Debashis Ghosh; Lei Wang

arXiv: 2604.23107 · v1 · submitted 2026-04-25 · 📊 stat.ML · cs.LG · stat.ME

Pith reviewed 2026-05-08 07:24 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.ME

keywords: causal inference · transformer · one-way attention · modular design · confounder adjustment · treatment effect estimation · gradient detachment

The pith

MOCA uses a modular transformer with one-way cross-attention and gradient detachment to keep outcome information from leaking into treatment representations during causal effect estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MOCA as a transformer-based method for estimating causal effects from observational data. It separates treatment modeling from outcome modeling in a modular architecture and applies one-way cross-attention to adjust for confounders. A cutting-feedback step detaches gradients so that the outcome loss cannot update the treatment module. This setup aims to maintain causal directionality while using the flexibility of transformers for complex data. The authors test it on simulated scenarios (linear, nonlinear, heavy-tailed, hidden-confounding, and high-dimensional) and on two real-world datasets, where it is reported to match or exceed standard estimators such as IPW, AIPW, X-learner, TARNet, and DragonNet.

Core claim

MOCA separates treatment and outcome modeling through a modular transformer design and performs confounder adjustment with one-way cross-attention. Gradient detachment implements a cutting-feedback strategy that stops the outcome loss from updating the treatment module. This preserves directional information flow while retaining transformer representational power. Across simulated scenarios that include linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings, and on the Infant Health and Development Program and Dehejia-Wahba datasets, MOCA produces competitive or improved estimates relative to IPW, AIPW, X-learner, TARNet, and DragonNet.

What carries the argument

Modular one-way cross-attention combined with gradient-detachment cutting feedback, which enforces one-directional information flow from the treatment module to the outcome module without back-propagation of outcome signals.
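To make the cutting-feedback idea concrete, here is a minimal sketch on a toy scalar model with hand-derived gradients. Everything in it is an assumption for illustration: the function `toy_gradients`, the parameters `a` and `c`, and the two losses are hypothetical and are not the paper's architecture. In an autodiff framework the same effect is usually obtained with a stop-gradient call such as `torch.Tensor.detach` or `jax.lax.stop_gradient`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_gradients(a, c, x, t, y, cut_feedback):
    """Per-sample gradients for a toy two-module model (illustrative only).

    Treatment module: e = sigmoid(a * x), loss_T = binary cross-entropy(t, e).
    Outcome module:   y_hat = c * e,      loss_Y = 0.5 * (y_hat - y) ** 2.
    """
    e = sigmoid(a * x)            # treatment-side output passed to the outcome module
    resid = c * e - y             # outcome residual
    grad_a = (e - t) * x          # d loss_T / d a
    grad_c = resid * e            # d loss_Y / d c
    if not cut_feedback:
        # Joint training: loss_Y also reaches `a` through e (chain rule).
        grad_a += resid * c * e * (1.0 - e) * x
    return grad_a, grad_c


# Same inputs, two different outcomes y.
cut_lo = toy_gradients(0.3, 1.5, 2.0, 1, y=0.0, cut_feedback=True)
cut_hi = toy_gradients(0.3, 1.5, 2.0, 1, y=5.0, cut_feedback=True)
joint_lo = toy_gradients(0.3, 1.5, 2.0, 1, y=0.0, cut_feedback=False)
joint_hi = toy_gradients(0.3, 1.5, 2.0, 1, y=5.0, cut_feedback=False)

print("cut:   treatment gradient depends on y?", cut_lo[0] != cut_hi[0])
print("joint: treatment gradient depends on y?", joint_lo[0] != joint_hi[0])
```

With the cut in place, the update to the treatment parameter `a` is computed from the treatment loss alone, so changing the outcome `y` cannot change it. The outcome parameter `c` is still trained on the outcome loss in both cases.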

If this is right

Treatment propensity estimates remain stable under nonlinear and high-dimensional confounding because outcome signals cannot update the treatment module.
Average treatment effect errors stay low relative to joint-training baselines across linear, heavy-tailed, and hidden-confounding regimes.
Real-world performance on datasets such as IHDP and Dehejia-Wahba matches or exceeds IPW, AIPW, X-learner, TARNet, and DragonNet.
The modular separation allows independent inspection and refinement of the treatment and outcome components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar one-way attention blocks could be inserted into other deep representation learners to enforce causal directionality without redesigning the entire architecture.
The cutting-feedback idea might generalize to time-varying or multi-treatment settings where strict separation of information flows is required.
If the detachment works reliably, hybrid models could combine the transformer modules with classical propensity-score weighting for further robustness checks.

Load-bearing premise

The one-way cross-attention and gradient detachment together prevent outcome-related information from influencing the treatment-side representations even when the treatment assignment and outcome mechanisms are complex and nonlinear.

What would settle it

Train MOCA on simulated data where the outcome is constructed to exert a direct nonlinear influence on treatment assignment; inspect the learned treatment representations for any measurable dependence on outcome noise after training.
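A rough sketch of what such a check could look like is given below. It does not train MOCA: the two "representations" are hypothetical stand-ins (one built from covariates only, one deliberately contaminated with outcome noise), and `pearson` and `simulate` are names introduced here for illustration. In a real test the representations would be extracted from the trained treatment module. Pearson correlation only detects linear dependence, so a nonlinear measure such as a mutual-information estimate would be needed for a full check.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def simulate(n=5000, seed=0):
    """Return outcome noise plus two stand-in treatment representations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]      # observed covariate
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]    # outcome noise
    clean_rep = [math.tanh(xi) for xi in x]          # depends on covariates only
    leaky_rep = [math.tanh(xi) + 0.5 * ei            # contaminated by outcome noise
                 for xi, ei in zip(x, eps)]
    return eps, clean_rep, leaky_rep


eps, clean_rep, leaky_rep = simulate()
r_clean = pearson(clean_rep, eps)
r_leaky = pearson(leaky_rep, eps)
print(f"clean representation vs outcome noise: r = {r_clean:.3f}")
print(f"leaky representation vs outcome noise: r = {r_leaky:.3f}")
```

A representation that is free of outcome information should show a correlation near zero, while a leaky one should not. The threshold for "near zero" depends on sample size and would need a permutation test or similar in practice.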

Figures

Figures reproduced from arXiv: 2604.23107 by Debashis Ghosh, Lei Wang.

**Figure 1.** Illustration of the MOCA framework with one-way attention.

**Figure 2.** Information flow for MOCA with treatment and outcome modules.

**Figure 3.** ATE bias boxplot for testing sample size 100.

**Figure 4.** ATE RMSE boxplot for testing sample size 100.
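The captions for Figures 3 and 4 refer to ATE bias and ATE RMSE. This page does not give the paper's exact definitions, so the sketch below assumes the usual ones over a set of replications with a known true effect. The function names `ate_bias` and `ate_rmse` are hypothetical.

```python
import math


def ate_bias(estimates, true_ate):
    """Mean of the ATE estimates minus the true ATE."""
    return sum(estimates) / len(estimates) - true_ate


def ate_rmse(estimates, true_ate):
    """Root mean squared error of the ATE estimates around the true ATE."""
    mse = sum((est - true_ate) ** 2 for est in estimates) / len(estimates)
    return math.sqrt(mse)


# Placeholder estimates from two hypothetical replications, true ATE = 2.
estimates = [1.0, 3.0]
print(ate_bias(estimates, 2.0), ate_rmse(estimates, 2.0))
```

An estimator can have zero bias and still have a large RMSE, which is why the two quantities are typically reported side by side.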

Original abstract

Causal effect estimation from observational data requires careful adjustment for confounding. Classical estimators such as inverse probability weighting and augmented inverse probability weighting are effective under favorable model specification, but may become unstable when treatment assignment and outcome mechanisms are complex, non-linear, and high-dimensional. Machine learning and representation learning approaches improve flexibility, yet joint training can allow outcome-related information to influence treatment-side representations, which is undesirable from a causal perspective. We propose MOCA (Modular One-way Causal Attention), a transformer-based framework that separates treatment and outcome modeling through a modular design, and performs confounder adjustment using a one-way attention mechanism. A cutting-feedback strategy, implemented via gradient detachment, prevents the outcome loss from updating the treatment module. This design preserves directional information flow while retaining the representational power of transformer architectures for causal inference. Across multiple simulated scenarios, including linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings, MOCA shows competitive or improved performance relative to IPW, AIPW, X-learner, TARNet, and DragonNet. We further illustrate the method on the Infant Health and Development Program dataset and the Dehejia-Wahba dataset as real-world benchmarks. These results suggest that modular attention with one-way information flow provides a promising and interpretable direction for causal inference with modern deep learning models.
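For readers less familiar with the classical baselines named in the abstract, the following sketch computes IPW and AIPW estimates of the ATE on simulated data. It is not taken from the paper. The data-generating process, the true ATE of 2, and the use of oracle propensity scores and oracle outcome models (`m1`, `m0`) are all assumptions made to keep the example short; in practice these nuisance functions are estimated.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=20000, seed=0):
    """Rows of (x, t, y, e) with true ATE = 2 and known propensity e(x)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = sigmoid(0.5 * x)                    # true propensity score
        t = 1 if rng.random() < e else 0        # treatment assignment
        y = 2.0 * t + x + rng.gauss(0.0, 1.0)   # observed outcome
        rows.append((x, t, y, e))
    return rows


def ipw_ate(rows):
    """Inverse probability weighting (Horvitz-Thompson form)."""
    return sum(t * y / e - (1 - t) * y / (1 - e) for _, t, y, e in rows) / len(rows)


def aipw_ate(rows, m1, m0):
    """Augmented IPW: outcome-model contrast plus weighted residual correction."""
    total = 0.0
    for x, t, y, e in rows:
        total += (m1(x) - m0(x)
                  + t * (y - m1(x)) / e
                  - (1 - t) * (y - m0(x)) / (1 - e))
    return total / len(rows)


rows = simulate()
ipw = ipw_ate(rows)
# Oracle outcome models, assumed known here purely for illustration.
aipw = aipw_ate(rows, m1=lambda x: 2.0 + x, m0=lambda x: x)
print(f"IPW estimate:  {ipw:.3f}")
print(f"AIPW estimate: {aipw:.3f}")
```

AIPW stays consistent if either the propensity model or the outcome model is correct, which is the double-robustness property. Both estimators can become unstable when estimated propensities approach 0 or 1.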

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOCA tries to fix leakage in neural causal models with a transformer that uses one-way cross-attention and gradient detachment, but the abstract gives no numbers or checks to show it works.

MOCA is a transformer architecture for causal effect estimation that splits treatment and outcome modeling into modules. It uses one-way cross-attention for confounder adjustment and a cutting-feedback step that detaches gradients so the outcome loss cannot update the treatment side. This is meant to keep treatment representations free of outcome information, which is a real issue in joint-training setups like TARNet and DragonNet. The authors position it against classical estimators that break down under nonlinearity and high dimensions, and they test it on linear, nonlinear, heavy-tailed, hidden-confounding, and high-dimensional simulations plus the IHDP and Dehejia-Wahba datasets. That range of scenarios is a reasonable choice and shows they are thinking about practical observational settings in health and social science data. The modular separation with directional flow is the actual new piece here; prior work did not combine these elements in a transformer this way. The paper does a clean job stating the causal motivation without obvious circularity in its claims.

The soft spots are straightforward. The abstract asserts competitive or better performance but supplies no metrics, intervals, ablations, or implementation details. There is also no diagnostic evidence, such as checks on representation similarity or mutual information, that the one-way attention and gradient cut actually block leakage when mechanisms are jointly nonlinear. Indirect coupling through confounders could still occur, and without those tests the performance gains remain unverified.

This work is aimed at researchers building deep models for causal inference who want architectural options beyond standard representation learners. A reader already familiar with TARNet-style models would see a plausible direction and might borrow the one-way attention idea, but the lack of concrete results limits immediate use.

It deserves a serious referee because the proposal is coherent, cites the relevant baselines, and targets a genuine limitation without reducing to fitted quantities from its own parameters. I would send it to peer review so the authors can supply the missing experiments and diagnostics.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MOCA, a transformer-based modular causal inference framework that separates treatment and outcome modeling via one-way cross-attention for confounder adjustment and a cutting-feedback mechanism (gradient detachment) to block outcome-related information from updating treatment representations. It claims this design yields competitive or improved performance relative to IPW, AIPW, X-learner, TARNet, and DragonNet across simulated settings (linear/nonlinear, heavy-tailed, hidden confounding, high-dimensional) and on the IHDP and Dehejia-Wahba real-world datasets, while preserving the representational power of transformers.

Significance. If the empirical results and the claimed separation hold, the work offers a constructive direction for incorporating attention mechanisms into causal inference while respecting directional information flow and avoiding undesirable leakage from joint training. The modular design addresses a recognized limitation of end-to-end representation learners such as TARNet and DragonNet. Credit is due for framing the architecture explicitly around causal desiderata rather than treating the transformer as a black-box estimator.

major comments (2)

[3 (Method) and 4 (Experiments)] The central performance claim attributes gains to the modular separation achieved by one-way cross-attention and cutting feedback. However, the manuscript provides no diagnostic evidence (e.g., treatment-embedding similarity to outcome residuals, mutual-information estimates, or ablation with/without gradient detachment) confirming that outcome loss does not influence treatment-module outputs under jointly nonlinear and high-dimensional mechanisms. This verification is load-bearing for the causal justification of the reported improvements over TARNet/DragonNet.
[4 (Experiments)] §4, simulated scenarios: while competitive performance is asserted across linear, nonlinear, heavy-tailed, hidden-confounding, and high-dimensional regimes, the text supplies no quantitative metrics, confidence intervals, or statistical significance tests for the differences versus DragonNet or X-learner. Without these, the magnitude and robustness of the claimed gains cannot be assessed.

minor comments (2)

[3.1] Notation for the one-way attention mask and the precise gradient-detachment operator should be formalized with an equation rather than described only in prose.
[1] The abstract and introduction would benefit from a brief statement of the precise causal estimand (e.g., ATE or CATE) targeted by the framework.
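As an illustration of what the two minor comments ask for, one possible formalization is sketched below. It is not the paper's notation. The symbols $H_T$ and $H_Y$ (treatment-side and outcome-side token matrices), $W_Q, W_K, W_V$ (projection matrices), $d$ (key dimension), $\theta_T$ (treatment-module parameters), and the stop-gradient operator $\operatorname{sg}$ are assumptions introduced here. The last implication also assumes that $\theta_T$ enters the outcome loss only through $\operatorname{sg}(H_T)$.

```latex
% Candidate target estimands in potential-outcomes notation
\tau = \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr], \qquad
\tau(x) = \mathbb{E}\bigl[\,Y(1) - Y(0) \mid X = x\,\bigr]

% One-way cross-attention: outcome-side tokens query treatment-side tokens,
% and there is no attention path in the reverse direction
\tilde{H}_Y = \operatorname{softmax}\!\left(
  \frac{(H_Y W_Q)\,\bigl(\operatorname{sg}(H_T)\, W_K\bigr)^{\top}}{\sqrt{d}}
\right) \operatorname{sg}(H_T)\, W_V

% Cutting feedback: sg is the identity in the forward pass
% and has zero Jacobian in the backward pass
\operatorname{sg}(h) = h, \qquad
\frac{\partial\, \operatorname{sg}(h)}{\partial h} = 0
\quad\Longrightarrow\quad
\nabla_{\theta_T}\bigl(\mathcal{L}_T + \mathcal{L}_Y\bigr) = \nabla_{\theta_T}\,\mathcal{L}_T
```

An equivalent way to express the one-way constraint in a single joint attention layer would be a block mask that forbids treatment-side queries from attending to outcome-side keys. Which of the two forms the paper uses is not stated on this page.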

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make.

Referee: [3 (Method) and 4 (Experiments)] The central performance claim attributes gains to the modular separation achieved by one-way cross-attention and cutting feedback. However, the manuscript provides no diagnostic evidence (e.g., treatment-embedding similarity to outcome residuals, mutual-information estimates, or ablation with/without gradient detachment) confirming that outcome loss does not influence treatment-module outputs under jointly nonlinear and high-dimensional mechanisms. This verification is load-bearing for the causal justification of the reported improvements over TARNet/DragonNet.

Authors: We agree that explicit diagnostic evidence would strengthen the causal interpretation of our results. While the one-way cross-attention and gradient detachment are designed to enforce separation, we acknowledge the absence of direct verification in the current manuscript. In the revised version, we will add ablation studies comparing performance with and without the cutting-feedback mechanism across the simulated scenarios. Additionally, we will include analyses such as cosine similarity between treatment embeddings and outcome residuals, as well as mutual information estimates where feasible, to demonstrate that outcome information does not leak into the treatment module.

Revision: yes

Referee: [4 (Experiments)] §4, simulated scenarios: while competitive performance is asserted across linear, nonlinear, heavy-tailed, hidden-confounding, and high-dimensional regimes, the text supplies no quantitative metrics, confidence intervals, or statistical significance tests for the differences versus DragonNet or X-learner. Without these, the magnitude and robustness of the claimed gains cannot be assessed.

Authors: We appreciate this observation. The current manuscript reports performance in tables but does not include confidence intervals or formal statistical tests. To address this, we will augment the experimental section with standard deviations from multiple random seeds, 95% confidence intervals, and paired t-tests or Wilcoxon tests to evaluate the significance of differences between MOCA and the baselines (particularly DragonNet and X-learner) in each simulated setting.

Revision: yes
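The kind of paired comparison promised in this response could be computed along the following lines. This is a sketch only: `paired_summary` is a hypothetical helper, the error values are placeholders rather than results from the paper, and the critical value is passed in by hand because the standard library has no t-distribution quantile function (3.182 is the two-sided 95% value for 3 degrees of freedom, and 2.262 is the value for 9).

```python
import math
import statistics


def paired_summary(errors_a, errors_b, t_crit):
    """Paired t statistic and confidence interval for mean(errors_a - errors_b).

    errors_a, errors_b: per-seed errors of two methods on the same seeds.
    t_crit: two-sided critical value for len(errors_a) - 1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    t_stat = mean_diff / se
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    return mean_diff, t_stat, ci


# Placeholder per-seed errors for a baseline and a proposed method (4 seeds).
baseline_err = [1.0, 2.0, 3.0, 4.0]
proposed_err = [0.0, 0.0, 1.0, 1.0]
mean_diff, t_stat, ci = paired_summary(baseline_err, proposed_err, t_crit=3.182)
print(f"mean difference = {mean_diff:.3f}, t = {t_stat:.3f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Pairing by seed matters because both methods see the same simulated datasets, so the per-seed differences are usually far less variable than the raw errors. A Wilcoxon signed-rank test would be the rank-based alternative when the differences are clearly non-normal.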

Circularity Check

0 steps flagged

No circularity in MOCA architectural proposal or empirical claims

The paper introduces MOCA as an independent architectural design using one-way cross-attention and gradient detachment (cutting feedback) to enforce modular separation between treatment and outcome modules. Performance claims rest on direct comparisons to external baselines (IPW, AIPW, X-learner, TARNet, DragonNet) across simulated scenarios and real datasets (IHDP, Dehejia-Wahba), with no mathematical derivation, parameter fitting, or prediction step that reduces to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the method is presented as a self-contained empirical proposal without tautological reductions or renamed known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method builds on standard causal inference assumptions and introduces new architectural components whose behavior is validated only through simulation and benchmark comparisons.

free parameters (1)

Transformer hyperparameters
Number of layers, attention heads, and learning rates are chosen or tuned during model training.

axioms (1)

domain assumption: Standard causal assumptions of consistency, positivity, and no unmeasured confounding (conditional on observed covariates)
Implicitly required for the confounder adjustment to identify causal effects from observational data.
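For reference, these three assumptions are usually written as follows for a binary treatment $T$, covariates $X$, and potential outcomes $Y(0), Y(1)$. The notation is the standard textbook one and is not taken from the paper. Together they yield the identification formula on the last line.

```latex
% Consistency: the observed outcome equals the potential outcome under the received treatment
Y = T\,Y(1) + (1 - T)\,Y(0)

% Positivity (overlap): every covariate profile has a chance of either treatment
0 < e(x) = \Pr(T = 1 \mid X = x) < 1 \quad \text{for all } x \text{ in the support of } X

% No unmeasured confounding (conditional exchangeability)
\{Y(0),\, Y(1)\} \perp\!\!\!\perp T \mid X

% Resulting identification of the average treatment effect
\tau = \mathbb{E}\bigl[\, \mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X] \,\bigr]
```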

invented entities (1)

One-way cross-attention with cutting feedback (no independent evidence)
purpose: To enforce directional information flow and prevent outcome leakage into treatment representations
Newly proposed mechanism in this framework.

pith-pipeline@v0.9.0 · 5541 in / 1381 out tokens · 48971 ms · 2026-05-08T07:24:17.271233+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1] Francesca Babiloni, Italo Marras, Jiankang Deng, Filippos Kokkinos, Matteo Maggioni, Grigorios Chrysos, and Stefanos Zafeiriou. Linear complexity self-attention with 3rd order polynomials. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12726-12737, 2023.
[2] Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962-973, 2005.
[3] John Barnard and Donald B. Rubin. Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4):948-955, 1999.
[4] M. J. Bayarri, J. O. Berger, and F. Liu. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 2009.
[5] Chris Carmona and Geoff Nicholls. Semi-modular inference: Enhanced learning in multi-modular models by tempering the influence of components. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 4226-4235, 2020.
[6] Rajeev H. Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053-1062, 1999.
[7] Jianqing Fan, Weichen Wang, and Zhihua Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. The Annals of Statistics, 49(3):1239-1266, 2021.
[8] D. T. Frazier and D. J. Nott. Cutting feedback and modularized analyses in generalized Bayesian inference. Bayesian Analysis, 20(4):1647-1675, 2025.
[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
[10] Stefano M. Iacus, Gary King, and Giuseppe Porro. A theory of statistical inference for matching methods in causal research. Political Analysis, 27(1):46-68, 2019.
[11] Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, pages 3020-3029, 2016.
[12] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156-4165, 2019.
[13] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 3744-3753, 2019.
[14] Fan Li, Lauren E. Thomas, and Fan Li. Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology, 188(1):250-257, 2019.
[15] Y. Liu and R. J. Goudie. A general framework for cutting feedback within modularized Bayesian inference. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1171-1199, 2025.
[16] Christos Louizos, Uri Shalit, Joris M. Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. Advances in Neural Information Processing Systems, 30, 2017.
[17] Viktoriia Melnychuk, D. Frauen, and Stefan Feuerriegel. Causal transformer for estimating counterfactual outcomes. In Proceedings of the 39th International Conference on Machine Learning, pages 15293-15329, 2022.
[18] James M. Robins, Miguel A. Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550-560, 2000.
[19] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.
[20] Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322-331, 2005.
[21] Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. Advances in Neural Information Processing Systems, 32, 2019.
[22] Tyler J. VanderWeele and Onyebuchi A. Arah. Unmeasured confounding for general outcomes, treatments, and confounders: Bias formulas for sensitivity analysis. Epidemiology, 22(1):42-52, 2011.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[24] A. Wang, X. Piao, X. Zhang, Y. Guo, and Y. Zhang. Rat: Residual attention transformer for tabular data. IEEE Transactions on Big Data, 2025.
