Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

Alan Ferrari

arxiv: 2605.28384 · v1 · pith:NILQ7NL2new · submitted 2026-05-27 · 💻 cs.LG

Pith reviewed 2026-06-29 14:41 UTC · model grok-4.3

classification 💻 cs.LG

keywords meta-attention · bayesian routing · transformer efficiency · per-token routing · variational inference · attention mechanisms · ELBO objective · Dirichlet prior

The pith

A Bayesian Meta-Controller routes each transformer token to full, linear, or local attention via a compute-aware Dirichlet prior, with a projected normalised FLOP cost of 25.1 percent under hard routing versus 59.3 percent for the prior-free baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Meta-Attention to replace the uniform application of one attention mechanism across all tokens with per-token dynamic routing among full softmax attention, linear kernel attention, and sliding-window local attention. A Bayesian Meta-Controller performs this routing by treating mechanism selection as posterior inference under a Dirichlet prior, with an amortised variational posterior trained via an ELBO that incorporates both task performance and mechanism cost. The design supplies uncertainty estimates to control the shift from soft to hard routing and avoids routing collapse without extra balancing terms. A sympathetic reader would care because uniform attention wastes computation on tokens that could use cheaper alternatives, and the reported Phase 1 results on a Tiny LM benchmark indicate substantially lower projected normalised FLOP cost together with lower routing entropy.

Core claim

Meta-Attention treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior. Routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound objective that jointly encodes task performance and attention-mechanism cost. This produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. On a Tiny LM benchmark the learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, versus 59.3% for the prior-free baseline.

What carries the argument

The amortised variational posterior q(alpha | x_t; phi) that outputs per-token routing weights under a compute-aware Dirichlet prior and is trained with an ELBO balancing task performance against attention cost.
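
One generic way to write an objective of this kind is sketched here. It is not the paper's exact formulation, which this page does not reproduce. The likelihood term stands for task performance and the KL term pulls the per-token posterior toward the compute-aware prior. The symbols y_t (prediction target), theta (model parameters), beta (KL weight), and alpha_0(c) (prior concentration as a function of a mechanism-cost vector c) are assumptions introduced for illustration.

```latex
\mathcal{L}(\phi, \theta)
  = \sum_{t} \Big(
      \mathbb{E}_{q(\alpha \mid x_t; \phi)}
        \big[ \log p(y_t \mid x_t, \alpha; \theta) \big]
      - \beta \, \mathrm{KL}\big(
          q(\alpha \mid x_t; \phi) \,\big\|\, \mathrm{Dir}(\alpha \mid \alpha_0(c))
        \big)
    \Big)
```

Under this reading, mechanism cost enters only through the prior concentration alpha_0(c). A separate explicit cost penalty would be an equally plausible design.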

If this is right

The Bayesian controller's learned distribution projects a normalised FLOP cost of 25.1 percent under hard routing.
Routing entropy falls from 55.8 percent to 43.3 percent.
The Dirichlet prior prevents routing collapse to a single mechanism while the prior-free model defaults to full attention.
Uncertainty estimates from the posterior govern the transition from soft to hard routing without extra loss terms.
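
The entropy and collapse statements above can be made concrete with a small sketch. The page does not say how the entropy percentages are defined. The code assumes Shannon entropy of each token's routing weights divided by log(K), averaged over tokens, and reports argmax usage per mechanism as a simple collapse indicator. The example weights are hypothetical.

```python
import math


def normalised_entropy(weights):
    """Shannon entropy of one routing distribution divided by log(K), so it lies in [0, 1]."""
    k = len(weights)
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(k)


def mean_routing_entropy(per_token_weights):
    """Average normalised entropy over all tokens."""
    return sum(normalised_entropy(w) for w in per_token_weights) / len(per_token_weights)


def mechanism_usage(per_token_weights):
    """Fraction of tokens whose highest-weight mechanism is each index (ties go to the lowest index)."""
    k = len(per_token_weights[0])
    counts = [0] * k
    for w in per_token_weights:
        counts[max(range(k), key=lambda i: w[i])] += 1
    n = len(per_token_weights)
    return [c / n for c in counts]


# Hypothetical routing weights over (full, linear, local) for four tokens.
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.5, 0.0],
]

print(mean_routing_entropy(weights))
print(mechanism_usage(weights))  # [0.75, 0.25, 0.0]
```

Routing collapse would show up here as one usage fraction close to 1. Low entropy alone does not distinguish confident, diverse routing from collapse, which is why both quantities are worth reporting.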

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the projected FLOP savings materialise on larger models and diverse tasks, the approach could reduce the inference cost of transformers enough to widen their practical deployment.
The same amortised variational controller could be applied to route other per-token decisions such as feed-forward layer width or quantisation level.
Hardware-specific cost models inserted into the ELBO might further tighten the trade-off between accuracy and measured latency.
Ablations that vary the strength of the Dirichlet prior would show how much the reported entropy reduction depends on the prior versus the variational training.

Load-bearing premise

The amortised variational posterior trained with the ELBO that jointly encodes task performance and attention-mechanism cost produces routing decisions whose projected FLOP savings and entropy reductions hold under hard routing on real workloads.

What would settle it

Measure actual FLOPs and task accuracy when the trained Meta-Attention model is run with hard routing on a standard language-modeling benchmark and compare the numbers directly against the prior-free baseline.

Original abstract

Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta-Attention, a framework that dynamically routes each token to the most appropriate attention strategy -- full softmax attention, linear (kernel) attention, or sliding-window local attention -- via a Bayesian Meta-Controller. Unlike prior routing approaches that use deterministic or prior-free learned routing, the Meta-Controller treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention-mechanism cost. This design produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non-Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward-pass correctness, posterior diversity, and a controlled ablation against a prior-free baseline. Code available at: https://github.com/KFEAL/meta-attention

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The core claim of projected FLOP savings under hard routing comes from the soft posterior rather than measured hard-routed runs, which undercuts the main empirical pitch.

The paper's main pitch is a Bayesian per-token router that picks among full softmax, linear, and sliding-window attention using a compute-aware Dirichlet prior and an amortised variational posterior trained by a joint ELBO. That setup is new relative to the deterministic or prior-free routers cited in the abstract.

What works is the framing itself. Treating mechanism choice as posterior inference gives a clean way to get uncertainty estimates and avoid collapse without extra balancing losses. The Phase 1 prototype and linked code at least confirm the forward pass runs and that the posterior stays diverse on the Tiny LM benchmark.

The soft spots are in the results. The headline numbers (25.1% normalised FLOP cost under hard routing versus 59.3% for the baseline, plus the entropy drop) are projections derived from the learned soft weights. Nothing in the abstract shows an actual hard-routed forward pass on the same benchmark that realises those savings. The ELBO is defined on soft routing, so the projection step is an assumption, not a measurement. Model size, training data, exact baselines, and how the FLOP cost is computed from the routing distribution are also missing, which makes the -34.2 pp gap hard to evaluate.
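
To see why the projection step matters, the following sketch computes two different cost figures from the same soft routing weights: the expectation under the weights and the cost after an argmax hardening. The per-mechanism relative costs and the example weights are hypothetical, and the page does not state which formula the paper uses.

```python
# Hypothetical relative costs, normalised so that full attention = 1.0.
MECHS = ["full", "linear", "local"]
COSTS = {"full": 1.0, "linear": 0.2, "local": 0.1}


def soft_expected_cost(per_token_weights):
    """Mean over tokens of sum_k w_k * cost_k, the cost implied by the soft weights."""
    total = 0.0
    for w in per_token_weights:
        total += sum(wk * COSTS[m] for wk, m in zip(w, MECHS))
    return total / len(per_token_weights)


def hard_projected_cost(per_token_weights):
    """Mean over tokens of the cost of the single highest-weight mechanism."""
    total = 0.0
    for w in per_token_weights:
        best = max(range(len(w)), key=lambda i: w[i])
        total += COSTS[MECHS[best]]
    return total / len(per_token_weights)


# Two hypothetical tokens with fairly flat routing weights.
weights = [
    [0.40, 0.35, 0.25],
    [0.20, 0.50, 0.30],
]

print(soft_expected_cost(weights))   # about 0.4125
print(hard_projected_cost(weights))  # about 0.6
```

With flat weights the two figures diverge noticeably, and neither is a measured FLOP count. A soft mixture generally has to evaluate every mechanism it mixes, so realised savings only appear once hard routing is actually executed.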

This is early-stage work aimed at people already working on dynamic routing and efficient inference. A reader who wants to see Bayesian methods tried on attention selection will find the architecture and objective worth looking at. The central argument is coherent on its own terms even if the current evidence is thin.

I would send it to peer review. The idea is distinct enough that referees can give useful feedback on whether the hard-routing gap can be closed and what the right benchmarks are.

Referee Report

2 major / 0 minor

Summary. The paper introduces Meta-Attention, a per-token routing framework for transformers that uses a Bayesian Meta-Controller with an amortised variational posterior q(α | x_t; ϕ) under a compute-aware Dirichlet prior. Routing decisions among full softmax, linear, and sliding-window attention are trained via an ELBO that jointly optimizes task loss and mechanism cost. The central empirical claim from Phase 1 results on a Tiny LM benchmark is that the learned routing distribution projects to a normalised FLOP cost of 25.1% under hard routing (vs. 59.3% for a prior-free baseline) and reduces routing entropy from 55.8% to 43.3%.

Significance. If the hard-routing projections are confirmed by direct measurement and the method scales, the use of a Dirichlet prior to control routing uncertainty and avoid collapse without auxiliary losses would represent a principled advance over deterministic routing schemes for dynamic compute allocation in transformers. The open-source PyTorch prototype is a positive contribution for reproducibility.

major comments (2)

[Abstract] The claim that the Bayesian controller implies a projected normalised FLOP cost of 25.1% under hard routing (–34.2 pp vs. baseline) is derived solely from the soft posterior q(α | x_t; ϕ) that was itself optimised by the ELBO on the same Tiny LM benchmark data. No results are reported from actual hard-routed forward passes (e.g., via argmax or sampling at inference time), so it remains unverified whether the projected savings and entropy reduction are realised when the controller is replaced by a deterministic mechanism.

[Abstract / Phase 1 results] The manuscript supplies no information on Tiny LM model size, training data, exact baseline implementations, error bars, or the precise procedure used to convert the learned routing distribution into the reported FLOP-cost projection. These omissions make the quantitative claims impossible to assess or reproduce from the given description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our Phase 1 results. We address the two major comments below and will revise the manuscript to improve clarity and verifiability.

Referee: [Abstract] The claim that the Bayesian controller implies a projected normalised FLOP cost of 25.1% under hard routing (–34.2 pp vs. baseline) is derived solely from the soft posterior q(α | x_t; ϕ) that was itself optimised by the ELBO on the same Tiny LM benchmark data. No results are reported from actual hard-routed forward passes (e.g., via argmax or sampling at inference time), so it remains unverified whether the projected savings and entropy reduction are realised when the controller is replaced by a deterministic mechanism.

Authors: We agree that the reported figures are projections computed from the learned soft posterior rather than direct measurements obtained by replacing the controller with a deterministic (argmax or sampled) mechanism at inference time. Phase 1 was intended to validate the ELBO training procedure, posterior diversity, and the effect of the Dirichlet prior on routing entropy. We will add a new set of experiments that perform actual hard routing during forward passes and report the resulting task performance and measured FLOP costs to confirm whether the projected savings materialise.

Revision: yes

Referee: [Abstract / Phase 1 results] The manuscript supplies no information on Tiny LM model size, training data, exact baseline implementations, error bars, or the precise procedure used to convert the learned routing distribution into the reported FLOP-cost projection. These omissions make the quantitative claims impossible to assess or reproduce from the given description.

Authors: We acknowledge that the current manuscript does not provide these implementation and experimental details. We will expand the Methods and Experimental Setup sections to specify the Tiny LM architecture, training corpus, baseline routing model, number of random seeds for error bars, and the exact formula used to obtain the normalised FLOP-cost projection from the posterior routing weights.

Revision: yes

Circularity Check

1 step flagged

Projected FLOP savings and entropy reductions are post-hoc calculations from the ELBO-fitted routing distribution on the same benchmark

specific steps

fitted input called prediction [Abstract]
"the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp)"

The routing distribution is the output of amortised variational inference trained with an ELBO that jointly encodes task performance and attention-mechanism cost on the Tiny LM benchmark. The quoted 'projected' FLOP cost and entropy values are then obtained by applying a hardening operation to this same fitted posterior; the numerical savings are therefore the direct result of the optimization rather than a prediction on independent data or actual hard-routed execution.

full rationale

The paper's core empirical claims (25.1% vs 59.3% FLOP cost, 43.3% vs 55.8% entropy) are computed directly from the learned q(alpha | x_t; phi) that was optimized by the ELBO on the Tiny LM benchmark. The ELBO explicitly encodes mechanism cost, so the reported savings under the hard-routing projection are a direct algebraic consequence of the fitted parameters rather than an independent measurement on actual hard-routed forward passes. This matches the fitted-input-called-prediction pattern with no external validation or held-out evaluation shown in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on standard variational inference assumptions plus new components (Meta-Controller, Dirichlet prior) whose parameters are fitted; no independent evidence is supplied for the invented routing entity beyond the Phase 1 prototype.

free parameters (2)

phi (amortised variational parameters)
Parameters of q(alpha | x_t; phi) trained end-to-end via ELBO.
Dirichlet prior parameters
Compute-aware Dirichlet prior over routing weights.
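
As an illustration of what "compute-aware" could mean for the prior parameters, the sketch below sets each Dirichlet concentration proportional to the inverse of a mechanism's relative cost, so the prior mean favours cheaper mechanisms. The source does not specify how cost enters the prior. The inverse-cost rule, the `strength` parameter, and the cost values are all assumptions made for this example.

```python
def compute_aware_concentration(costs, strength=3.0):
    """Concentrations proportional to 1 / cost, rescaled to sum to `strength`."""
    inverse = [1.0 / c for c in costs]
    z = sum(inverse)
    return [strength * v / z for v in inverse]


def dirichlet_mean(concentration):
    """Mean of Dir(a) is a_k / sum(a)."""
    s = sum(concentration)
    return [a / s for a in concentration]


# Hypothetical relative costs for (full, linear, local) attention.
costs = [1.0, 0.2, 0.1]
alpha0 = compute_aware_concentration(costs)

print(alpha0)                  # about [0.1875, 0.9375, 1.875]
print(dirichlet_mean(alpha0))  # about [0.0625, 0.3125, 0.625]
```

The `strength` value controls how tightly the prior concentrates around its mean. Varying it is one way to run the prior-strength ablation suggested earlier on this page.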

axioms (1)

standard math: Amortised variational inference via ELBO yields a useful approximation to the posterior over per-token routing decisions
Invoked in the training objective that jointly encodes task performance and mechanism cost.

invented entities (1)

Meta-Controller (no independent evidence)
purpose: Per-token Bayesian routing among attention mechanisms
New architectural component introduced to produce routing weights and uncertainty estimates.

pith-pipeline@v0.9.1-grok · 5831 in / 1380 out tokens · 44491 ms · 2026-06-29T14:41:17.562647+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 15 canonical work pages · 12 internal anchors

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

[2] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019

[3] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

[4] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, et al. Rethinking attention with performers. In ICLR, 2021

[5] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

[6] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(120):1–39, 2022

[7] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In ICLR, 2020

[8] A. Roy, M. Saffar, A. Vaswani, and D. Grangier. Efficient content-based sparse attention with routing transformers. TACL, 9:53–68, 2021

[9] D. Raposo, S. Ritter, B. Richards, T. Lillicrap, P. C. Humphreys, and A. Santoro. Mixture of depths: Dynamically allocating compute in transformer-based language models. arXiv:2404.02258, 2024

[10] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, et al. MoEfication: Transformer feed-forward layers are mixtures of experts. In Findings of ACL, 2022

[11] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023

[12] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In ICML, 2024

[13] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP, 2023

[14] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, et al. Retentive network: A successor to transformer for large language models. arXiv:2307.08621, 2023

[15] S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, et al. Zoology: Measuring and improving recall in efficient language models. In ICLR, 2024

[16] G. Chen, Y. Zhang, J. Su, W. Xu, S. Pan, Y. Wang, et al. Attention residuals. arXiv:2603.15031, 2026

[17] Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, D. Liu, J. Zhou, and J. Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv:2505.06708. NeurIPS 2025 Best Paper, 2025

[18] N. Zucchet, F. d’Angelo, A. K. Lampinen, and S. C. Y. Chan. The emergence of sparse attention: Impact of data distribution and benefits of repetition. arXiv:2505.17863. NeurIPS 2025 Oral, 2025

[19] M. Yau, E. Akyurek, J. Mao, J. B. Tenenbaum, S. Jegelka, and J. Andreas. Learning linear attention in polynomial time. arXiv:2410.10101. NeurIPS 2025 Oral, 2024

[20] J. Shah, G. Bikshandi, K. Zhang, T. Dao, V. Mirrokni, and C. Re. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608. NeurIPS 2024 Spotlight, 2024

[21] Y. Lu et al. MoBA: Mixture of block attention for long-context LLMs. arXiv:2502.13189, 2025

[22] J. Yuan et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv:2502.11089, 2025

[23] N. Agarwal, S. R. Dalal, and V. Misra. The Bayesian geometry of transformer attention. arXiv:2512.22471, 2025

[24] A. Y. Li and M. Wicker. Variational routing: A scalable Bayesian framework for calibrated mixture-of-experts transformers. arXiv:2603.09453, 2026

[25] F. Boncoraglio, H. Cui, F. Krzakala, and L. Zdeborová. Bayes optimal learning of attention-indexed models. arXiv:2506.01582, 2025

[26] M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. In NeurIPS, 2018

[27] A. Ferrari. Meta-Attention: Bayesian per-token routing for efficient transformer inference – reference implementation. https://github.com/KFEAL/meta-attention, 2025
