Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

Bo Long; Deepak Agarwal; Jelena Markovic-Voronov; Liuqing Li; Yi Wang

arxiv: 2605.18832 · v1 · pith:MXJOGVY7new · submitted 2026-05-12 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 21:39 UTC · model grok-4.3


keywords: transformer · Bayesian filtering · Kalman filter · kriging · uncertainty estimation · sequential recommendation · large language models · precision tracking


The pith

The standard Transformer is a degenerate case of a Bayesian Filtering Transformer that tracks precision using Kalman updates and kriging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the uniform treatment of every token in a Transformer ignores varying levels of uncertainty common in real data, such as sparse histories for new users or noisy labels and contexts. A sympathetic reader would care because restoring principled precision handling could improve robustness in recommendation systems and language models without redesigning the architecture from scratch. The authors reinterpret attention as precision-weighted kriging, residual connections as adaptive-gain Kalman updates, and feed-forward networks as dynamics models that propagate precision via a Jacobian combined with process noise. Observation precision is obtained from a parameter-free Restricted Maximum Likelihood estimator using a conjugate prior, all computed inside each layer. Replacing any Transformer layer with this Bayesian Filtering Transformer yields measurable gains on cold-start and noisy-data benchmarks.

Core claim

We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead.
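To make the "degenerate case" reading concrete, here is a minimal sketch of attention weights that carry a per-token precision. The multiplicative form (weight proportional to precision times exp(score)) and the function names are assumptions chosen for illustration; the paper's actual kriging weights are not given on this page and may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def precision_weighted_attention(scores, precisions):
    """Weights proportional to precision_j * exp(score_j).

    ASSUMPTION: this multiplicative weighting is a hypothetical form used
    only to illustrate the uniform-precision limit, not the paper's rule.
    """
    if any(p <= 0 for p in precisions):
        raise ValueError("precisions must be positive")
    # Adding log(precision) to a score multiplies exp(score) by the precision.
    shifted = [s + math.log(p) for s, p in zip(scores, precisions)]
    return softmax(shifted)

scores = [2.0, 1.0, 0.5]
# Uniform precision: the common factor cancels and plain softmax is recovered.
uniform = precision_weighted_attention(scores, [3.0, 3.0, 3.0])
# A low-precision first token loses attention mass to the others.
skewed = precision_weighted_attention(scores, [0.1, 3.0, 3.0])
```

Under this assumed form, any constant precision cancels in the normalization, which is one way the uniform case could collapse to ordinary softmax attention.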

What carries the argument

Bayesian Filtering Transformer (BFT), which reinterprets standard Transformer components to track and propagate per-token precision through Kalman filtering, kriging, and process-noise dynamics.
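The Jacobian-plus-process-noise idea can be sketched in the scalar case using the textbook Kalman predict step, where variance maps as J² · var + Q. This is a sketch under that standard assumption; the function name and the scalar simplification are illustrative and are not taken from the paper.

```python
def propagate_precision(precision, jacobian, process_noise):
    """Push a scalar precision through a locally linear map.

    Textbook predict step in variance form: var_out = J^2 * var_in + Q.
    ASSUMPTION: scalar state; the paper's per-token rule may differ.
    """
    if precision <= 0:
        raise ValueError("precision must be positive")
    if process_noise < 0:
        raise ValueError("process_noise must be non-negative")
    var_in = 1.0 / precision
    var_out = jacobian ** 2 * var_in + process_noise
    if var_out == 0.0:
        raise ValueError("output variance is zero; precision is unbounded")
    return 1.0 / var_out

# Identity map, no noise: precision is unchanged.
unchanged = propagate_precision(4.0, 1.0, 0.0)
# A map that doubles the signal quadruples the variance.
stretched = propagate_precision(4.0, 2.0, 0.0)
# Process noise can only lower the outgoing precision.
noisy = propagate_precision(4.0, 1.0, 1.0)
```

Process noise acts as a ceiling here: however confident the input, the outgoing precision cannot exceed 1/Q.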

Load-bearing premise

Observation precision can be computed from a parameter-free REML estimator with conjugate prior inside each layer without disrupting overall training dynamics.

What would settle it

Apply both a standard Transformer and the corresponding BFT version to a sequential recommendation dataset dominated by cold-start users, then check whether BFT fails to improve metrics specifically on rare items.

Figures

Figures reproduced from arXiv: 2605.18832 by Bo Long, Deepak Agarwal, Jelena Markovic-Voronov, Liuqing Li, Yi Wang.

**Figure 1.** LLM fine-tuning under noise. F1 on held-out test set (mean ±1 SE band, 20 paired same-seed runs per cell). (a) SQuAD answer-token corruption at np ∈ {0, 10, 20, 30, 40}%: BFT improves over SFT on clean data (np=0: +8.3%, p=0.001) and grows with corruption above np=10; pooled n=100, +3.87% relative F1, p<0.001. The np=10 cell is a within-noise null for both BFT and the focal-loss baseline (Appendix O.2). (b) …

**Figure 2.** The Qiu et al. (2025) sigmoid gate is structurally invariant to context length, while the same …

**Figure 3.** Quantitative consistency of the REML formula. Anchoring …

**Figure 4.** Per-(layer, head) BOS attention mass versus …

**Figure 5.** Beauty sensitivity, flat-plateau knobs. The Kalman gain initializer …

**Figure 6.** Beauty sensitivity, tunable knobs. Warmup steps, precision LR multiplier, and …

**Figure 7.** Kalman gain K and prior precision λ̂ at Layer 2 stratify by item training frequency. Each point is a frequency-bin mean over the 18,357 Sports items. (a) Rare items run at lower K (≈ 0.77) than popular items (≈ 0.85), Spearman ρ=0.126, p<10^{-65}. A lower K means the residual update incorporates less of the attention innovation e_t and leaves more of the prior state h_t intact; i.e., for rare items the model re…
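The Figure 7 caption describes K as scaling how much of the attention innovation e_t is added to the prior state h_t. A minimal sketch of that gated residual follows. The gain formula is the textbook scalar Kalman gain in precision form and is an assumption here; the paper's own parameterization of K is not shown on this page.

```python
def kalman_gain(prior_precision, obs_precision):
    """Textbook scalar gain: share of total precision held by the observation.

    ASSUMPTION: standard precision-form gain, not necessarily the paper's.
    """
    if prior_precision < 0 or obs_precision < 0:
        raise ValueError("precisions must be non-negative")
    total = prior_precision + obs_precision
    if total == 0:
        raise ValueError("at least one precision must be positive")
    return obs_precision / total

def residual_update(h, e, gain):
    """Element-wise h + gain * e, with h the prior state and e the innovation."""
    return [hi + gain * ei for hi, ei in zip(h, e)]

h_t = [1.0, 2.0]
e_t = [0.5, -1.0]
# gain = 1 gives the ordinary residual connection h + e.
standard = residual_update(h_t, e_t, 1.0)
# gain = 0 ignores the innovation and keeps the prior state.
frozen = residual_update(h_t, e_t, 0.0)
# An observation three times as precise as the prior yields K = 0.75.
k = kalman_gain(1.0, 3.0)
```

Read this way, the lower K reported for rare items means their states move less per layer in response to attention output.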

Original abstract

The Transformer is the foundational building block of modern AI, yet offers no principled handling of uncertainty, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On sequential recommendation, BFT applied to three major architectures yields significant gains on six benchmarks, with the largest improvements on cold-start users and rare items where uncertainty is highest. On supervised fine-tuning of large language models with noisy data, BFT improves robustness in two regimes: noisy supervision (token-label corruption in question answering) and noisy context (retrieval-augmented QA with real RAG distractors). A single principled modification, restoring precision, unlocks substantial headroom across both classical sequence-modeling and modern LLM regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BFT recasts Transformer layers as Kalman and kriging steps with REML precisions, delivering gains on cold-start recommendation and noisy LLM fine-tuning, but the exact degeneracy to standard attention is not yet shown to hold without extra terms.


The main takeaway is that this work reframes attention as precision-weighted kriging, the residual as an adaptive Kalman update, and the FFN as a dynamics step with Jacobian and process noise, all driven by a parameter-free REML estimator inside each layer. That framing is the actual novelty, and it is presented as recovering the usual Transformer when precisions become uniform.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Bayesian Filtering Transformer (BFT) as a generalization of the standard Transformer in which attention is reinterpreted as precision-weighted kriging, residual connections as a Kalman update with adaptive gain, and the FFN as a dynamics model that propagates precision via a Jacobian-plus-process-noise rule. Observation precision is obtained from a parameter-free REML estimator equipped with a conjugate Bayesian prior. The manuscript claims that uniform-precision standard attention/residual/FFN is recovered exactly as a degenerate case of BFT, and reports empirical gains when BFT replaces layers in recommendation and noisy-LLM fine-tuning settings, with largest improvements on cold-start and high-uncertainty regimes.

Significance. If the claimed exact reductions hold and the REML step integrates without breaking end-to-end training, the work supplies a principled uncertainty-aware extension to Transformers that could be useful for cold-start, heterogeneous-quality, and noisy-supervision regimes. The reported negligible overhead and consistent gains across three architectures and six benchmarks constitute a concrete strength; reproducible code or machine-checked derivations would further strengthen the contribution.

major comments (2)

[§4.1–4.3] (degeneracy derivations): the central claim that standard Transformer components are recovered exactly when BFT reduces to the uniform-precision limit requires an explicit, independent derivation showing that the REML estimator (with conjugate prior) produces observation precisions that make precision-weighted kriging collapse to dot-product attention, the Kalman gain to 1, and the Jacobian-plus-process-noise rule to ordinary FFN, without residual terms or layer-specific hyperparameters. The current presentation leaves this reduction implicit.
[§3.2] (REML estimator): the statement that REML is strictly parameter-free and closed-form inside each layer is load-bearing for both the degeneracy claim and the “negligible overhead” assertion. If the estimator requires iterative optimization or matrix projections whose fixed points depend on layer statistics that do not vanish in the uniform limit, the mathematical reduction fails and the interpretations do not hold.

minor comments (2)

[§2] Notation for precision variables and process-noise covariance should be introduced with a single consolidated table or diagram in §2 to avoid repeated re-definition across sections.
[§5] The experimental tables would benefit from an additional column or row reporting the overhead (FLOPs or wall-clock) of the REML step relative to the baseline Transformer layer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below, agreeing where the presentation can be strengthened and providing clarifications where the underlying claims already hold.


Referee: [§4.1–4.3] (degeneracy derivations): the central claim that standard Transformer components are recovered exactly when BFT reduces to the uniform-precision limit requires an explicit, independent derivation showing that the REML estimator (with conjugate prior) produces observation precisions that make precision-weighted kriging collapse to dot-product attention, the Kalman gain to 1, and the Jacobian-plus-process-noise rule to ordinary FFN, without residual terms or layer-specific hyperparameters. The current presentation leaves this reduction implicit.

Authors: We agree that the reduction to the standard Transformer is currently presented implicitly and would benefit from an explicit derivation. In the revised manuscript we will add a dedicated subsection to §4 that derives the uniform-precision limit in three steps: (i) the conjugate prior in the REML estimator yields constant observation precision when token variances are identical; (ii) precision-weighted kriging then reduces exactly to scaled dot-product attention; (iii) the Kalman gain becomes unity and the Jacobian-plus-process-noise propagation collapses to the ordinary FFN. The derivation introduces no layer-specific hyperparameters or residual terms. We will also supply a short appendix with the algebraic details so that the reduction can be verified independently. revision: yes

Referee: [§3.2] (REML estimator): the statement that REML is strictly parameter-free and closed-form inside each layer is load-bearing for both the degeneracy claim and the “negligible overhead” assertion. If the estimator requires iterative optimization or matrix projections whose fixed points depend on layer statistics that do not vanish in the uniform limit, the mathematical reduction fails and the interpretations do not hold.

Authors: The REML estimator is formulated with a conjugate prior that admits an exact closed-form solution per layer; it consists of a single evaluation of the sample precision from the quadratic form of the activations and requires no iterative optimization or iterative matrix projections. When all tokens share the same precision (the uniform limit), the estimator returns a uniform value by construction, so the fixed point is consistent across layers and the degeneracy holds. We will expand §3.2 with the explicit closed-form expression together with a short verification that the uniform case is recovered without contradiction, thereby reinforcing both the mathematical reduction and the negligible-overhead claim. revision: yes
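The rebuttal describes a single closed-form evaluation of a sample precision with a conjugate prior. A textbook instance of that pattern, for i.i.d. Gaussian values with unknown mean, is sketched below. It is an assumption-laden illustration: the Gamma prior hyperparameters `prior_shape` and `prior_rate` are hypothetical, and since the paper calls its estimator parameter-free, its actual form presumably differs.

```python
def reml_precision(values, prior_shape=0.0, prior_rate=0.0):
    """Closed-form precision estimate for i.i.d. Gaussian values, unknown mean.

    REML divides the residual sum of squares by n - 1 because one degree of
    freedom is spent estimating the mean. With a Gamma(prior_shape, prior_rate)
    prior on the precision, the posterior under the restricted likelihood is
    Gamma(prior_shape + (n - 1) / 2, prior_rate + rss / 2); its mean is returned.
    ASSUMPTION: hyperparameters are hypothetical and not taken from the paper.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    rss = sum((v - mean) ** 2 for v in values)
    shape = prior_shape + (n - 1) / 2.0
    rate = prior_rate + rss / 2.0
    if rate == 0.0:
        raise ValueError("zero residuals and no prior rate: precision unbounded")
    return shape / rate

# No prior: reduces to (n - 1) / rss, the reciprocal of the REML variance.
plain = reml_precision([1.0, 2.0, 3.0, 4.0])
# A weak prior shrinks the estimate toward prior_shape / prior_rate = 1.
shrunk = reml_precision([1.0, 2.0, 3.0, 4.0], prior_shape=1.0, prior_rate=1.0)
```

The whole computation is one pass over the values with no iteration, which is the property the negligible-overhead argument relies on.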

Circularity Check

1 step flagged

Degeneracy claim reduces to definitional special case upon introducing precision variables

specific steps

self-definitional [Abstract]
"We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior."

The paper defines BFT by augmenting the Transformer with precision tracking, Kalman-style updates, kriging, and process-noise propagation, then asserts that the original uniform-precision Transformer is recovered exactly when precisions are uniform. This makes the degeneracy statement true by construction of the generalized model and the choice of REML prior; the abstract contains no separate derivation that begins from the unmodified attention/residual/FFN equations and recovers them as a limit without presupposing the precision variables.

full rationale

The paper's core interpretive claim is that standard Transformer uniformity is exactly recovered as a degenerate case of BFT. This is presented via the abstract's mapping (attention to precision-weighted kriging, residual to Kalman update, FFN to Jacobian-plus-process-noise). The reduction is achieved by setting the newly introduced observation precisions to uniform values and invoking the REML estimator in its uniform limit. Because the BFT framework is constructed around these precision terms and the REML prior, the equivalence holds by the model's parameterization rather than by an independent derivation that starts from the original Transformer equations and arrives at the same limit without the added machinery. The abstract provides no explicit equations demonstrating that the REML step vanishes without residuals or layer-specific statistics in the high-precision limit. This satisfies the self-definitional pattern for the load-bearing interpretation, though the empirical results on recommendation and LLM tasks may still stand independently.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond the standard Transformer components and the REML estimator; the precision variable is presented as derived rather than postulated.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

[1] L. M. Bui, T. Tran Huu, D. Dinh, T. M. Nguyen, and T. N. Hoang. Revisiting kernel attention with correlated Gaussian process representation. In UAI, 2024

[2] P. Zhou, Q. Ye, Y. Xie, J. Gao, S. Wang, J. B. Kim, C. You, and S. Kim. Attention calibration for Transformer-based sequential recommendation (AC-TSR). In CIKM, 2023

[3] W. Chen and Y. Li. Calibrating Transformers via sparse Gaussian processes. In ICLR, 2023

[4] H. Chen, Y. Lin, M. Pan, L. Wang, C.-C. M. Yeh, X. Li, Y. Zheng, F. Wang, and H. Yang. Denoising self-attentive sequential recommendation. In RecSys, 2022

[5] Y. Chen, Q. Tao, F. Tonin, and J. A. K. Suykens. Self-attention through kernel-eigen pair sparse variational Gaussian processes. In ICML, 2024

[6] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024

[7] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation. In ICML, 2016

[8] S. Haykin. Kalman Filtering and Neural Networks. Wiley, 2004

[9] J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak. Infinite attention: NNGP and NTK for deep attention networks. In ICML, 2020

[10] P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233. University of California Press, 1967

[11] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave. Unsupervised dense information retrieval with contrastive learning. TMLR, 2022

[12] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960

[13] W.-C. Kang and J. McAuley. Self-attentive sequential recommendation. In ICDM, 2018

[14] L. Kish. Survey Sampling. Wiley, 1965

[15] D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. J. South African Inst. Mining Metall., 52(6):119–139, 1951

[16] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. TACL, 7:453–466, 2019

[17] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017

[18] J. Li, R. Socher, and S. C. H. Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020

[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017

[20] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. TACL, 12:157–173, 2024

[21] G. Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963

[22] E. A. Nadaraya. On estimating regression. Theory Probab. Appl., 9(1):141–142, 1964

[23] S. K. Nielsen, L. U. Abdullaev, R. S. Y. Teo, and T. M. Nguyen. Elliptical attention. In NeurIPS, 2024

[24] Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, D. Liu, J. Zhou, and J. Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In NeurIPS, 2025

[25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016

[26] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006

[27] G. Revach, N. Shlezinger, X. Ni, A. L. López Escoriza, R. J. G. van Sloun, and Y. C. Eldar. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70:1532–1547, 2022

[28] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from Transformer. In CIKM, 2019

[29] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models. arXiv:2402.17762, 2024

[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017

[31] W. Wang, F. Feng, X. He, L. Nie, and T.-S. Chua. Denoising implicit feedback for recommendation. In WSDM, 2021

[32] H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838, 1980

[33] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024

[34] J. Zhai, L. Liao, X. Liu, Y. Wang, R. Li, X. Cao, L. Gao, Z. Gong, F. Gu, J. He, Y. Lu, and Y. Shi. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. In ICML, 2024

[35] P. Zhang, G. Zeng, T. Wang, and W. Lu. TinyLlama: An open-source small language model. arXiv:2401.02385, 2024
