Cubit: Token Mixer with Kernel Ridge Regression

Anderson Schneider; Chuanyang Zheng; Jiankai Sun; Liangchen Tan; Mac Schwager; XiaoDong Liu; Yihang Gao; Yuehao Wang; Yuriy Nevmyvaka

arxiv: 2605.06501 · v2 · pith:4TBBZIWAnew · submitted 2026-05-07 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-20 22:30 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords Kernel Ridge Regression · Token mixing · Transformer attention · Nadaraya-Watson regression · Long-sequence modeling · Cubit architecture · Limited-Range Rescale

The pith

Cubit replaces the Transformer's attention with a Kernel Ridge Regression token mixer to strengthen long-sequence modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper interprets standard Transformer attention as performing Nadaraya-Watson regression on token similarities. It introduces Cubit, which substitutes the closed-form solution of Kernel Ridge Regression for value aggregation and kernel-matrix inversion for normalization, while adding Limited-Range Rescale to keep training stable. This substitution is presented as giving the architecture a firmer mathematical basis than Nadaraya-Watson regression. Experiments indicate that the resulting model improves performance on long sequences and that the advantage grows as the length of sequences seen during training increases.

Core claim

Cubit modifies classical attention by using the closed-form Kernel Ridge Regression (KRR) solution, which combines kernel-similarity value aggregation with normalization through the inverse kernel matrix, and adds Limited-Range Rescale (LRR) for stability. The architecture thereby rests on Kernel Ridge Regression rather than Nadaraya-Watson regression, and it shows stronger long-sequence modeling whose gains increase with training sequence length.
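The difference between the two estimators can be made concrete with a toy sketch. This is not the paper's implementation: the exponential dot-product kernel, the regularization value `lam`, the helper names, the absence of causal masking, and the omission of Limited-Range Rescale are all assumptions made for illustration. With this kernel the Nadaraya-Watson branch is ordinary softmax attention, while the KRR branch first solves a linear system against the key Gram matrix.

```python
import math

def exp_kernel(x, y):
    """Exponential dot-product kernel exp(x.y / sqrt(d)), i.e. the unnormalized softmax weight."""
    return math.exp(sum(a * b for a, b in zip(x, y)) / math.sqrt(len(x)))

def solve(A, B):
    """Solve A X = B (A is n x n, B is n x c) by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def nw_mixer(queries, keys, values):
    """Nadaraya-Watson mixing: kernel-weighted average of the values (softmax attention)."""
    out = []
    for q in queries:
        w = [exp_kernel(q, k) for k in keys]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, values)) / z
                    for c in range(len(values[0]))])
    return out

def krr_mixer(queries, keys, values, lam=1e-3):
    """KRR mixing: k(q)^T (K + lam I)^{-1} V, with K the Gram matrix of the keys."""
    n = len(keys)
    gram = [[exp_kernel(keys[i], keys[j]) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, values)  # (K + lam I)^{-1} V, the cubic-cost step
    out = []
    for q in queries:
        w = [exp_kernel(q, k) for k in keys]
        out.append([sum(wi * a[c] for wi, a in zip(w, alpha))
                    for c in range(len(values[0]))])
    return out
```

When the queries coincide with the keys and `lam` is tiny, the KRR output nearly reproduces the stored values, whereas the Nadaraya-Watson output is always a convex combination of them. That contrast is one way to read the "normalization through the inverse kernel matrix" wording, under the assumptions above.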

What carries the argument

A Kernel Ridge Regression token mixer: the closed-form KRR solution, stabilized by Limited-Range Rescale, takes the place of the Nadaraya-Watson computation inside attention.

If this is right

Cubit rests on a closed-form regression solution rather than the similarity-weighted average used in attention.
Performance advantage over the vanilla Transformer grows as the length of training sequences increases.
The Limited-Range Rescale step is required to maintain training stability when the KRR formulation is adopted.
The architecture supplies a concrete alternative token-mixing primitive that can be swapped into existing Transformer pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Other sequence models that currently rely on similarity-based aggregation might similarly benefit from substituting closed-form kernel methods.
The regression view of token mixing invites direct comparisons of different kernel choices or regularization strengths inside the same framework.
If the scaling trend continues, Cubit-style mixers could reduce the need for specialized long-context techniques such as sparse attention or memory banks.

Load-bearing premise

Replacing Nadaraya-Watson regression inside attention with the closed-form Kernel Ridge Regression solution plus Limited-Range Rescale will improve long-range modeling without creating new instabilities or demanding extensive retuning.

What would settle it

Train Cubit and a matched Transformer on the same long-sequence tasks while steadily increasing sequence length. If the performance gap fails to widen, or if Cubit's training becomes unstable without extra hyper-parameter search, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.06501 by Anderson Schneider, Chuanyang Zheng, Jiankai Sun, Liangchen Tan, Mac Schwager, XiaoDong Liu, Yihang Gao, Yuehao Wang, Yuriy Nevmyvaka.

**Figure 1.** The performance of different methods on the Arxiv and Books3 dataset, with model …

**Figure 2.** The performance of different methods on the FineWeb dataset, with model parameter …

**Figure 3.** The performance of long training length on the FineWeb dataset, with model parameter …

**Figure 4.** The performance of larger model size on the FineWeb dataset.

**Figure 5.** The performance of the share key embedding and no Limited-Range Rescale, with model …

**Figure 6.** The performance of long training length on the FineWeb dataset, with model parameter …

Original abstract

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cubit swaps Nadaraya-Watson attention for a closed-form KRR mixer plus LRR, but the n-by-n inverse still looks like a cubic bottleneck that clashes with the long-sequence gains they highlight.

The main thing here is that the paper maps standard attention to Nadaraya-Watson regression and then substitutes the closed-form KRR solution for the mixing step, with an added Limited-Range Rescale to keep values stable during training. That specific replacement is the concrete new step beyond earlier kernel-attention links in the literature. They run experiments that reportedly show the advantage over Transformers growing as sequence length increases, which is the result they lean on to claim better long-range modeling. The regression framing itself is laid out clearly enough to make the motivation readable.

The soft spot is scaling, the same one the stress-test note flags. Forming the dense kernel matrix per layer or head is already quadratic in sequence length, and inverting it is cubic. Nothing in the abstract or the described construction mentions random features, low-rank updates, or iterative solvers that would keep the exact closed form while restoring linear or quadratic cost. LRR only controls numerical range, not complexity. If the reported gains come from lengths where the cubic term is still cheap, they do not yet test the regime the central claim emphasizes. Without equations or implementation details on how the inverse is computed or approximated at scale, it is hard to judge whether the math delivers in practice or just adds overhead.

This paper is for readers already thinking about kernel methods or non-attention token mixers. Someone working on long-context alternatives might pick up the regression perspective as a starting point, though they would need to see the full implementation before trying it. It has a clear enough construction and some experimental signal to deserve a serious referee rather than a desk reject, even with the open questions on complexity.
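To put rough numbers on the scaling concern, the sketch below compares leading-order FLOP counts for one head of dense softmax attention against an exact KRR mixer. Everything here is an assumption for illustration: the head dimension `d = 64`, a single shared n×n kernel matrix, a dense Cholesky-based solve, no causal masking, and the function names. The paper's actual implementation and costs are not described in the material reviewed here.

```python
def attention_flops(n, d):
    """Leading-order FLOPs for dense attention on one head: QK^T and AV, 2*n*n*d each."""
    return 4 * n * n * d

def krr_flops(n, d):
    """Leading-order FLOPs for exact KRR mixing on one head (assumed dense solve)."""
    gram = 2 * n * n * d        # build the n x n kernel matrix K
    cholesky = n ** 3 / 3       # factor K + lam * I
    tri_solves = 2 * n * n * d  # forward and back substitution against V (n x d)
    mix = 2 * n * n * d         # final product K @ alpha
    return gram + cholesky + tri_solves + mix

def overhead_ratio(n, d=64):
    """How many times more arithmetic the KRR mixer needs than attention."""
    return krr_flops(n, d) / attention_flops(n, d)

if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        print(n, round(overhead_ratio(n), 2))
```

Under these assumptions the ratio simplifies to 1.5 + n / (12 d), so the overhead grows linearly with sequence length, and the cubic term alone equals the full attention cost at n = 12 d (768 tokens for d = 64). Wall-clock behavior on a GPU can differ substantially from FLOP counts, which is why the measured timings matter.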

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Cubit, a token-mixing architecture that reinterprets standard Transformer attention as Nadaraya-Watson regression and replaces it with a Kernel Ridge Regression (KRR) formulation. Cubit incorporates the closed-form KRR solution for value aggregation via kernel similarities and normalization via the inverse kernel matrix, with Limited-Range Rescale (LRR) added for training stability. The central claims are that this yields a stronger mathematical foundation than vanilla attention and superior long-sequence modeling performance whose advantage grows with increasing training sequence length, validated through experiments.

Significance. The regression-based reinterpretation of attention and the explicit use of closed-form KRR provide a coherent theoretical lens that could inspire kernel-grounded alternatives to attention. The introduction of LRR for stability is a practical contribution. If the scalability concerns can be resolved without losing the exact closed-form property, the work could influence designs for long-context models; however, the current formulation's complexity limits its immediate significance for the regimes where gains are claimed to increase.

major comments (2)

[Abstract] The description of Cubit as incorporating 'the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix' implies per-layer formation and inversion of an n×n Gram matrix (output = K(K + λI)^{-1}V or equivalent). This incurs O(n^3) cost that is not addressed by any low-rank, random-feature, or iterative-solver technique, directly undermining the claim that performance gains increase with training sequence length.
[§4 (Experiments)] No sequence lengths, wall-clock timings, or memory profiles are reported for the long-sequence regime, nor is it stated whether the kernel inverse was computed exactly or approximated. Without these details the empirical support for the 'performance gain increases as training sequence length grows' claim cannot be evaluated against the cubic scaling inherent in the stated formulation.
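For orientation, the iterative-solver option named in the first major comment can be sketched as conjugate gradient applied to (K + λI)x = v. Nothing in the material reviewed here suggests the paper uses it; the function names, tolerance, and iteration cap are hypothetical. The method applies because K + λI is symmetric positive definite whenever K is a positive semi-definite kernel matrix and λ > 0.

```python
import math

def regularized_kernel_matvec(K, lam):
    """Return a function computing (K + lam * I) @ v without forming an inverse."""
    def matvec(v):
        return [sum(kij * vj for kij, vj in zip(row, v)) + lam * vi
                for row, vi in zip(K, v)]
    return matvec

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(r)              # current search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        step = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Each iteration costs one dense matrix-vector product, O(n^2) per value column, so the total depends on how many iterations the conditioning of K + λI requires. Whether truncating the solve this way would preserve Cubit's reported behavior is an open question, not something this sketch establishes.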

minor comments (2)

[§3.2] The mathematical definition of LRR (rescaling range, interaction with the KRR closed form) is only described at a high level; an explicit equation would clarify its effect on the solution.
[§3] Notation for the kernel matrix K and regularization parameter λ should be introduced once and used consistently across the method and complexity discussion.
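One way the notation requested in the second minor comment could be fixed is written out below. Only the form K(K + λI)^{-1}V comes from the report itself; the symbols n, d, κ, q_i, k_j, v_j and x_i, and the choice of the exponential kernel, are assumptions introduced here for illustration rather than the paper's own definitions.

```latex
\begin{aligned}
\text{Nadaraya-Watson (attention):}\quad
  o_i &= \frac{\sum_{j=1}^{n} \kappa(q_i, k_j)\, v_j}{\sum_{l=1}^{n} \kappa(q_i, k_l)},
  \qquad \kappa(q, k) = \exp\!\left(q^\top k / \sqrt{d}\right), \\
\text{Kernel ridge regression:}\quad
  O &= K\,(K + \lambda I)^{-1} V,
  \qquad K_{ij} = \kappa(x_i, x_j), \quad \lambda > 0 .
\end{aligned}
```

In the first line the normalization is a per-row scalar, which is exactly the softmax denominator. In the second it is a full n×n solve, which is where the O(n^3) term discussed in the major comments comes from.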

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The points raised regarding computational complexity and the need for detailed experimental reporting are valid and will help improve the clarity of the manuscript. We address each major comment below.

Referee: [Abstract] The description of Cubit as incorporating 'the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix' implies per-layer formation and inversion of an n×n Gram matrix (output = K(K + λI)^{-1}V or equivalent). This incurs O(n^3) cost that is not addressed by any low-rank, random-feature, or iterative-solver technique, directly undermining the claim that performance gains increase with training sequence length.

Authors: We agree that the current Cubit formulation uses the exact closed-form KRR solution, which requires forming and inverting an n×n kernel matrix per layer and therefore has cubic complexity. This is a real limitation that prevents direct application to arbitrarily long sequences without further approximations. Our experiments show performance advantages that grow with sequence length within the tested range (up to a few thousand tokens), but we do not claim the method is already scalable to extreme lengths. In the revision we will update the abstract to explicitly note the O(n^3) cost and add a short discussion of possible future approximations that preserve the closed-form regression interpretation.
revision: partial

Referee: [§4 (Experiments)] No sequence lengths, wall-clock timings, or memory profiles are reported for the long-sequence regime, nor is it stated whether the kernel inverse was computed exactly or approximated. Without these details the empirical support for the 'performance gain increases as training sequence length grows' claim cannot be evaluated against the cubic scaling inherent in the stated formulation.

Authors: We thank the referee for highlighting this omission. The kernel inverse was computed exactly using standard dense linear-algebra routines on GPU for the sequence lengths employed in our experiments. We will revise Section 4 to report the exact sequence lengths tested, wall-clock training and inference times, and peak memory usage for both Cubit and the Transformer baseline. These additions will allow readers to directly assess the performance–compute trade-off.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is an explicit architectural substitution

The paper's chain begins with an interpretive claim that standard attention equals Nadaraya-Watson kernel regression, then deliberately substitutes the closed-form KRR solution plus LRR rescaling to obtain Cubit. This substitution is presented as a motivated design choice rather than a tautology in which the output is defined to equal the input. The stronger-foundation argument follows directly from the chosen replacement, and the long-sequence performance claim is offered as an empirical observation to be validated by experiments, not as a quantity recovered by construction from fitted parameters or prior self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the provided text; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the KRR formulation and LRR rescaling are presented as direct substitutions, and neither regularization constants nor kernel choices are listed as fitted quantities.

pith-pipeline@v0.9.0 · 5795 in / 1111 out tokens · 19766 ms · 2026-05-20T22:30:30.029289+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 15 internal anchors

[1]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

work page 2023
[2]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[3]

Effect of dimensionality on convergence rates of kernel ridge regression estimator.Journal of Statistical Planning and Inference, 236:106228, 2025

Kwan-Young Bak and Woojoo Lee. Effect of dimensionality on convergence rates of kernel ridge regression estimator.Journal of Statistical Planning and Inference, 236:106228, 2025

work page 2025
[4]

Overfitting regimes of nadaraya-watson interpolators.arXiv e-prints, pages arXiv–2502, 2025

Daniel Barzilai, Guy Kornowski, and Ohad Shamir. Overfitting regimes of nadaraya-watson interpolators.arXiv e-prints, pages arXiv–2502, 2025

work page 2025
[5]

xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547–107603, 2024

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547–107603, 2024

work page 2024
[6]

Sage, 2004

Richard A Berk.Regression analysis: A constructive critique, volume 11. Sage, 2004

work page 2004
[7]

Piqa: Reasoning about phys- ical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

work page 2020
[8]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 12

work page 1901
[9]

Local linear regression estimator on the boundary correction in nonparametric regression estimation.Journal of Statistical Theory and Applications, 19(3):460– 471, 2020

Langat Reuben Cheruiyot. Local linear regression estimator on the boundary correction in nonparametric regression estimation.Journal of Statistical Theory and Applications, 19(3):460– 471, 2020

work page 2020
[10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Lowess: A program for smoothing scatterplots by robust locally weighted regression.The American Statistician, 35(1):54, 1981

William S Cleveland. Lowess: A program for smoothing scatterplots by robust locally weighted regression.The American Statistician, 35(1):54, 1981

work page 1981
[12]

Locally weighted regression: an approach to regression analysis by local fitting.Journal of the American statistical association, 83(403):596–610, 1988

William S Cleveland and Susan J Devlin. Locally weighted regression: an approach to regression analysis by local fitting.Journal of the American statistical association, 83(403):596–610, 1988

work page 1988
[13]

Smoothing by local regression: Principles and methods

William S Cleveland and Catherine Loader. Smoothing by local regression: Principles and methods. InStatistical theory and computational aspects of smoothing: Proceedings of the COMPSTAT’94 Satellite Meeting held in Semmering, Austria, 27–28 August 1994, pages 10–49. Springer, 2013

work page 1994
[14]

Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y

Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,P...

work page 2024
[15]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

work page 2021
[17]

Finding structure in time.Cognitive science, 14(2):179–211, 1990

Jeffrey L Elman. Finding structure in time.Cognitive science, 14(2):179–211, 1990

work page 1990
[18]

Distributed representations, simple recurrent networks, and grammatical structure.Machine learning, 7(2):195–225, 1991

Jeffrey L Elman. Distributed representations, simple recurrent networks, and grammatical structure.Machine learning, 7(2):195–225, 1991

work page 1991
[19]

Semiparametric estimates of the relation between weather and electricity sales.Journal of the American statistical Association, 81(394):310–320, 1986

Robert F Engle, Clive WJ Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation between weather and electricity sales.Journal of the American statistical Association, 81(394):310–320, 1986

work page 1986
[20]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022
[21]

USAF school of Aviation Medicine, 1985

Evelyn Fix.Discriminatory analysis: nonparametric discrimination, consistency properties, volume 1. USAF school of Aviation Medicine, 1985

work page 1985
[22]

cambridge university press, 2009

David A Freedman.Statistical models: theory and practice. cambridge university press, 2009. 13

work page 2009
[23]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024
[24]

Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021

work page 2021
[25]

Long short-term memory.Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

Alex Graves. Long short-term memory.Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

work page 2012
[26]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024

work page 2024
[27]

On the parameterization and initialization of diagonal state space models.Advances in neural information processing systems, 35:35971–35983, 2022

Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models.Advances in neural information processing systems, 35:35971–35983, 2022

work page 2022
[28]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

Local kernel ridge regression for scalable, interpolating, continuous regression

Mingxuan Han, Chenglong Ye, and Jeff Phillips. Local kernel ridge regression for scalable, interpolating, continuous regression. 2022

work page 2022
[30]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[31]

Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

John J Hopfield. Neural networks and physical systems with emergent collective computational abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

work page 1982
[32]

Consistency of local linear regression estimator for mixtures with varying concentrations.Modern Stochastics: Theory and Applications, 11(3):359–372, 2024

Daniel Horbunov and Rostyslav Maiboroda. Consistency of local linear regression estimator for mixtures with varying concentrations.Modern Stochastics: Theory and Applications, 11(3):359–372, 2024

work page 2024
[33]

Densely connected convolutional networks

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

work page 2017
[34]

Adaptive mixtures of local experts.Neural computation, 3(1):79–87, 1991

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts.Neural computation, 3(1):79–87, 1991

work page 1991
[35]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Improvement of boundary bias in nonparametric regression via twicing technique

Jae-Keun Jo. Improvement of boundary bias in nonparametric regression via twicing technique. Communications for Statistical Applications and Methods, 4(2):445–452, 1997. 14

work page 1997
[37]

Serial order: A parallel distributed processing approach

Michael I Jordan. Serial order: A parallel distributed processing approach. 1986

work page 1986
[38]

Finetuning pretrained transformers into rnns

Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 10630–10643, 2021

work page 2021
[39]

Multiplicative LSTM for sequence modelling

Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling.arXiv preprint arXiv:1609.07959, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[40]

Race: Large-scale reading comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 785–794, 2017

work page 2017
[41]

Virtual width networks.arXiv preprint arXiv:2511.11238, 2025

Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chengyin Xu, Chi Zhang, Chong Hu, et al. Virtual width networks.arXiv preprint arXiv:2511.11238, 2025

work page arXiv 2025
[42]

Video-LLaVA: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, Miami, Florida, USA, November

work page 2024
[43]

Association for Computational Linguistics

work page
[44]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Optimal rates and saturation for noiseless kernel ridge regression.arXiv preprint arXiv:2402.15718, 2024

Jihao Long, Xiaojun Peng, and Lei Wu. Optimal rates and saturation for noiseless kernel ridge regression.arXiv preprint arXiv:2402.15718, 2024

work page arXiv 2024
[46]

Fineweb-edu: the finest collection of educational content, 2024

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024

work page 2024
[47]

Megalodon: Efficient llm pretraining and inference with unlimited context length.Advances in Neural Information Processing Systems, 37:71831–71854, 2024

Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient llm pretraining and inference with unlimited context length.Advances in Neural Information Processing Systems, 37:71831–71854, 2024

work page 2024
[48]

the smoothing of time series

Frederick R Macaulay. Introduction to" the smoothing of time series". InThe Smoothing of Time Series, pages 17–30. NBER, 1931

work page 1931
[49]

Parallelizing spectrally regularized kernel algorithms

Nicole MÃžcke and Gilles Blanchard. Parallelizing spectrally regularized kernel algorithms. Journal of Machine Learning Research, 19(30):1–29, 2018

work page 2018
[50]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation

AI Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 4-7-2025

work page 2025
[51]

Mattes Mollenhauer, Nicole Mücke, Dimitri Meunier, and Arthur Gretton. Regularized least squares learning with heavy-tailed noise is minimax optimal. arXiv preprint arXiv:2505.14214, 2025.
[52]

Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[53]

Lori Murray and David Bellhouse. Wf sheppard’s smoothing method: A precursor to local polynomial regression. International Statistical Review, 87(3):604–612, 2019.
[54]

Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
[55]

John Ashworth Nelder and Robert WM Wedderburn. Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384, 1972.
[56]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[57]

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[58]

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7 "goose" with expressive dynamic state evolution. arXiv preprint arXiv:2503.14456, 2025.
[59]

Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022.
[60]

Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations, 2024.
[61]

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904, 2024.
[62]

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025.
[63]

Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, et al. Theory, analysis, and best practices for sigmoid self-attention. arXiv preprint arXiv:2409.04431, 2024.
[64]

Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, et al. Decoder-hybrid-decoder architecture for efficient reasoning with long generation. arXiv preprint arXiv:2507.06607, 2025.
[65]

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
[66]

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in neural information processing systems, 34:17555–17566, 2021.
[67]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
[68]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019.
[69]

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[70]

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
[71]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.
[72]

Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. Jetmoe: Reaching llama2 performance with 0.1 m dollars. arXiv preprint arXiv:2404.07413, 2024.
[73]

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
[74]

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
[75]

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.
[76]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
[77]

Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, et al. Attention residuals. arXiv preprint arXiv:2603.15031, 2026.
[78]

Henri Theil. A rank-invariant method of linear and polynomial regression analysis. Indagationes mathematicae, 12(85):173, 1950.
[79]

Juliana Tolles and William J Meurer. Logistic regression: relating patient characteristics to outcomes. Jama, 316(5):533–534, 2016.
[80]

Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.

[43] [43]

Association for Computational Linguistics

work page

[44] [44]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

Optimal rates and saturation for noiseless kernel ridge regression.arXiv preprint arXiv:2402.15718, 2024

Jihao Long, Xiaojun Peng, and Lei Wu. Optimal rates and saturation for noiseless kernel ridge regression.arXiv preprint arXiv:2402.15718, 2024

work page arXiv 2024

[46] [46]

Fineweb-edu: the finest collection of educational content, 2024

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024

work page 2024

[47] [47]

Megalodon: Efficient llm pretraining and inference with unlimited context length.Advances in Neural Information Processing Systems, 37:71831–71854, 2024

Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient llm pretraining and inference with unlimited context length.Advances in Neural Information Processing Systems, 37:71831–71854, 2024

work page 2024

[48] [48]

the smoothing of time series

Frederick R Macaulay. Introduction to" the smoothing of time series". InThe Smoothing of Time Series, pages 17–30. NBER, 1931

work page 1931

[49] [49]

Parallelizing spectrally regularized kernel algorithms

Nicole MÃžcke and Gilles Blanchard. Parallelizing spectrally regularized kernel algorithms. Journal of Machine Learning Research, 19(30):1–29, 2018

work page 2018

[50] [50]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation

AI Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 4-7-2025

work page 2025

[51] [51]

Regularized least squares learning with heavy-tailed noise is minimax optimal.arXiv preprint arXiv:2505.14214, 2025

Mattes Mollenhauer, Nicole MÃžcke, Dimitri Meunier, and Arthur Gretton. Regularized least squares learning with heavy-tailed noise is minimax optimal.arXiv preprint arXiv:2505.14214, 2025. 15

work page arXiv 2025

[52] [52]

MIT press, 2012

Kevin P Murphy.Machine learning: a probabilistic perspective. MIT press, 2012

work page 2012

[53] [53]

Wf sheppard’s smoothing method: A precursor to local polynomial regression.International Statistical Review, 87(3):604–612, 2019

Lori Murray and David Bellhouse. Wf sheppard’s smoothing method: A precursor to local polynomial regression.International Statistical Review, 87(3):604–612, 2019

work page 2019

[54] [54]

On estimating regression.Theory of Probability & Its Applications, 9(1):141–142, 1964

Elizbar A Nadaraya. On estimating regression.Theory of Probability & Its Applications, 9(1):141–142, 1964

work page 1964

[55] [55]

Generalized linear models.Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384, 1972

John Ashworth Nelder and Robert WM Wedderburn. Generalized linear models.Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384, 1972

work page 1972

[56] [56]

Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

work page 2019

[57] [57]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024

[58] [58]

Rwkv-7" goose" with expressive dynamic state evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7" goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456, 2025

work page arXiv 2025

[59] [59]

Train short, test long: Attention with linear biases enables input length extrapolation

Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. InInternational Conference on Learning Representations, 2022

work page 2022

[60] [60]

From sparse to soft mixtures of experts

Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[61] [61]

Hgrn2: Gated linear rnns with state expansion.ArXiv preprint, abs/2404.07904, 2024

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

work page arXiv 2024

[62] [62]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

arXiv:2409.04431 , year =

Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, et al. Theory, analysis, and best practices for sigmoid self-attention.arXiv preprint arXiv:2409.04431, 2024

work page arXiv 2024

[64] [64]

Decoder-hybrid-decoder architecture for efficient reasoning with long generation.arXiv preprint arXiv:2507.06607, 2025

Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, et al. Decoder-hybrid-decoder architecture for efficient reasoning with long generation.arXiv preprint arXiv:2507.06607, 2025

work page arXiv 2025

[65] [65]

Scaling vision with sparse mixture of experts

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021. 16

work page 2021

[66] [66]

Hash layers for large sparse models

Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. advances in neural information processing systems, 34:17555–17566, 2021

work page 2021

[67] [67]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021

[68] [68]

Social iqa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019

work page 2019

[69] [69]

Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992

work page 1992

[70] [70]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[71] [71]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, *Azalia Mirhoseini, *Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017

work page 2017

[72] [72]

Jetmoe: Reaching llama2 performance with 0.1 m dollars.arXiv preprint arXiv:2404.07413, 2024

Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. Jetmoe: Reaching llama2 performance with 0.1 m dollars.arXiv preprint arXiv:2404.07413, 2024

work page arXiv 2024

[73] [73]

Simplified State Space Layers for Sequence Modeling

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling.arXiv preprint arXiv:2208.04933, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[74]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023


[75]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014


[76]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025


[77]

Attention Residuals

Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, et al. Attention residuals. arXiv preprint arXiv:2603.15031, 2026


[78]

A rank-invariant method of linear and polynomial regression analysis

Henri Theil. A rank-invariant method of linear and polynomial regression analysis. Indagationes mathematicae, 12(85):173, 1950


[79]

Logistic regression: relating patient characteristics to outcomes

Juliana Tolles and William J Meurer. Logistic regression: relating patient characteristics to outcomes. JAMA, 316(5):533–534, 2016


[80]

Mlp-mixer: An all-mlp architecture for vision

Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021
