Scaling Laws for Behavioral Foundation Models over User Event Sequences

Rickard Brüel Gabrielsson

arxiv: 2606.05257 · v1 · pith:I7JBP5CGnew · submitted 2026-06-03 · 💻 cs.LG · cs.IR


Pith reviewed 2026-06-28 06:46 UTC · model grok-4.3


keywords: scaling laws · behavioral foundation models · user event sequences · embedder parameters · compute optimal allocation · negative sampling · ranking metrics · recommendation models


The pith

A small embedder holding about 2% of parameters is compute-optimal at every tested budget (10^15 to 10^19 FLOPs) for behavioral foundation models on user event sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates scaling laws for foundation models trained on sequences of user actions in areas such as recommendation and commerce. It studies a two-part model, an event embedder followed by a transformer contextualizer, and varies the embedder/contextualizer parameter split, batch size, model/data allocation, and number of sampled negatives across roughly 600 runs spanning 10^15 to 10^19 training FLOPs. The key finding is that a small embedder fraction is compute-optimal at every tested budget, which the paper attributes to embedder parameters being costlier per step and exposed to far more repeated items. The best data-to-parameter ratio also shifts with scale, and loss and ranking metrics diverge in a scale-dependent way, so the choice of metric becomes part of the scaling law.

Core claim

Across roughly 600 runs on real interaction data from 10^15 to 10^19 FLOPs, the compute-optimal embedder share s* is approximately 2% of total parameters at every compute budget tested. The paper attributes this to embedder parameters being more expensive per training step and exposed to far more repeated items than contextualizer parameters. Separately, compute-optimal training starts data-heavy relative to language models, but the D/N ratio approaches the Chinchilla heuristic at higher compute. The sampled training objective and ranking metrics disagree in ways that scale with compute and metric choice, and larger budgets prefer more negatives until, by 10^19 FLOPs, candidate-axis memory rather than FLOPs becomes the binding constraint.

What carries the argument

The argument rests on the parameter split between the feature-based event embedder and the decoder-only transformer: the two parts differ in per-step cost and in how often they see repeated items, and the paper argues that this difference sets the optimal allocation.

If this is right

Embedder size should stay small even as total model size grows.
Data allocation should be larger relative to parameters at smaller compute budgets.
Negative sample count should increase with scale until candidate memory becomes the bottleneck.
The choice of ranking metric affects the optimal training configuration.
Scaling laws for these models must incorporate the evaluation metric as a variable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems could use this to allocate fewer parameters to embeddings and reduce overall training cost.
Similar scaling might hold for other sequence models with high item repetition.
Future work could test if unifying the embedder and contextualizer changes the optimal split.
Disagreement scaling suggests loss functions may need to be adjusted or combined with ranking objectives at large scales.

Load-bearing premise

The trends in optimal embedder fraction and scaling behaviors will continue to hold outside the specific real interaction datasets and two-part architecture used in the experiments.

What would settle it

A new set of scaling runs on different user event data or with a different architecture showing that the optimal embedder fraction changes substantially with compute budget or is not around 2%.

Figures

Figures reproduced from arXiv: 2606.05257 by Rickard Brüel Gabrielsson.

**Figure 1.** Width sweep: every headline eval metric vs. target embedder parameter share s. One curve per compute budget; solid lines are the per-(metric, budget) two-term starvation fits of (2). ⋆ = analytic s⋆ inside the swept range s ∈ [1%, 50%]; + = boundary case (s⋆ extrapolated outside the sweep, almost always coverage@10 at large C). The Kaplan-FLOP-share view of the same cells is Appendix …

**Figure 2.** s⋆(C) summary, per metric. Visual companion to …

**Figure 3.** Per-metric Kaplan fits of (3); dotted lines mark $B_{\mathrm{crit}}^{(m)}$ and $S_{\min}^{(m)}$ per panel. val_loss and recall@10 coincide ($B_{\mathrm{crit}} \approx 574$ and 544); NDCG@10 and MRR@10 sit notably lower; val_entropy has no plateau within our B range (Tab. 3 footnote). EWMA-smoothed trajectories and iso-target crossings are in Appendix 10.
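
The caption refers to a critical batch size and a minimum step count, but the paper's equation (3) is not reproduced on this page. As a rough illustration, the sketch below assumes the standard large-batch trade-off from McCandlish et al. [4], where the number of steps to reach a fixed target is S(B) = S_min · (1 + B_crit / B). The value `B_CRIT = 574` is the val_loss figure quoted in the caption; `S_MIN` and the function names are hypothetical.

```python
# Sketch only: assumes the McCandlish et al. (2018) functional form as a
# stand-in for the paper's equation (3), which is not shown on this page.

def steps_to_target(batch_size: float, s_min: float, b_crit: float) -> float:
    """Optimizer steps needed to reach a fixed target at a given batch size."""
    return s_min * (1.0 + b_crit / batch_size)


def examples_to_target(batch_size: float, s_min: float, b_crit: float) -> float:
    """Training examples consumed to reach the same target (steps x batch)."""
    return batch_size * steps_to_target(batch_size, s_min, b_crit)


B_CRIT = 574.0   # val_loss critical batch size quoted in the Figure 3 caption
S_MIN = 1000.0   # hypothetical minimum step count, for illustration only

for b in (64, 512, 574, 4096):
    steps = steps_to_target(b, S_MIN, B_CRIT)
    examples = examples_to_target(b, S_MIN, B_CRIT)
    print(f"B={b:5d}  steps={steps:10.1f}  examples={examples:12.1f}")
```

Under this assumed form, training at B = B_crit costs twice the minimum number of steps and twice the minimum number of examples, which is why B_crit is a natural operating point.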

**Figure 4.** Iso-FLOP curves per metric, with parabolic fits. Each panel scatters one headline metric vs. N along each compute budget and overlays the quadratic-in-log N fit per budget; ⋆ sits at the parabola's analytic optimum (minimum for losses/entropy, maximum for ranking metrics). The plus-marker variant is used at (COVERAGE@10, C = $10^{18}$) where the coverage curve is essentially flat (a ≈ 0) and the parabola maximum …
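
The quadratic-in-log N fit described in this caption can be sketched as follows. This is a minimal illustration, not the author's fitting code: it fits y = a·x² + b·x + c by least squares on centered x = log10(N) and reads the optimum off the vertex at x = -b / (2a). The parameter counts and loss values are synthetic and hypothetical, and the helper names are made up for this example.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]              # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(m)
    coeffs = []
    for col in range(3):                                         # Cramer's rule
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)                                         # (a, b, c)


def iso_flop_optimum(param_counts, metric_values):
    """Return (N_star, curvature a) from a parabola in log10(N)."""
    logs = [math.log10(n) for n in param_counts]
    center = sum(logs) / len(logs)                               # center for conditioning
    a, b, _ = fit_parabola([x - center for x in logs], metric_values)
    return 10 ** (center - b / (2 * a)), a


# Synthetic, hypothetical iso-FLOP cells: loss is an exact parabola in log10(N).
params = [1e6, 3e6, 1e7, 3e7, 1e8]
losses = [2.0 + 0.3 * (math.log10(n) - 7.1) ** 2 for n in params]
n_star, curvature = iso_flop_optimum(params, losses)
print(f"N* ~ {n_star:.3e}, curvature a = {curvature:.3f}")
```

A positive curvature means the vertex is a minimum (losses, entropy); a negative one means a maximum (ranking metrics). When the curvature is close to zero, as the caption notes for coverage, the vertex is poorly determined and should not be trusted.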

**Figure 5.** Chinchilla frontier per metric. Left: N⋆(C); middle: D⋆(C); right: D⋆/N⋆. Switching the target from val_loss/recall@10 to NDCG@10 pushes the optimum toward smaller models trained on more tokens at every budget.
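
To make the N⋆, D⋆ and D⋆/N⋆ panels concrete, here is a small worked sketch. It assumes the language-model approximation C ≈ 6·N·D and a fixed ratio r = D/N; neither is confirmed by this page as the paper's own FLOP accounting, which may differ because embedder parameters are reported to cost more per step. The ratios used below (20, a commonly quoted Chinchilla-style value, and 200, a hypothetical data-heavy regime) are illustrative only.

```python
import math


def allocate(compute_flops: float, tokens_per_param: float):
    """Split a budget C into (N, D), assuming C = 6*N*D and D = r*N."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d


# Illustrative ratios only; not values reported by the paper.
for c in (1e15, 1e17, 1e19):
    for r in (20.0, 200.0):
        n, d = allocate(c, r)
        print(f"C={c:.0e}  D/N={r:5.0f}  N={n:.3e}  D={d:.3e}")
```

At a fixed budget, raising D/N by a factor of 10 shrinks N by a factor of about 3.16 (the square root of 10), which is the direction the caption describes when the target switches to NDCG@10.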

**Figure 6.** Negative-sampling K-sweep, per headline metric, with the per-(metric, C) starvation/bias fit (6) overlaid. ⋆: analytic K⋆ inside the swept range K ∈ [16k, 2M]; +: K⋆ extrapolated outside the sweep (boundary cases in …

**Figure 7.** Spearman rank correlations between headline metrics. Within either stage the loss/perplexity/entropy/ranking metrics are essentially one quantity; coverage is the only metric that decouples meaningfully.
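
For readers who want to reproduce this kind of comparison on their own runs, a Spearman rank correlation is the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; the metric values are hypothetical and only illustrate a case where lower loss lines up perfectly with higher recall.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-cell metrics: loss falls as recall rises.
val_loss = [3.2, 3.0, 2.9, 2.7]
recall_at_10 = [0.10, 0.12, 0.13, 0.16]
print(spearman(val_loss, recall_at_10))
```

A value near -1 between a loss and a ranking metric means the two order the cells the same way (lower loss, higher recall), so the sign has to be read together with each metric's direction.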

**Figure 8.** Full-catalogue val_loss vs. recall@10. At fixed C, each column is one iso-FLOP budget; curves use the same deployed-catalogue eval pool (only training K differs). Red: val_loss (↓); green: recall@10 (↑); stars mark the swept-grid argmax per metric. … saturates, which cells happen to spread the head distribution furthest is essentially independent of which cells minimize loss. High correlations are not the wh…

**Figure 9.** Batch-local vs. full-catalogue evaluation. Each point is one checkpoint scored under both regimes. Red squares mark Phase 4 K = 0 only. The first two rows are Stage 2 K-sweep strata (B = 64, C = $10^{15}$ and B = 512, C = $10^{19}$). The third row is the Phase 3 iso-FLOP architecture grid at C = $10^{19}$, B = 512 (n = 6 cells re-evaluated under full-catalogue). Column 2 is the cross-metric panel: batch-local val_loss vs…

**Figure 10.** Embedder-share sweep (val_loss only). Left: validation loss vs. embedder share over s ∈ [0, 50]% at four compute budgets. Curves are monotone increasing in s at every budget over s ∈ [6, 50]%. Right: zoom on s ∈ [0, 6]% around the per-budget optimum. Solid lines are per-budget two-term starvation fits $L(s) = E + a s^{\alpha} + b s^{-\beta}$; stars mark the closed-form analytic optimum $s^{\star} = (b\beta/(a\alpha))^{1/(\alpha+\beta)}$ ( …
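
The closed-form optimum in this caption follows from setting dL/ds = aα·s^(α-1) - bβ·s^(-β-1) = 0, which gives s^(α+β) = bβ/(aα). The sketch below evaluates that formula and checks it against a brute-force grid search. The coefficients are hypothetical and were picked by hand so that the optimum lands at 2%; they are not the paper's fitted values.

```python
def starvation_loss(s, E, a, alpha, b, beta):
    """Two-term starvation fit L(s) = E + a*s^alpha + b*s^(-beta)."""
    return E + a * s ** alpha + b * s ** (-beta)


def analytic_optimum(a, alpha, b, beta):
    """Closed-form minimizer s* = (b*beta / (a*alpha))^(1 / (alpha + beta))."""
    return (b * beta / (a * alpha)) ** (1.0 / (alpha + beta))


# Hypothetical coefficients, chosen for illustration so that s* = 0.02.
E, a, alpha, b, beta = 2.0, 1.0, 1.0, 4e-4, 1.0

s_star = analytic_optimum(a, alpha, b, beta)

# Brute-force check on a grid of embedder shares from 0.5% to 50%.
grid = [i / 1000 for i in range(5, 501)]
s_grid = min(grid, key=lambda s: starvation_loss(s, E, a, alpha, b, beta))

print(f"analytic s* = {s_star:.4f}, grid argmin = {s_grid:.4f}")
```

The a·s^α term penalizes an oversized embedder and the b·s^(-β) term penalizes a starved one, so the optimum sits where the two marginal costs balance.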

**Figure 11.** Width sweep vs. Kaplan embedder-side FLOP share f. Same cells as …

**Figure 12.** N/D proxy diagnostics on the width sweep. Varying s moves compute between embedder and contextualizer at nearly fixed training-to-parameter ratio: $N_{\mathrm{emb+ctx}}/D_{\mathrm{proxy}}$ is flat in s (a), while validation loss tracks s much more strongly than this residual ratio (b). Chinchilla diagnostic. Train and val parabolic minima need not coincide: they diverge notably at C = $10^{18}$ (39.5 M train vs. 19.4 M val) but land ne…

**Figure 13.** Depth-sweep parabolic fits behind …

**Figure 14.** Depth sweep: per-metric eval vs. s and Kaplan FLOP share f. Stars mark per-budget optima. Gray band: width-sweep s⋆ ≈ 2%.

**Figure 15.** Per-metric trajectories vs. batch size. One curve per B; dashed line: iso-target $T_m$.

**Figure 16.** Training-loss channels at C = $10^{17}$ (not full-catalogue evaluation). Blue dashed: in-batch CE; red dashed: extra-negative CE (only for K > 0); black: their average, which SGD minimizes. The extra channel rises with K by construction; the in-batch channel moves with the checkpoint trained at each K. … extra CE over the K sampled catalog negatives, averaged for optimization. These curves are not full-catalogue…
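
The two loss channels named in this caption can be sketched in a few lines. This is an assumption-laden illustration, not the paper's training code: it treats each channel as a softmax cross-entropy of the positive item against its own pool of negatives (in-batch items, or K sampled catalogue items) and averages the two when K > 0. Function and argument names are hypothetical.

```python
import math


def cross_entropy(pos_score, neg_scores):
    """Negative log softmax probability of the positive among positive + negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


def combined_loss(pos_score, in_batch_negs, extra_negs):
    """Average of the in-batch and extra-negative channels (in-batch only if K = 0)."""
    in_batch = cross_entropy(pos_score, in_batch_negs)
    if not extra_negs:                                # K = 0: no extra channel
        return in_batch
    extra = cross_entropy(pos_score, extra_negs)
    return 0.5 * (in_batch + extra)


# With uninformative (all-zero) scores each channel equals log(pool size),
# so the extra channel grows with K even though the model is unchanged.
for k in (0, 7, 63):
    print(k, combined_loss(0.0, [0.0] * 3, [0.0] * k))
```

The loop shows why the combined objective is not comparable across K: the extra channel's floor moves with the number of negatives, independent of model quality.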

**Figure 17.** Analytic K⋆(C) from (6), per metric (interior-cell summary of …

**Figure 18.** Negative-sampling efficiency surrogacy. Cell color is Spearman $\rho_S$ between the K-comparable in-batch training loss and the eval metric across the K axis (the optimized combined objective folds in a K-dependent extra-negative channel and is not comparable across K; cf. …

**Figure 19.** MuP halves the LR drift but does not improve loss. Solid: MuP, dashed: Default. MuP optima land at $10^{-2}$ at TINY, SMALL and LARGE, and at $5\cdot10^{-3}$ at MEDIUM (a 0.30-decade band) versus 0.70 decades for Default. Default initialization sits below MuP at every model size.
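
The 0.30-decade figure for MuP can be checked directly from the learning rates quoted in the caption: the band is log10 of the ratio between the largest and smallest optimum. The Default optima are not listed on this page, so only the MuP side is computed here.

```python
import math

# MuP learning-rate optima as quoted in the Figure 19 caption.
mup_optima = {"TINY": 1e-2, "SMALL": 1e-2, "MEDIUM": 5e-3, "LARGE": 1e-2}

band_decades = math.log10(max(mup_optima.values()) / min(mup_optima.values()))
print(round(band_decades, 2))  # 0.3
```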

**Figure 20.** Correlation matrices stratified by budget. The loss/perplexity/entropy/ranking block stays saturated at every C, but the coverage rows and columns visibly fade with scale. … direction across K, so pushing K up reduces both) through near-zero at C ∈ {$10^{16}$, $10^{17}$} to −1.00 at C = $10^{18}$ (perfect alignment in the expected direction). (ii) The loss–coverage sign flips: −0.97 to −0.22 in Stage 1 (bigger architecture…

**Figure 18.** Compare with the Stage 1 table (16): loss–ranking is no longer locked at …

**Figure 21.** Same recall@10, very different absolute val_losses depending on stage. Blue: Stage 1 (batch-local pool); red: Stage 2 (full catalogue). Within either cloud $|\rho_S| \ge 0.97$; across clouds the link breaks because the partition function is computed over a ∼1500× larger set. … H Context-Length Scoring Robustness (full slice matrices). This appendix backs axis (e) of §7. The eval pipeline stratifies every batch by scor…

**Figure 22.** Context-length scoring robustness: worst-case cell-ranking agreement across history-position slices, per regime. Per (regime, metric, budget) we compute the Spearman ρ between every pair of scoring slices {ctx_3, ctx_5, ctx_10, ctx_20, ctx_50, ctx_100, val/all} and plot the minimum off-diagonal pair. Left: batch-local Stage 1 evals (architecture sweep): every headline ranking metric stays $\rho_{\min} \ge 0.93$ at C …

**Figure 23.** Per-budget Spearman ρ between context-length scoring slices, batch-local evals (Phase 1W+1D+3 architecture sweep). Each row is a metric, each column a budget; the per-budget cell-count n is in the column header. Every cell is the Spearman rank correlation between two scoring-slice columns across the architectural cells at that budget.

**Figure 24.** Per-budget Spearman ρ between context-length scoring slices, full-catalogue evals (Phase 4 K-sweep). Same axes as …

Original abstract

Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
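
The deployed ranking metrics named in the figure captions (recall@10, NDCG@10, MRR@10) can be sketched as follows. The Metric Definitions excerpt on this page describes ranking candidates by the dot-product score between a query vector and item vectors; everything else here is an assumption for illustration, in particular that each query has exactly one relevant item (the true next event) and that ties are broken in favour of the target. The function names and toy vectors are hypothetical.

```python
import math


def rank_of_target(query_vec, item_vecs, target_index):
    """1-based rank of the target item under dot-product scoring."""
    scores = [sum(q * e for q, e in zip(query_vec, item)) for item in item_vecs]
    target = scores[target_index]
    # Assumption: ties favour the target (only strictly higher scores outrank it).
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(rank, k=10):
    """1 if the single relevant item is in the top k, else 0."""
    return 1.0 if rank <= k else 0.0


def mrr_at_k(rank, k=10):
    """Reciprocal rank, truncated at k."""
    return 1.0 / rank if rank <= k else 0.0


def ndcg_at_k(rank, k=10):
    """NDCG with one relevant item: the ideal DCG is 1, so NDCG = 1/log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy example: three candidate items, the true next event is item 1.
query = [1.0, 0.0]
items = [[0.9, 0.0], [0.5, 0.0], [0.1, 0.0]]
r = rank_of_target(query, items, target_index=1)
print(r, recall_at_k(r), mrr_at_k(r), ndcg_at_k(r))
```

Recall@10 only asks whether the target made the cut, while MRR@10 and NDCG@10 also reward placing it higher, which is one reason different metrics can prefer different training recipes.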

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps practical compute splits for embedder-transformer behavioral models on real event data, with a consistent 2% embedder optimum and metric-dependent trends, but the repetition explanation for that optimum is not isolated from the data statistics.


The core result is that across 600 runs on interaction sequences, a two-part model with roughly 2% of parameters in the embedder is compute-optimal at every scale tested, training is data-heavy at low compute before approaching Chinchilla ratios, and loss versus ranking metrics diverge more as compute grows. Negative count preferences also shift with budget until memory bounds them.

The experiment jointly sweeps the parameter split, batch size, data allocation, and post-freeze negatives over four orders of magnitude in FLOPs (10^15 to 10^19). That joint variation on real recommendation-style data is the main addition relative to language-model scaling work. The observation that the choice of evaluation metric changes the optimal recipe is a useful practical note.

The repetition-based account for why the embedder should be small is reasonable given how item frequencies work in these datasets, but the runs do not include controlled variation of repetition rate, so the causal link stays observational rather than isolated. Details on error bars, data exclusion, and whether metric choices were pre-specified would help judge how stable the reported optima are.
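
The referee report below suggests synthetic data with adjustable Zipf exponents as one way to vary repetition in a controlled manner. A minimal sketch of such a generator follows, under stated assumptions: items are drawn independently with probability proportional to 1/rank^exponent, and "repetition" is measured as the fraction of events whose item has already appeared. The function names, catalogue size, and sequence length are hypothetical and not taken from the paper.

```python
import random


def zipf_sequence(n_events, catalogue_size, exponent, seed=0):
    """Draw item ids i.i.d. with probability proportional to 1 / rank**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, catalogue_size + 1)]
    return rng.choices(range(catalogue_size), weights=weights, k=n_events)


def repeat_fraction(events):
    """Fraction of events whose item has been seen earlier in the sequence."""
    seen = set()
    repeats = 0
    for item in events:
        if item in seen:
            repeats += 1
        seen.add(item)
    return repeats / len(events)


# A steeper exponent concentrates mass on head items and raises repetition.
for exponent in (0.5, 1.0, 1.5):
    events = zipf_sequence(n_events=5000, catalogue_size=10000, exponent=exponent)
    print(exponent, round(repeat_fraction(events), 3))
```

Sweeping the exponent while holding architecture and budget fixed would be one way to test whether the optimal embedder share moves with repetition, which is the control the letter says is missing.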

This is aimed at teams training behavioral models for commerce or recommendations who need allocation heuristics beyond text scaling laws. The empirical volume is large enough that a serious referee should see it, even if the explanatory part would benefit from tighter controls.

Referee Report

2 major / 1 minor

Summary. The paper claims that for two-part behavioral foundation models (feature-based event embedder + decoder-only transformer) trained on user event sequences, a small embedder fraction s* ≈ 2% of total parameters is compute-optimal across all tested budgets from 10^15 to 10^19 FLOPs. This is attributed to embedder parameters being more expensive per step and exposed to more repeated items. The work also reports that compute-optimal D/N ratios start data-heavy relative to Chinchilla but approach it at higher compute, that critical batch size and optimal negative count after freezing shift with scale, and that sampled loss and ranking metrics disagree in ways that themselves scale with compute and metric choice. These conclusions rest on ~600 runs on real interaction datasets jointly varying parameter split, batch size, model/data allocation, and negative count.

Significance. If the central empirical trends hold, the paper supplies the first large-scale compute-optimal calibration for behavioral models on event sequences, directly relevant to recommendation, payments, and fraud domains. The scale of the experimental campaign (roughly 600 runs spanning four orders of magnitude in FLOPs, 10^15 to 10^19) is a clear strength and provides substantial empirical support for the observed trends in embedder fraction and negative-sampling preferences.

major comments (2)

[Abstract] Abstract: the explanatory claim that s* ≈ 2% optimality arises 'because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters' is load-bearing for the interpretation. All 600 runs use real interaction datasets that exhibit high item repetition; the manuscript reports no controlled experiments that vary repetition rate (e.g., synthetic data with adjustable Zipf exponents) while holding other factors fixed, leaving open the possibility that the observed optimum is an artifact of the repetition statistics rather than a general architectural property.
[Abstract] Abstract and experimental description: the 2% optimality claim and the scaling trends for critical batch size and negative count lack reported error bars, statistical significance tests for the 2% figure, explicit data-exclusion rules, or analysis of whether post-hoc metric choices affect the central trends. These omissions make it difficult to assess robustness of the reported optima.

minor comments (1)

[Abstract] Abstract: the symbol s* is used before any definition or parenthetical explanation; a brief inline clarification would improve readability for readers encountering the abstract first.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and detailed comments. We address each major point below, indicating planned revisions where appropriate.


Referee: [Abstract] Abstract: the explanatory claim that s* ≈ 2% optimality arises 'because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters' is load-bearing for the interpretation. All 600 runs use real interaction datasets that exhibit high item repetition; the manuscript reports no controlled experiments that vary repetition rate (e.g., synthetic data with adjustable Zipf exponents) while holding other factors fixed, leaving open the possibility that the observed optimum is an artifact of the repetition statistics rather than a general architectural property.

Authors: We agree that our experiments are confined to real interaction datasets exhibiting high item repetition and that we did not perform controlled synthetic experiments varying repetition rates (e.g., via adjustable Zipf exponents). The 2% optimum and its proposed mechanism are therefore tied to the statistical properties of the real data used. While the trend is consistent across multiple distinct real-world datasets, this does not fully rule out dataset-specific artifacts. We will revise the abstract and discussion sections to present the explanation as a hypothesis grounded in architectural differences and observed repetition patterns in behavioral data, rather than a proven general causal factor, and will explicitly note the lack of synthetic controls as a limitation. (Revision: partial.)
Referee: [Abstract] Abstract and experimental description: the 2% optimality claim and the scaling trends for critical batch size and negative count lack reported error bars, statistical significance tests for the 2% figure, explicit data-exclusion rules, or analysis of whether post-hoc metric choices affect the central trends. These omissions make it difficult to assess robustness of the reported optima.

Authors: We acknowledge these gaps in the current version. In revision we will add error bars (computed from replicate runs where available) to all figures reporting the 2% optimum and scaling trends for batch size and negative count. We will include statistical significance tests for the identified optima. Data-exclusion criteria (based on convergence thresholds and outlier detection) will be stated explicitly in the experimental section. We will also add an analysis of how the central trends vary with different post-hoc metric choices and report the sensitivity of the scaling conclusions to metric selection. (Revision: yes.)

Circularity Check

0 steps flagged

No circularity; results are direct empirical observations from 600 runs


The paper presents scaling trends as outcomes of joint variation across four axes in ~600 real-data training runs spanning 10^15 to 10^19 FLOPs. Optimal embedder fraction s*≈2%, D/N ratios, negative counts, and metric disagreements are reported as measured quantities rather than quantities derived from equations that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text; the 'because' clause in the abstract is an interpretive summary of the observed trends, not a mathematical reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is empirical; the main ledger items are the observed optimal values rather than new theoretical constructs. The 2% embedder fraction is measured, not postulated a priori.

free parameters (1)

optimal embedder fraction s* = 0.02
Observed as approximately 2% from the joint variation of parameter split across the 600 runs; used as the central reported optimum.

axioms (1)

Domain assumption: the two-part feature-based embedder plus decoder-only transformer is a suitable architecture for modeling user event sequences.
Invoked by the choice of model family studied throughout the experiments.



Reference graph

Works this paper leans on

26 extracted references · 10 canonical work pages · 5 internal anchors

[1] J. Kaplan et al. Scaling laws for neural language models. arXiv:2001.08361, 2020.
[2] J. Hoffmann et al. Training compute-optimal large language models (Chinchilla). arXiv:2203.15556, 2022.
[3] G. Yang and E. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks (MuP). In ICML, 2021.
[4] S. McCandlish, J. Kaplan, D. Amodei, and the OpenAI Dota Team. An empirical model of large-batch training. arXiv:1812.06162, 2018.
[5] J. Zhai et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations (HSTU). In ICML, 2024.
[6] B. Zhang et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv:2403.02545, 2024.
[7] N. Ardalani et al. Understanding scaling laws for recommendation models. arXiv:2208.08489, 2022.
[8] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.
[9] J.-B. Alayrac et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[10] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
[11] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In ACM RecSys, 2016.
[12] Y. Bengio and J.-S. Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
[13] Netflix Technology Blog. Foundation Model for Personalized Recommendation. Mar. 2025. https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39
[14] C.-C. M. Yeh, U. S. Saini, X. Dai, X. Fan, S. Jain et al. TREASURE: A transformer-based foundation model for high-volume transaction understanding (Visa Payment Foundation Model). arXiv:2511.19693, 2025.
[15] Y. Dou, Z. Jiang, T. Zhang, M. Hu, Z. Xu, Y. Chen et al. TransactionGPT. arXiv:2511.08939, 2025.
[16] G. Kedia and the Stripe Machine Learning Team. Stripe's Payments Foundation Model. Stripe Sessions / Stripe Engineering, May 2025.
[17] V. Iashin et al. PRAGMA: Revolut foundation model. arXiv:2604.08649, 2026.
[18] M. Kawawa-Beaudan, D. Borrajo, M. Veloso et al. TradeFM: A generative foundation model for trade-flow and market microstructure. arXiv:2602.23784, 2026.
[19] R. Brüel Gabrielsson et al. A foundation model for consumption, transactions, and actions: The inception of BehaviorGPT. Unbox AI Research, 2025.
[20] R. Brüel Gabrielsson and V. Gupta. BehaviorGPT at work: A foundation model for workforce actions and dynamics. Unbox AI Research, 2025.
[21] R. Brüel Gabrielsson and V. Gupta. BehaviorGPT for visual art: A foundation model for aesthetics. Unbox AI Research, 2025.
[22] R. Brüel Gabrielsson et al. Large behavioral models: A foundation-model paradigm for human actions. Unbox AI Research, 2026.
[23] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[24] E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC-8, 1999.
[25] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[26] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.

… “loss” here and every ranking, coverage and entropy metric are full-catalogue evaluation quantities, computed against the full ∼13.6M-item Stage-2 catalogue: “loss …

A Metric Definitions. Notation and candidate set. Every metric scores each query position against a candidate set C and ranks its items by the dot-product score $z_{q,j} = \langle h_q, e_j \rangle$, w...
