
arxiv: 2606.05257 · v1 · pith:I7JBP5CGnew · submitted 2026-06-03 · 💻 cs.LG · cs.IR

Scaling Laws for Behavioral Foundation Models over User Event Sequences

Pith reviewed 2026-06-28 06:46 UTC · model grok-4.3

classification 💻 cs.LG cs.IR
keywords: scaling laws, behavioral foundation models, user event sequences, embedder parameters, compute optimal allocation, negative sampling, ranking metrics, recommendation models

The pith

A small embedder of about 2% of parameters is compute-optimal at every budget for behavioral foundation models on user event sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates scaling laws for foundation models trained on sequences of user actions in areas such as recommendation and commerce. It uses a two-part model, an event embedder followed by a transformer contextualizer, and varies the parameter allocation, batch size, data allocation, and number of negative samples across roughly 600 runs spanning 10^15 to 10^19 training FLOPs. The key finding is that a small embedder fraction is optimal at every tested budget, which the paper attributes to embedder parameters being costlier per step and seeing more repeated items. It also finds that the best data-to-parameter ratio shifts with scale and that loss and ranking metrics diverge in a scale-dependent way, making the metric choice part of the scaling law.

Core claim

Across roughly 600 runs on real interaction data from 10^15 to 10^19 FLOPs, the optimal embedder share s* is approximately 2% of total parameters at every compute budget tested. The paper attributes this to embedder parameters being more expensive per training step and exposed to far more repeated items than contextualizer parameters. Separately, compute-optimal training starts data-heavy relative to language models, but the data-to-parameter ratio D/N approaches the Chinchilla heuristic at higher compute. The sampled training objective and ranking metrics disagree in ways that scale with compute and metric choice, with larger budgets preferring more negatives until memory limits are hit.
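As a rough illustration of what a D/N ratio implies for a given budget, the sketch below splits a FLOP budget into parameters N and training tokens D. It assumes the common language-model approximation C ≈ 6·N·D and uses the often-quoted Chinchilla ratio of about 20 tokens per parameter as an example value; neither is taken from this paper's FLOP accounting, and the function name `allocate` is hypothetical.

```python
import math

def allocate(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Split a FLOP budget into (N, D) under the ASSUMED rule C ≈ 6 * N * D."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("inputs must be positive")
    # With D = r * N, C = 6 * r * N^2, so N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example ratio of 20 tokens per parameter (an assumption, not a paper value).
for c in (1e15, 1e17, 1e19):
    n, d = allocate(c, tokens_per_param=20.0)
    print(f"C={c:.0e}: N≈{n:.3g} params, D≈{d:.3g} tokens")
```

In this framing, a data-heavy regime corresponds to a larger `tokens_per_param` at the same C, which yields a smaller model trained on more data.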

What carries the argument

The parameter split between the feature-based event embedder and the decoder-only transformer, which determines the optimal allocation because of differing per-step costs and repetition rates.

If this is right

  • Embedder size should stay small even as total model size grows.
  • Data allocation should be larger relative to parameters at smaller compute budgets.
  • Negative sample count should increase with scale until candidate memory becomes the bottleneck.
  • The choice of ranking metric affects the optimal training configuration.
  • Scaling laws for these models must incorporate the evaluation metric as a variable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production systems could use this to allocate fewer parameters to embeddings and reduce overall training cost.
  • Similar scaling might hold for other sequence models with high item repetition.
  • Future work could test if unifying the embedder and contextualizer changes the optimal split.
  • Disagreement scaling suggests loss functions may need to be adjusted or combined with ranking objectives at large scales.

Load-bearing premise

The trends in optimal embedder fraction and scaling behaviors will continue to hold outside the specific real interaction datasets and two-part architecture used in the experiments.

What would settle it

A new set of scaling runs on different user event data or with a different architecture showing that the optimal embedder fraction changes substantially with compute budget or is not around 2%.

Figures

Figures reproduced from arXiv: 2606.05257 by Rickard Brüel Gabrielsson.

Figure 1: Width sweep: every headline eval metric vs. target embedder parameter share s. One curve per compute budget; solid lines are the per-(metric, budget) two-term starvation fits of (2). ⋆ = analytic s⋆ inside the swept range s ∈ [1%, 50%]; + = boundary case (s⋆ extrapolated outside the sweep, almost always coverage@10 at large C). The Kaplan-FLOP-share view of the same cells is Appendix …
Figure 2: s⋆(C) summary, per metric. Visual companion to …
Figure 3: Per-metric Kaplan fits of (3); dotted lines mark B^(m)_crit and S^min_m per panel. val_loss and recall@10 coincide (B_crit ≈ 574 and 544); NDCG@10 and MRR@10 sit notably lower; val_entropy has no plateau within our B range (Tab. 3 footnote). EWMA-smoothed trajectories and iso-target crossings are in Appendix 10.
Figure 4: Iso-FLOP curves per metric, with parabolic fits. Each panel scatters one headline metric vs. N along each compute budget and overlays the quadratic-in-log N fit per budget; ⋆ sits at the parabola's analytic optimum (minimum for losses/entropy, maximum for ranking metrics). The plus-marker variant is used at (COVERAGE@10, C = 10^18) where the coverage curve is essentially flat (a ≈ 0) and the parabola maximum …
Figure 5: Chinchilla frontier per metric. Left: N⋆(C); middle: D⋆(C); right: D⋆/N⋆. Switching the target from val_loss/recall@10 to NDCG@10 pushes the optimum toward smaller models trained on more tokens at every budget.
Figure 6: Negative-sampling K-sweep, per headline metric, with the per-(metric, C) starvation/bias fit (6) overlaid. ⋆: analytic K⋆ inside the swept range K ∈ [16k, 2M]; +: K⋆ extrapolated outside the sweep (boundary cases in …
Figure 7: Spearman rank correlations between headline metrics. Within either stage the loss/perplexity/entropy/ranking metrics are essentially one quantity; coverage is the only metric that decouples meaningfully.
Figure 8: Full-catalogue val_loss vs. recall@10. At fixed C, each column is one iso-FLOP budget; curves use the same deployed-catalogue eval pool (only training K differs). Red: val_loss (↓); green: recall@10 (↑); stars mark the swept-grid argmax per metric.
… saturates, which cells happen to spread the head distribution furthest is essentially independent of which cells minimize loss. High correlations are not the wh…
Figure 9: Batch-local vs. full-catalogue evaluation. Each point is one checkpoint scored under both regimes. Red squares mark Phase 4 K = 0 only. The first two rows are Stage 2 K-sweep strata (B = 64, C = 10^15 and B = 512, C = 10^19). The third row is the Phase 3 iso-FLOP architecture grid at C = 10^19, B = 512 (n = 6 cells re-evaluated under full-catalogue). Column 2 is the cross-metric panel: batch-local val_loss vs…
Figure 10: Embedder-share sweep (val_loss only). Left: validation loss vs. embedder share over s ∈ [0, 50]% at four compute budgets. Curves are monotone increasing in s at every budget over s ∈ [6, 50]%. Right: zoom on s ∈ [0, 6]% around the per-budget optimum. Solid lines are per-budget two-term starvation fits L(s) = E + a s^α + b s^(−β); stars mark the closed-form analytic optimum s⋆ = (bβ/(aα))^(1/(α+β)) ( …
Figure 11: Width sweep vs. Kaplan embedder-side FLOP share f. Same cells as …
Figure 12: N/D proxy diagnostics on the width sweep. Varying s moves compute between embedder and contextualizer at nearly fixed training-to-parameter ratio: N_emb+ctx/D_proxy is flat in s (a), while validation loss tracks s much more strongly than this residual ratio (b).
Chinchilla diagnostic. Train and val parabolic minima need not coincide: they diverge notably at C = 10^18 (39.5M train vs. 19.4M val) but land ne…
Figure 13: Depth-sweep parabolic fits behind …
Figure 14: Depth sweep: per-metric eval vs. s and Kaplan FLOP share f. Stars mark per-budget optima. Gray band: width-sweep s⋆ ≈ 2%.
Figure 15: Per-metric trajectories vs. batch size. One curve per B; dashed line: iso-target T_m.
Figure 16: Training-loss channels at C = 10^17 (not full-catalogue evaluation). Blue dashed: in-batch CE; red dashed: extra-negative CE (only for K > 0); black: their average, which SGD minimizes. The extra channel rises with K by construction; the in-batch channel moves with the checkpoint trained at each K.
… extra CE over the K sampled catalog negatives, averaged for optimization. These curves are not full-catalogue…
Figure 17: Analytic K⋆(C) from (6), per metric (interior-cell summary of …
Figure 18: Negative-sampling efficiency surrogacy. Cell color is Spearman ρ_S between the K-comparable in-batch training loss and the eval metric across the K axis (the optimized combined objective folds in a K-dependent extra-negative channel and is not comparable across K; cf. …
Figure 19: MuP halves the LR drift but does not improve loss. Solid: MuP, dashed: Default. MuP optima land at 10^−2 at TINY, SMALL and LARGE, and at 5·10^−3 at MEDIUM (a 0.30-decade band) versus 0.70 decades for Default. Default initialization sits below MuP at every model size.
Figure 20: Correlation matrices stratified by budget. The loss/perplexity/entropy/ranking block stays saturated at every C, but the coverage rows and columns visibly fade with scale.
… direction across K, so pushing K up reduces both) through near-zero at C ∈ {10^16, 10^17} to −1.00 at C = 10^18 (perfect alignment in the expected direction). (ii) The loss–coverage sign flips: −0.97 to −0.22 in Stage 1 (bigger architecture…
Figure 18: Compare with the Stage 1 table (16): loss–ranking is no longer locked at …
Figure 21: Same recall@10, very different absolute val_losses depending on stage. Blue: Stage 1 (batch-local pool); red: Stage 2 (full catalogue). Within either cloud |ρ_S| ≥ 0.97; across clouds the link breaks because the partition function is computed over a ∼1500× larger set.
H Context-Length Scoring Robustness (full slice matrices). This appendix backs axis (e) of §7. The eval pipeline stratifies every batch by scor…
Figure 22: Context-length scoring robustness: worst-case cell-ranking agreement across history-position slices, per regime. Per (regime, metric, budget) we compute the Spearman ρ between every pair of scoring slices {ctx_3, ctx_5, ctx_10, ctx_20, ctx_50, ctx_100, val/all} and plot the minimum off-diagonal pair. Left: batch-local Stage 1 evals (architecture sweep): every headline ranking metric stays ρ_min ≥ 0.93 at C …
Figure 23: Per-budget Spearman ρ between context-length scoring slices, batch-local evals (Phase 1W+1D+3 architecture sweep). Each row is a metric, each column a budget; the per-budget cell-count n is in the column header. Every cell is the Spearman rank correlation between two scoring-slice columns across the architectural cells at that budget.
Figure 24: Per-budget Spearman ρ between context-length scoring slices, full-catalogue evals (Phase 4 K-sweep). Same axes as …
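The closed-form optimum quoted in the Figure 10 caption follows from setting the derivative of the two-term fit L(s) = E + a s^α + b s^(−β) to zero, which gives s^(α+β) = bβ/(aα). The sketch below evaluates that formula and checks it against a brute-force grid search. The coefficient values are made up for illustration and are not the paper's fitted values; the function names are likewise hypothetical.

```python
def starvation_loss(s, E, a, alpha, b, beta):
    """Two-term fit from the Figure 10 caption: L(s) = E + a*s^alpha + b*s^(-beta)."""
    return E + a * s**alpha + b * s ** (-beta)

def analytic_optimum(a, alpha, b, beta):
    """Stationary point of L(s): s* = (b*beta / (a*alpha))^(1 / (alpha + beta))."""
    return (b * beta / (a * alpha)) ** (1.0 / (alpha + beta))

# Hypothetical coefficients, chosen only so the optimum lands inside (0, 0.5].
E, a, alpha, b, beta = 2.0, 1.5, 1.0, 0.001, 0.5

s_star = analytic_optimum(a, alpha, b, beta)

# Brute-force check on a fine grid of embedder shares s in (0, 0.5].
grid = [i / 100000 for i in range(1, 50001)]
s_grid = min(grid, key=lambda s: starvation_loss(s, E, a, alpha, b, beta))

print(f"analytic s* = {s_star:.5f}, grid argmin = {s_grid:.5f}")
```

The first term penalizes large embedder shares and the second penalizes very small ones, so the optimum sits where the two marginal costs balance.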
Original abstract

Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
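Since the abstract's conclusion hinges on loss and ranking metrics disagreeing, it helps to see how both are computed from the same scores. The sketch below is a minimal illustration, assuming one relevant item per query position and the standard textbook definitions of recall@k, MRR@k, and NDCG@k; the paper's own definitions live in its appendix and may differ in detail. All function names and score values are illustrative.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target item; ties count against the target."""
    t = scores[target]
    return 1 + sum(1 for j, s in enumerate(scores) if j != target and s >= t)

def cross_entropy(scores, target):
    """Softmax cross-entropy of the target over the candidate set."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def metrics_at_k(scores, target, k=10):
    r = rank_of_target(scores, target)
    hit = r <= k
    return {
        "loss": cross_entropy(scores, target),
        "recall": 1.0 if hit else 0.0,
        "mrr": 1.0 / r if hit else 0.0,
        # With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(1 + rank).
        "ndcg": 1.0 / math.log2(r + 1) if hit else 0.0,
    }

# Two score vectors that place the target (index 0) at the same rank
# but with very different margins to the top competitor.
close = metrics_at_k([2.0, 2.1, 0.0, 0.0], target=0)
far = metrics_at_k([2.0, 6.0, 0.0, 0.0], target=0)
print(close)
print(far)
```

Ranking metrics depend only on the target's rank, while cross-entropy also depends on score margins, so two models can tie on recall@10 yet differ widely in loss. This is one general reason the two can diverge, not a reproduction of the paper's analysis.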

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that for two-part behavioral foundation models (feature-based event embedder + decoder-only transformer) trained on user event sequences, a small embedder fraction s* ≈ 2% of total parameters is compute-optimal across all tested budgets from 10^15 to 10^19 FLOPs. This is attributed to embedder parameters being more expensive per step and exposed to more repeated items. The work also reports that compute-optimal D/N ratios start data-heavy relative to Chinchilla but approach it at higher compute, that critical batch size and optimal negative count after freezing shift with scale, and that sampled loss and ranking metrics disagree in ways that themselves scale with compute and metric choice. These conclusions rest on ~600 runs on real interaction datasets jointly varying parameter split, batch size, model/data allocation, and negative count.

Significance. If the central empirical trends hold, the paper supplies the first large-scale compute-optimal calibration for behavioral models on event sequences, directly relevant to recommendation, payments, and fraud domains. The scale of the experimental campaign (roughly 600 runs spanning four orders of magnitude in FLOPs, from 10^15 to 10^19) is a clear strength and provides substantial empirical support for the observed trends in embedder fraction and negative-sampling preferences.

major comments (2)
  1. [Abstract] Abstract: the explanatory claim that s* ≈ 2% optimality arises 'because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters' is load-bearing for the interpretation. All 600 runs use real interaction datasets that exhibit high item repetition; the manuscript reports no controlled experiments that vary repetition rate (e.g., synthetic data with adjustable Zipf exponents) while holding other factors fixed, leaving open the possibility that the observed optimum is an artifact of the repetition statistics rather than a general architectural property.
  2. [Abstract] Abstract and experimental description: the 2% optimality claim and the scaling trends for critical batch size and negative count lack reported error bars, statistical significance tests for the 2% figure, explicit data-exclusion rules, or analysis of whether post-hoc metric choices affect the central trends. These omissions make it difficult to assess robustness of the reported optima.
minor comments (1)
  1. [Abstract] Abstract: the symbol s* is used before any definition or parenthetical explanation; a brief inline clarification would improve readability for readers encountering the abstract first.
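The controlled experiment requested in major comment 1 could start from a synthetic event stream whose repetition rate is tunable. The sketch below draws item IDs from a truncated Zipf distribution (probability proportional to rank^(−exponent)) and reports the fraction of events whose item has already appeared in the stream. This is a hypothetical setup, not something the paper reports; the catalogue size, stream length, and exponents are arbitrary.

```python
import random

def zipf_weights(n_items, exponent):
    """Unnormalized truncated-Zipf weights: weight(rank) = rank^(-exponent)."""
    return [1.0 / (rank**exponent) for rank in range(1, n_items + 1)]

def sample_events(n_items, n_events, exponent, seed=0):
    """Draw a stream of item IDs; exponent=0 gives a uniform catalogue."""
    rng = random.Random(seed)
    weights = zipf_weights(n_items, exponent)
    return rng.choices(range(n_items), weights=weights, k=n_events)

def repeat_fraction(events):
    """Fraction of events whose item was already seen earlier in the stream."""
    seen, repeats = set(), 0
    for item in events:
        if item in seen:
            repeats += 1
        seen.add(item)
    return repeats / len(events)

for exponent in (0.0, 1.0, 2.0):
    events = sample_events(n_items=10_000, n_events=5_000, exponent=exponent)
    print(f"exponent={exponent}: repeat fraction={repeat_fraction(events):.3f}")
```

Sweeping the exponent while holding architecture and compute fixed would show whether the optimal embedder share moves with repetition rate, which is the dependence the referee asks the authors to isolate.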

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and detailed comments. We address each major point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the explanatory claim that s* ≈ 2% optimality arises 'because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters' is load-bearing for the interpretation. All 600 runs use real interaction datasets that exhibit high item repetition; the manuscript reports no controlled experiments that vary repetition rate (e.g., synthetic data with adjustable Zipf exponents) while holding other factors fixed, leaving open the possibility that the observed optimum is an artifact of the repetition statistics rather than a general architectural property.

    Authors: We agree that our experiments are confined to real interaction datasets exhibiting high item repetition and that we did not perform controlled synthetic experiments varying repetition rates (e.g., via adjustable Zipf exponents). The 2% optimum and its proposed mechanism are therefore tied to the statistical properties of the real data used. While the trend is consistent across multiple distinct real-world datasets, this does not fully rule out dataset-specific artifacts. We will revise the abstract and discussion sections to present the explanation as a hypothesis grounded in architectural differences and observed repetition patterns in behavioral data, rather than a proven general causal factor, and will explicitly note the lack of synthetic controls as a limitation. revision: partial

  2. Referee: [Abstract] Abstract and experimental description: the 2% optimality claim and the scaling trends for critical batch size and negative count lack reported error bars, statistical significance tests for the 2% figure, explicit data-exclusion rules, or analysis of whether post-hoc metric choices affect the central trends. These omissions make it difficult to assess robustness of the reported optima.

    Authors: We acknowledge these gaps in the current version. In revision we will add error bars (computed from replicate runs where available) to all figures reporting the 2% optimum and scaling trends for batch size and negative count. We will include statistical significance tests for the identified optima. Data-exclusion criteria (based on convergence thresholds and outlier detection) will be stated explicitly in the experimental section. We will also add an analysis of how the central trends vary with different post-hoc metric choices and report the sensitivity of the scaling conclusions to metric selection. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical observations from 600 runs

Full rationale

The paper presents scaling trends as outcomes of joint variation across four axes in ~600 real-data training runs spanning 10^15 to 10^19 FLOPs. Optimal embedder fraction s*≈2%, D/N ratios, negative counts, and metric disagreements are reported as measured quantities rather than quantities derived from equations that reduce to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text; the 'because' clause in the abstract is an interpretive summary of the observed trends, not a mathematical reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is empirical; the main ledger items are the observed optimal values rather than new theoretical constructs. The 2% embedder fraction is measured, not postulated a priori.

free parameters (1)
  • optimal embedder fraction s* = 0.02
    Observed as approximately 2% from the joint variation of parameter split across the 600 runs; used as the central reported optimum.
axioms (1)
  • Domain assumption: The two-part feature-based embedder plus decoder-only transformer is a suitable architecture for modeling user event sequences.
    Invoked by the choice of model family studied throughout the experiments.

pith-pipeline@v0.9.1-grok · 5813 in / 1419 out tokens · 39679 ms · 2026-06-28T06:46:47.207338+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 10 canonical work pages · 5 internal anchors

  1. J. Kaplan et al. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  2. J. Hoffmann et al. Training compute-optimal large language models (Chinchilla). arXiv:2203.15556, 2022.
  3. G. Yang and E. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks (MuP). In ICML, 2021.
  4. S. McCandlish, J. Kaplan, D. Amodei, and the OpenAI Dota Team. An empirical model of large-batch training. arXiv:1812.06162, 2018.
  5. J. Zhai et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations (HSTU). In ICML, 2024.
  6. B. Zhang et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv:2403.02545, 2024.
  7. N. Ardalani et al. Understanding scaling laws for recommendation models. arXiv:2208.08489, 2022.
  8. H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.
  9. J.-B. Alayrac et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
  10. J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
  11. P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In ACM RecSys, 2016.
  12. Y. Bengio and J.-S. Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
  13. Netflix Technology Blog. Foundation Model for Personalized Recommendation. Mar. 2025. https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39
  14. C.-C. M. Yeh, U. S. Saini, X. Dai, X. Fan, S. Jain et al. TREASURE: A transformer-based foundation model for high-volume transaction understanding (Visa Payment Foundation Model). arXiv:2511.19693, 2025.
  15. Y. Dou, Z. Jiang, T. Zhang, M. Hu, Z. Xu, Y. Chen et al. TransactionGPT. arXiv:2511.08939, 2025.
  16. G. Kedia and the Stripe Machine Learning Team. Stripe’s Payments Foundation Model. Stripe Sessions / Stripe Engineering, May 2025.
  17. V. Iashin et al. PRAGMA: Revolut foundation model. arXiv:2604.08649, 2026.
  18. M. Kawawa-Beaudan, D. Borrajo, M. Veloso et al. TradeFM: A generative foundation model for trade-flow and market microstructure. arXiv:2602.23784, 2026.
  19. R. Brüel Gabrielsson et al. A foundation model for consumption, transactions, and actions: The inception of BehaviorGPT. Unbox AI Research, 2025.
  20. R. Brüel Gabrielsson and V. Gupta. BehaviorGPT at work: A foundation model for workforce actions and dynamics. Unbox AI Research, 2025.
  21. R. Brüel Gabrielsson and V. Gupta. BehaviorGPT for visual art: A foundation model for aesthetics. Unbox AI Research, 2025.
  22. R. Brüel Gabrielsson et al. Large behavioral models: A foundation-model paradigm for human actions. Unbox AI Research, 2026.
  23. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
  24. E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC-8, 1999.
  25. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  26. J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, 2004.

… “loss” here and every ranking, coverage and entropy metric are full-catalogue evaluation quantities, computed against the full ∼13.6M-item Stage-2 catalogue …

A Metric Definitions. Notation and candidate set. Every metric scores each query position against a candidate set C and ranks its items by the dot-product score z_{q,j} = ⟨h_q, e_j⟩, w...