Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

Kim Phuc Tran

arxiv: 2605.22940 · v1 · pith:YFK4RKHNnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 06:10 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords entropy regularization · representation learning · dynamical systems · information force · geometric entropy · scaling laws · open learning systems

The pith

Entropy regularization in learning is effective only when the surrogate produces a non-degenerate information force along the optimization trajectory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames deep learning as an open dynamical process rather than closed optimization and claims that entropy regularization contributes only when the chosen surrogate creates a sustained information force. Without that force the dynamics revert to ordinary loss minimization and the regularization term adds little. Geometric proxies based on variance or log-determinant of covariance are shown to generate stronger, more stable forces than the conventional softmax entropy. The framework supplies convergence, flow, and generalization results under explicit conditions and offers a conditional account of scaling behavior as a balance of injection and dissipation. Controlled experiments are presented as support for preferring the geometric surrogates.

Core claim

Entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory; otherwise entropy terms may produce weak, unstable, or misaligned gradients. The paper introduces effective entropy and demonstrates that geometric entropy surrogates, especially the log-determinant covariance proxy, induce stronger and more stable information forces than softmax-normalized entropy. It derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions and gives a conditional dynamical interpretation of scaling-law-like behavior.
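To make the concern about weak gradients concrete, here is a minimal Python sketch that is not taken from the paper. It computes the Shannon entropy of a softmax distribution and its gradient with respect to the logits, using the standard identity dH/dz_i = -p_i (log p_i + H). The function names, and the use of the gradient norm as a stand-in for the strength of the information force, are illustrative assumptions; the paper's own definition of information force is not reproduced on this page.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    p = softmax(logits)
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def softmax_entropy_grad(logits):
    """Gradient of the entropy w.r.t. the logits: -p_i * (log p_i + H)."""
    p = softmax(logits)
    h = softmax_entropy(logits)
    return [-pi * (math.log(pi) + h) if pi > 0.0 else 0.0 for pi in p]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


if __name__ == "__main__":
    cases = {
        "uniform": [0.0, 0.0, 0.0],
        "moderate": [1.0, 0.0, -1.0],
        "saturated": [30.0, 0.0, -30.0],
    }
    for name, logits in cases.items():
        grad_norm = l2_norm(softmax_entropy_grad(logits))
        print(name, softmax_entropy(logits), grad_norm)
```

Under this identity the gradient is exactly zero at uniform logits and shrinks toward zero as the distribution saturates, which is one simple way a softmax-based entropy term can end up contributing little to the update.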

What carries the argument

The effective information force produced by an entropy surrogate along the optimization trajectory, with geometric proxies such as log-determinant covariance entropy serving as the mechanism that keeps the force non-degenerate.
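As a rough illustration of the geometric surrogate, the following Python sketch evaluates the log-determinant covariance proxy in the form quoted elsewhere on this page, one half of log det(Sigma_Z + eps I), together with its gradient with respect to the representations. Several details are assumptions rather than the paper's choices: the biased 1/n covariance normalisation, the default `eps`, the function names, and the reading of the "information force" as this gradient.

```python
import numpy as np


def logdet_entropy(z, eps=1e-3):
    """0.5 * log det(Sigma_Z + eps * I) for a batch z of shape (n, d)."""
    n, d = z.shape
    zc = z - z.mean(axis=0, keepdims=True)  # centre the batch
    sigma = zc.T @ zc / n                   # biased covariance (assumed)
    _, logdet = np.linalg.slogdet(sigma + eps * np.eye(d))
    return 0.5 * logdet


def logdet_entropy_grad(z, eps=1e-3):
    """Gradient of logdet_entropy w.r.t. z: Zc (Sigma + eps I)^{-1} / n."""
    n, d = z.shape
    zc = z - z.mean(axis=0, keepdims=True)
    sigma = zc.T @ zc / n
    # Solve the linear system instead of forming the inverse explicitly;
    # the matrix is symmetric, so transposing the solution gives Zc M^{-1}.
    return np.linalg.solve(sigma + eps * np.eye(d), zc.T).T / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(64, 4))
    print("surrogate:", logdet_entropy(z))
    print("gradient norm:", np.linalg.norm(logdet_entropy_grad(z)))
```

Because Sigma_Z + eps I is positive definite for any eps > 0, both the surrogate and its gradient stay finite even when the batch covariance is rank-deficient.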

If this is right

Convergence, entropy-flow, and Wasserstein-gradient-flow results hold when the information force remains non-degenerate.
A conditional dynamical interpretation links scaling-law-like behavior to the balance between information injection, entropy dissipation, and residual risk.
Geometric surrogates, especially log-determinant covariance entropy, produce stronger and more stable forces than softmax entropy in the reported experiments.
The framework applies to open systems that operate under uncertainty, resource limits, distribution shift, and human feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of continual-learning systems might test covariance-based surrogates to maintain stable forces when data distributions change over time.
The same force condition could guide regularization choices in settings where human feedback directly modifies the training objective.
If the non-degenerate requirement can be monitored during training, it may reduce reliance on exhaustive search over entropy coefficients.
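One way such monitoring might look in practice is sketched below in Python. It is not from the paper: the ratio of the scaled entropy-gradient norm to the loss-gradient norm, the threshold `min_ratio`, the window length, and the class name `ForceMonitor` are all hypothetical choices made for illustration, standing in for whatever non-degeneracy criterion the paper actually defines.

```python
import math
from collections import deque


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


class ForceMonitor:
    """Tracks beta * ||entropy grad|| / ||loss grad|| over recent steps.

    Hypothetical diagnostic: the force is flagged as degenerate when the
    ratio has stayed below `min_ratio` for a full window of steps.
    """

    def __init__(self, min_ratio=1e-3, window=50):
        self.min_ratio = min_ratio
        self.ratios = deque(maxlen=window)

    def update(self, loss_grad, entropy_grad, beta):
        loss_norm = l2_norm(loss_grad)
        force_norm = beta * l2_norm(entropy_grad)
        if loss_norm > 0.0:
            ratio = force_norm / loss_norm
        else:
            # No loss gradient: any non-zero force dominates the update.
            ratio = math.inf if force_norm > 0.0 else 0.0
        self.ratios.append(ratio)
        return ratio

    def is_degenerate(self):
        # Only decide once a full window of observations is available.
        if len(self.ratios) < self.ratios.maxlen:
            return False
        return max(self.ratios) < self.min_ratio
```

In a training loop, `update` would be called once per step with the flattened gradients of the task loss and of the entropy term. A sustained `is_degenerate()` signal could then prompt a change of surrogate or coefficient instead of a blind sweep.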

Load-bearing premise

That tractable geometric entropy surrogates can be chosen and tuned so they reliably produce non-degenerate information forces in real training trajectories.

What would settle it

A controlled representation-learning run in which log-determinant covariance entropy yields weaker or less stable gradients than softmax-normalized entropy would falsify the central hypothesis.

Figures

Figures reproduced from arXiv: 2605.22940 by Kim Phuc Tran.

**Figure 3.** Information injection and entropy dissipation as competing scale

**Figure 4.** Effective information ratio across scale. Stable scaling corresponds

**Figure 5.** Transformer RL-HCLM: test loss versus β using softmax entropy. Softmax entropy produces weak and unstable improvements, confirming its limited effectiveness as an information surrogate

**Figure 6.** Transformer RL-HCLM: generalization gap versus

**Figure 7.** Transformer RL-HCLM: information force versus

**Figure 8.** Transformer RL-HCLM: representation entropy versus

**Figure 9.** Transformer RL-HCLM: thermostat coefficient

**Figure 10.** Transformer RL-HCLM: human/RL reward versus

**Figure 11.** Both Thermostat and RL-thermostat methods stabilize and lower the test loss compared to fixed hybrid control.

**Figure 12.** Transformer RL-HCLM: generalization gap versus

**Figure 13.** Transformer RL-HCLM: information force versus

**Figure 14.** Transformer RL-HCLM: thermostat coefficient

**Figure 15.** Transformer RL-HCLM: human/RL reward versus

**Figure 16.** Transformer RL-HCLM: test loss versus β using log-determinant entropy. Thermostat and RL-thermostat regimes stabilize the test loss compared with fixed hybrid control. The superiority of this geometric surrogate is highlighted in

**Figure 17.** Transformer RL-HCLM: generalization gap versus

**Figure 18.** Transformer RL-HCLM: information force versus

**Figure 19.** Transformer RL-HCLM: representation entropy versus

**Figure 20.** Transformer RL-HCLM: thermostat coefficient

**Figure 21.** Transformer RL-HCLM: human/RL reward versus

**Figure 22.** Transformer RL-HCLM: test-loss dynamics for log-determinant en

**Figure 23.** Transformer RL-HCLM: information-force dynamics for log

**Figure 24.** Transformer RL-HCLM: entropy dynamics for log-determinant en

**Figure 25.** Transformer RL-HCLM: test-loss dynamics for log-determinant en

**Figure 26.** Transformer RL-HCLM: information-force dynamics for log

**Figure 27.** Transformer RL-HCLM: adaptive βt dynamics for log-determinant entropy under thermostat control. The controller maintains a stable dissipation coefficient after an initial transient.

**Figure 28.** Transformer RL-HCLM: reward dynamics for log-determinant en

**Figure 29.** Transformer RL-HCLM: test-loss dynamics for log-determinant en

**Figure 30.** Transformer RL-HCLM: information-force dynamics for log

**Figure 31.** Transformer RL-HCLM: adaptive βt dynamics for log-determinant entropy under RL-thermostat control. RL feedback modulates the entropy coefficient while keeping it in a stable range

**Figure 32.** Transformer RL-HCLM: reward dynamics for log-determinant en

**Figure 1.** The Human-Centered Learning Mechanics (HCLM) conceptual frame

Original abstract

Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper makes the benefit of entropy regularization conditional on non-degenerate information forces from geometric surrogates such as log-det covariance, but the derivations and experiments stay high-level, with no visible equations or numbers.

The paper's main point is that entropy regularization only matters in dynamical learning if the chosen surrogate produces a non-degenerate information force along the trajectory; otherwise it just adds weak or misaligned gradients and the system reverts to ordinary loss minimization. They introduce geometric proxies such as variance-based and log-determinant covariance entropy as tractable alternatives to softmax entropy and claim these work better in controlled representation-learning runs. They also give a conditional reading of scaling-law behavior as a balance of information injection, entropy dissipation, and residual risk rather than an unconditional law.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for entropy-regulated representation learning in open systems. Its central claim is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory; otherwise gradients are weak or misaligned. It introduces tractable geometric entropy surrogates (variance-based and log-determinant covariance proxies), derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions, and offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk. Controlled experiments are presented as evidence that geometric surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.

Significance. If the conditional framework and experimental distinctions hold, the work could guide entropy regularization choices in representation learning under uncertainty and distribution shift by focusing on information-force non-degeneracy rather than blanket regularization. The explicit conditioning on assumptions and the refusal to claim an unconditional derivation of neural scaling laws are strengths that keep the contribution proportionate. The emphasis on geometric surrogates over softmax entropy offers a concrete, testable distinction if the non-degeneracy condition can be made operational.

major comments (3)

[Abstract] The claim that 'controlled representation-learning experiments support the hypothesis' is load-bearing for the superiority of log-determinant covariance entropy, yet the abstract (and by extension the reported results) provides no equations, error bars, data-exclusion criteria, or fitting procedures, preventing assessment of whether the observed force differences are robust or artifactual.
[Abstract, central claim] The usefulness of entropy regularization is conditioned on the surrogate generating a non-degenerate information force, but no a-priori, independent test for degeneracy is supplied; satisfaction of the condition for the proposed variance-based and log-determinant proxies is stated to require running the full optimization, rendering the conditional claim post-hoc rather than predictive.
[Abstract] The derivations of convergence, entropy-flow, and generalization results are explicitly conditioned on assumptions whose verification for the geometric surrogates is not shown to be checkable without executing the trajectory, which is load-bearing for whether the framework delivers more than an empirical observation.

minor comments (2)

The notions of 'effective entropy' and 'information force' are introduced without a compact definition or notation table early in the manuscript, which would aid readers coming from standard information-theoretic regularization literature.
The distinction between the conditional scaling-law interpretation and unconditional empirical scaling laws is valuable but would benefit from a short dedicated paragraph contrasting the two to prevent misreading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, with honest indications of where revisions are feasible or where we maintain our position on substantive grounds.

Referee: [Abstract] The claim that 'controlled representation-learning experiments support the hypothesis' is load-bearing for the superiority of log-determinant covariance entropy, yet the abstract (and by extension the reported results) provides no equations, error bars, data-exclusion criteria, or fitting procedures, preventing assessment of whether the observed force differences are robust or artifactual.

Authors: We agree that the abstract's brevity limits inclusion of full experimental metadata. The main text (Section 4) and supplementary material contain the equations for the force metrics, error bars from multiple runs, data-exclusion criteria, and fitting procedures. We will revise the abstract to add a concise clause referencing these sections and summarizing the observed force stability differences, while respecting length constraints.
Revision: partial

Referee: [Abstract, central claim] The usefulness of entropy regularization is conditioned on the surrogate generating a non-degenerate information force, but no a-priori, independent test for degeneracy is supplied; satisfaction of the condition for the proposed variance-based and log-determinant proxies is stated to require running the full optimization, rendering the conditional claim post-hoc rather than predictive.

Authors: The conditional claim is a deliberate feature of the dynamical framework: non-degeneracy is defined along the optimization trajectory in open systems and cannot be certified by a static, trajectory-independent test without altering the core thesis. The experiments demonstrate how the geometric surrogates satisfy the condition in practice; this is not post-hoc but follows from the information-force analysis. We will add a clarifying sentence in the introduction to distinguish the dynamical condition from a predictive pre-check, without changing the claim itself.
Revision: no

Referee: [Abstract] The derivations of convergence, entropy-flow, and generalization results are explicitly conditioned on assumptions whose verification for the geometric surrogates is not shown to be checkable without executing the trajectory, which is load-bearing for whether the framework delivers more than an empirical observation.

Authors: The explicit conditioning on assumptions is intentional to bound the theoretical results and avoid overclaiming. In a dynamical setting, verifying trajectory-dependent quantities such as force non-degeneracy inherently requires observing the optimization path; this does not reduce the contribution to empiricism, as the derivations supply the precise conditions under which the stated convergence and generalization hold. We will ensure the revised manuscript reiterates this scope in the theory sections.
Revision: no

Circularity Check

0 steps flagged

No significant circularity; derivations remain conditional and self-contained.

The provided abstract and description present derivations of convergence, entropy-flow, and related results explicitly conditioned on assumptions, along with a conditional (not unconditional) interpretation of scaling-law-like behavior. No equations, self-citations, or fitted-parameter renamings are quoted that would reduce any claimed prediction or result to its inputs by construction. The framework conditions usefulness on non-degenerate information forces but does not exhibit a self-definitional loop or load-bearing self-citation chain in the given text; experimental support is presented separately from the derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The central claim rests on the unstated assumption that geometric surrogates can be made non-degenerate, but no ledger entries can be extracted.

pith-pipeline@v0.9.0 · 5782 in / 1276 out tokens · 24431 ms · 2026-05-25T06:10:02.109213+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: log-determinant covariance surrogate $\tilde{H}_{\mathrm{logdet}}(Z) = \tfrac{1}{2}\log\det(\Sigma_Z + \epsilon I)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
