Reinforced Self-Training (ReST) for Language Modeling
Pith reviewed 2026-05-13 07:56 UTC · model grok-4.3
The pith
ReST improves large language model outputs for machine translation by using offline reinforcement learning on self-generated samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReST produces a dataset by generating samples from an initial LLM policy and then improves the policy using offline RL algorithms, resulting in substantially better translation quality on machine translation benchmarks while being more efficient than online RLHF.
What carries the argument
Reinforced Self-Training (ReST), which generates samples offline from the current policy and reuses them for offline RL policy improvement.
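A minimal sketch of the generate-then-improve loop described above. It assumes a toy categorical policy over four candidate outputs in place of an LLM, a hypothetical lookup-table reward, and filtered behavioural cloning as the offline update. The function names (`grow`, `improve`, `rest_toy`), the thresholds, and the learning rate are illustrative and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grow(logits, n_samples, rng):
    """Sample a static dataset of output indices from the current policy."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=n_samples)


def improve(logits, dataset, rewards, threshold, lr=0.5, epochs=20):
    """Filtered behavioural cloning on samples whose reward passes the threshold."""
    kept = [a for a in dataset if rewards[a] >= threshold]
    if not kept:
        return list(logits)  # nothing passes the filter, so leave the policy as is
    counts = [0] * len(logits)
    for a in kept:
        counts[a] += 1
    empirical = [c / len(kept) for c in counts]
    logits = list(logits)
    for _ in range(epochs):
        probs = softmax(logits)
        # Gradient of the mean log-likelihood w.r.t. logits: empirical - probs.
        logits = [l + lr * (e - p) for l, e, p in zip(logits, empirical, probs)]
    return logits


def expected_reward(logits, rewards):
    """Expected reward of the policy under the (hypothetical) reward table."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def rest_toy(rewards, thresholds, n_samples=500, seed=0):
    """One Grow step, then several Improve steps that reuse the same dataset."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)            # uniform initial policy
    dataset = grow(logits, n_samples, rng)   # generated once, then reused
    history = [expected_reward(logits, rewards)]
    for threshold in thresholds:             # assumed rising filter schedule
        logits = improve(logits, dataset, rewards, threshold)
        history.append(expected_reward(logits, rewards))
    return history
```

Calling `rest_toy([0.1, 0.4, 0.7, 0.9], thresholds=[0.3, 0.6])` returns the expected reward before and after each Improve step. The point of the sketch is only that the sampled dataset is built once and reused across updates, which is where the claimed efficiency over online sampling would come from.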
If this is right
- ReST allows data reuse, making it more compute- and sample-efficient than typical online RLHF.
- Translation quality improves substantially as measured by automated metrics on MT benchmarks.
- Human evaluations confirm the quality gains on machine translation tasks.
- ReST is applicable as a general approach to other generative learning settings beyond translation.
Where Pith is reading between the lines
- ReST could reduce reliance on real-time human feedback by using static datasets for alignment.
- Models trained this way might maintain improvements across multiple iterations if the offline data captures sufficient preference information.
- Testing ReST on non-translation tasks like summarization could reveal its broader utility.
- Combining ReST with online methods might yield hybrid approaches for even better alignment.
Load-bearing premise
A one-time offline dataset generated from the initial model policy is sufficient to achieve stable alignment improvements through offline RL without further online updates or new feedback.
What would settle it
Running ReST on a standard machine translation benchmark and observing no improvement or worse performance in BLEU scores or human preference ratings compared to the base model would falsify the effectiveness claim.
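One way such a comparison could be checked is a paired bootstrap over per-segment scores for the base model and the ReST-trained model. The sketch below assumes segment-level scores are available (for example a segment-level metric or human ratings, since corpus-level BLEU would need to be recomputed per resample); the function name and parameters are hypothetical.

```python
import random


def paired_bootstrap(base_scores, new_scores, n_resamples=2000, seed=0):
    """Return (mean score delta, fraction of resamples where new does not beat base)."""
    if len(base_scores) != len(new_scores):
        raise ValueError("score lists must cover the same segments")
    rng = random.Random(seed)
    # Per-segment improvement of the new system over the base system.
    deltas = [n - b for b, n in zip(base_scores, new_scores)]
    n = len(deltas)
    mean_delta = sum(deltas) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample segments with replacement, keeping the pairing intact.
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            not_better += 1
    return mean_delta, not_better / n_resamples
```

A large fraction in the second return value would indicate that the observed gain is not robust to resampling of the test segments, which is the kind of outcome that would count against the effectiveness claim.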
Original abstract
Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reinforced Self-Training (ReST), a simple offline RL algorithm for aligning LLMs with human preferences. Starting from an initial policy, ReST generates a fixed dataset of samples, filters or scores them (implicitly via a reward model or metric), and then applies offline RL to update the policy. The method is positioned as more efficient than online RLHF due to data reuse. The central empirical claim is that ReST yields substantial gains in machine translation quality on standard benchmarks, as measured by automated metrics and human evaluation, while remaining compute- and sample-efficient.
Significance. If the empirical results are robust, ReST would provide a practical, lower-cost route to preference alignment that avoids repeated online sampling and human feedback loops. The emphasis on offline data reuse and applicability beyond MT could influence how future alignment pipelines are designed, especially in settings where generating fresh trajectories is expensive.
major comments (2)
- [Method] Method section (description of ReST): the central claim that offline RL on a static dataset generated from the initial policy produces stable, meaningful alignment gains is load-bearing, yet the manuscript provides no ablation that replaces the offline RL update with supervised fine-tuning on the identical filtered high-reward samples. Without this comparison, it is impossible to isolate whether observed BLEU/human-score improvements stem from the RL objective or simply from training on higher-quality filtered data.
- [Experiments] Experimental results (MT benchmarks): the abstract and results claim 'substantial' improvements via automated metrics and human evaluation, but the reported numbers, baseline comparisons, and variance across runs are not quantified in sufficient detail to assess effect size or statistical reliability. This directly affects the efficiency and superiority claims relative to standard RLHF.
minor comments (2)
- [Method] The notation for the reward model and the precise offline RL objective (e.g., which algorithm is used and how the advantage or value estimates are computed) should be stated explicitly with equations rather than described at a high level.
- [Experiments] Figure captions and table headers should include the exact number of samples, compute budget, and reward model details so that the efficiency claims can be reproduced.
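As an illustration of the kind of explicit statement the first minor comment asks for, a reward-filtered offline objective could be written roughly as follows. The notation is assumed for illustration: $\mathcal{D}_g$ is the generated dataset, $R$ a reward model, $\tau$ a filtering threshold, and $\pi_\theta$ the policy. The negative log-likelihood loss is only one possible choice of $\mathcal{L}$, and the whole expression is a loss to be minimised.

```latex
% Illustrative notation (assumed, not quoted from the manuscript).
J(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_g}
  \left[ F(x, y; \tau) \, \mathcal{L}(x, y; \theta) \right],
\qquad
F(x, y; \tau) = \mathbb{1}\left[ R(x, y) > \tau \right]

% One possible per-sample loss: negative log-likelihood of the sampled output.
\mathcal{L}_{\mathrm{NLL}}(x, y; \theta)
  = - \sum_{t=1}^{T} \log \pi_\theta\left( y_t \mid y_{1:t-1}, x \right)
```

With this choice of $\mathcal{L}$, the objective reduces to supervised fine-tuning on the samples that pass the filter, which is why the ablation requested in the first major comment matters for separating the two effects.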
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to incorporate the suggested improvements.
Point-by-point responses
-
Referee: [Method] Method section (description of ReST): the central claim that offline RL on a static dataset generated from the initial policy produces stable, meaningful alignment gains is load-bearing, yet the manuscript provides no ablation that replaces the offline RL update with supervised fine-tuning on the identical filtered high-reward samples. Without this comparison, it is impossible to isolate whether observed BLEU/human-score improvements stem from the RL objective or simply from training on higher-quality filtered data.
Authors: We agree that an explicit ablation replacing the offline RL update with supervised fine-tuning on the identical filtered high-reward samples would strengthen the isolation of the RL objective's contribution. While the manuscript already includes comparisons to standard fine-tuning baselines, it does not contain this precise control. In the revision we will add the requested ablation, which we expect to show additional gains from the RL step beyond SFT on filtered data, thereby supporting the central claim. revision: yes
-
Referee: [Experiments] Experimental results (MT benchmarks): the abstract and results claim 'substantial' improvements via automated metrics and human evaluation, but the reported numbers, baseline comparisons, and variance across runs are not quantified in sufficient detail to assess effect size or statistical reliability. This directly affects the efficiency and superiority claims relative to standard RLHF.
Authors: We acknowledge that more granular reporting is needed to evaluate effect sizes and reliability. In the revised manuscript we will expand the results section with full numerical tables (including means and standard deviations across runs), additional baseline details, and statistical significance measures where appropriate. These additions will better substantiate the efficiency and improvement claims. revision: yes
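The reporting promised here can be sketched with the standard library alone. The example below assumes each system has one score per independent training run; the function names are hypothetical and any numbers passed in are placeholders, not results from the paper.

```python
import math
import statistics


def summarize(run_scores):
    """Return (mean, sample standard deviation) across independent runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def welch_t(scores_a, scores_b):
    """Welch's t statistic for two sets of run scores with unequal variances."""
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    # Standard error of the difference between the two means.
    std_err = math.sqrt(var_a / len(scores_a) + var_b / len(scores_b))
    return (mean_a - mean_b) / std_err
```

Reporting the pair from `summarize` for every table entry, together with a test statistic such as `welch_t` for the key comparisons, would let readers judge effect size relative to run-to-run variance.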
Circularity Check
No significant circularity: ReST is a direct application of offline RL to generated data
Full rationale
The paper describes ReST as first sampling trajectories from an initial policy to build a static dataset, then applying standard offline RL (e.g., filtered supervised fine-tuning or similar) to update the policy. This chain relies on external RL algorithms and empirical evaluation on MT benchmarks rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the claimed improvements to the inputs by construction; the method is self-contained against standard offline RL practice and does not invoke uniqueness theorems or ansatzes from the authors' prior work to force the result.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward model parameters
axioms (1)
- domain assumption: Offline RL on a fixed dataset generated by the current policy can improve the policy toward human preferences
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse.
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 29 Pith papers
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
-
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
-
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
-
Near-Future Policy Optimization
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
-
PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
PAINT boosts on-policy self-distillation for LLM reasoning via adaptive partial-solution masking and entropy-mismatch interpolation, delivering consistent gains on math benchmarks across Qwen3 model scales.
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
SAM 3D: 3Dfy Anything in Images
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.