ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Pith reviewed 2026-05-22 01:02 UTC · model grok-4.3
The pith
ActiveDPO selects preference data by letting the LLM itself judge which pairs will most improve alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ActiveDPO is an active data selection algorithm for direct preference optimization. It applies a theoretically grounded selection criterion that remains valid for non-linear reward functions, and it uses the LLM being aligned as the reward model that scores candidate preference pairs, so the model's own influence enters the data collection process.
What carries the argument
The active data selection criterion that uses the LLM as its own reward model to estimate how much each new preference pair will advance the alignment objective.
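A minimal sketch of how such a criterion might be scored, assuming the form quoted from the paper further down this page (Eq. 3): choose the pair whose reward-gradient difference has the largest norm weighted by the inverse of a matrix V. The function name `select_pair`, the precomputed gradient arrays, and the toy numbers are hypothetical. How the paper extracts gradients from the LLM or constructs V is not stated on this page.

```python
import numpy as np

def select_pair(grad_y1, grad_y2, V_inv):
    """Return the index of the pair with the largest V^{-1}-weighted
    norm of the reward-gradient difference, plus all scores.

    grad_y1, grad_y2: arrays of shape (n_pairs, d). Row i holds a
        (hypothetical, precomputed) gradient of r_theta at (x_i, y1_i)
        and (x_i, y2_i), evaluated at the current parameters.
    V_inv: array of shape (d, d), the inverse of the matrix V.
    """
    diff = grad_y1 - grad_y2                          # (n_pairs, d)
    # Squared weighted norm diff_i^T V_inv diff_i for every pair i.
    sq_norms = np.einsum("id,de,ie->i", diff, V_inv, diff)
    scores = np.sqrt(np.maximum(sq_norms, 0.0))       # guard tiny negatives
    return int(np.argmax(scores)), scores

# Toy usage with made-up numbers (not taken from the paper).
g1 = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
g2 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.5]])
V_inv = np.eye(2)
best, scores = select_pair(g1, g2, V_inv)
```

With an identity `V_inv` the score reduces to the plain Euclidean norm of the gradient difference, so the second toy pair is selected.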
If this is right
- Higher alignment quality after the same number of human annotations.
- Lower total cost for building effective preference datasets.
- Direct handling of non-linear reward structures without restrictive simplifications.
- Consistent gains across different base models and real-world preference collections.
Where Pith is reading between the lines
- The same self-parameterized selection idea could transfer to other preference optimization loops such as RLHF variants.
- Lower annotation budgets may allow teams to iterate alignment more often or test more candidate models.
- Combining the criterion with synthetic data generation could shrink human involvement even further.
- Scale experiments on larger models would reveal whether the efficiency edge grows or saturates.
Load-bearing premise
That letting the LLM parameterize the reward model for data selection produces useful choices without creating circular dependencies or model-specific biases that cancel the gains.
What would settle it
A side-by-side run that collects the same number of preferences with ActiveDPO and with an otherwise identical method that uses an independent external reward model, then compares final alignment performance on standard benchmarks.
Original abstract
The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the LLM's influence on data selection, unlike methods that select data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.
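The select-then-update loop that the abstract describes can be illustrated on a toy problem. This is only a sketch under loud assumptions, not the authors' implementation: a small tanh reward stands in for the LLM, the annotator is simulated, V is assumed to accumulate outer products of the selected gradient differences on top of a ridge term, and the update is a plain Bradley-Terry gradient step rather than the actual DPO loss. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pool, n_rounds, lam, lr = 4, 50, 10, 1.0, 0.5

# Hypothetical feature vectors standing in for (prompt, response) pairs.
Z1 = rng.normal(size=(n_pool, d))    # features of (x, y1)
Z2 = rng.normal(size=(n_pool, d))    # features of (x, y2)
theta_true = rng.normal(size=d)      # simulated annotator, hidden from the learner
theta = np.zeros(d)                  # current model parameters

def reward(theta, Z):
    # Toy non-linear reward; a stand-in for the LLM-parameterized reward.
    return np.tanh(Z @ theta)

def reward_grad(theta, Z):
    # d/dtheta tanh(z . theta) = (1 - tanh^2(z . theta)) * z, row by row.
    return (1.0 - np.tanh(Z @ theta) ** 2)[:, None] * Z

V = lam * np.eye(d)                  # assumed ridge-regularized design matrix
available = np.ones(n_pool, dtype=bool)
chosen, labels = [], []

for t in range(n_rounds):
    # 1) Score candidates with the parameters held fixed for this round.
    diff = reward_grad(theta, Z1) - reward_grad(theta, Z2)
    scores = np.sqrt(np.einsum("id,de,ie->i", diff, np.linalg.inv(V), diff))
    scores[~available] = -np.inf     # never re-query a labeled pair
    i = int(np.argmax(scores))
    available[i] = False

    # 2) Query the simulated annotator: label 1.0 means y1 is preferred.
    gap = reward(theta_true, Z1[i]) - reward(theta_true, Z2[i])
    label = float(rng.random() < 1.0 / (1.0 + np.exp(-gap)))
    chosen.append(i)
    labels.append(label)
    V += np.outer(diff[i], diff[i])  # assumed rank-one update of V

    # 3) Only now update the parameters, using all labels collected so far.
    idx, y = np.array(chosen), np.array(labels)
    margin = reward(theta, Z1[idx]) - reward(theta, Z2[idx])
    p = 1.0 / (1.0 + np.exp(-margin))
    grad_margin = reward_grad(theta, Z1[idx]) - reward_grad(theta, Z2[idx])
    theta = theta - lr * ((p - y)[:, None] * grad_margin).mean(axis=0)
```

The ordering is the point of the sketch: scoring in step 1 never sees the update in step 3 of the same round, which mirrors the sequencing argument in the simulated rebuttal on this page.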
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ActiveDPO, an algorithm for active data selection in Direct Preference Optimization (DPO) for LLM alignment. It introduces a theoretically grounded selection criterion that applies to non-linear reward functions by directly parameterizing the reward model with the target LLM itself. This is claimed to explicitly account for the LLM's influence on selection (unlike prior methods that ignore it or assume linear rewards), yielding more effective and sample-efficient preference data collection. Extensive experiments across models and real-world datasets are reported to show outperformance over existing active selection baselines.
Significance. If the central claims are supported, the work could meaningfully advance sample-efficient alignment by reducing reliance on costly human annotations through model-aware data selection. The extension of theoretical grounding to non-linear rewards and the direct use of the LLM for reward parameterization represent clear strengths, as does the reported experimental breadth. These elements, if rigorously validated, would provide a practical and theoretically motivated contribution to preference-based LLM training.
major comments (2)
- [§3] §3 (Method), selection criterion derivation: the construction parameterizes the reward model with the same LLM whose parameters are later updated by DPO on the selected pairs. This creates a potential self-referential dependency whose effect on the claimed independence of the selection criterion is not addressed; the theoretical grounding for non-linear rewards does not automatically guarantee that the resulting data distribution reduces the alignment gap rather than reinforcing the current model's manifold.
- [Experimental results] Experimental section (results tables): the reported gains in alignment metrics are presented without accompanying statistical significance tests, variance estimates across runs, or controls that isolate the effect of the LLM-parameterized selector from post-hoc hyperparameter choices. This weakens the claim that the method produces reliably superior data collections.
minor comments (2)
- [§3.1] Notation for the reward function r_θ and its relation to the policy π_θ should be clarified to avoid ambiguity when the same parameters appear in both the selection objective and the subsequent DPO loss.
- [Abstract / Introduction] The abstract and introduction would benefit from a concise statement of the precise assumptions under which the non-linear reward selection criterion remains valid.
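For the notation point, one common reading comes from the standard DPO formulation (Rafailov et al.), in which the reward is defined implicitly by the policy. Whether ActiveDPO adopts exactly this notation is an assumption here, since this page does not reproduce the paper's definitions; the symbols β, π_ref, and Z(x) are taken from that standard formulation.

```latex
% Implicit reward in the standard DPO formulation (assumed to carry over to ActiveDPO).
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference probability; the \beta \log Z(x) term cancels in the difference.
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r_\theta(x, y_1) - r_\theta(x, y_2)\bigr)
```

Under this reading the same parameters θ appear in both the reward used for selection and the policy updated by DPO, which is the ambiguity the minor comment asks the authors to resolve.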
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the changes we will make in the revised version.
- Referee: [§3] §3 (Method), selection criterion derivation: the construction parameterizes the reward model with the same LLM whose parameters are later updated by DPO on the selected pairs. This creates a potential self-referential dependency whose effect on the claimed independence of the selection criterion is not addressed; the theoretical grounding for non-linear rewards does not automatically guarantee that the resulting data distribution reduces the alignment gap rather than reinforcing the current model's manifold.
  Authors: We thank the referee for raising this point about the derivation. In the method, the reward model is parameterized by the LLM's current parameters at the time of selection; the DPO update is performed only after the batch has been chosen. This sequencing ensures the selection criterion is computed with fixed parameters and does not depend on the subsequent update. We will add an explicit statement of this ordering and a short paragraph discussing the independence property in the revised §3. Regarding the alignment gap versus manifold reinforcement, the derivation maximizes a lower bound on the expected DPO objective improvement, which targets better alignment by construction. We acknowledge that a deeper analysis of long-term distributional effects would be valuable and will include a brief discussion of this aspect along with a simple synthetic example in the revision.
  Revision: partial
- Referee: [Experimental results] Experimental section (results tables): the reported gains in alignment metrics are presented without accompanying statistical significance tests, variance estimates across runs, or controls that isolate the effect of the LLM-parameterized selector from post-hoc hyperparameter choices. This weakens the claim that the method produces reliably superior data collections.
  Authors: We agree that the experimental presentation can be strengthened. In the revised manuscript we will report means and standard deviations over multiple independent runs (at least three) for all main results, include statistical significance tests (paired t-tests with p-values) comparing ActiveDPO against baselines, and add ablation experiments that isolate the contribution of the LLM-parameterized selector while holding other hyperparameters fixed.
  Revision: yes
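The reporting promised in this response (per-method mean and standard deviation, plus a paired t-test across matched runs) could look roughly like the sketch below. The scores are placeholder values invented for illustration, not results from the paper, and `paired_t` is a hypothetical helper. It returns only the t statistic and degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics library.

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic for matched samples a and b (same seeds or runs)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample std, n - 1 denominator
    if sd_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1            # statistic, degrees of freedom

# Placeholder scores for three matched seeds (illustrative only).
active = [0.62, 0.65, 0.61]
baseline = [0.58, 0.60, 0.59]

summary = {
    "active": (statistics.mean(active), statistics.stdev(active)),
    "baseline": (statistics.mean(baseline), statistics.stdev(baseline)),
}
t_stat, dof = paired_t(active, baseline)
```

With only three runs there are 2 degrees of freedom, and the two-sided 5% critical value of the t distribution is about 4.30, so the minimum run count the authors mention can only detect fairly large and consistent gains.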
Circularity Check
No significant circularity: derivation remains independent of fitted inputs or self-referential definitions
The paper's central proposal is an active selection criterion for DPO that is theoretically grounded for non-linear reward functions and explicitly uses the target LLM to parameterize the reward model. No equations or steps in the provided abstract reduce the selection criterion to a quantity defined by the alignment objective itself, nor does the construction rename a fitted parameter as a prediction. The design choice to leverage the LLM for selection is presented as an explicit accounting for its influence rather than a tautological loop. No self-citation chains, uniqueness theorems from prior author work, or ansatz smuggling are invoked in the abstract to justify the core claim. The method is therefore self-contained against external benchmarks of active learning for preference optimization.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we derived the uncertainty quantification on human preference for our LLM trained by DPO ... selection criterion ... $\arg\max \|\nabla r_{\theta_{t-1}}(x, y_1) - \nabla r_{\theta_{t-1}}(x, y_2)\|_{V_{t-1}^{-1}}$ (Eq. 3)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.