
arxiv: 2606.20363 · v1 · pith:TAJ4DZ4Gnew · submitted 2026-06-18 · 💻 cs.AI

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Pith reviewed 2026-06-26 17:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords: skill library · trajectory mining · GUI agents · segment clustering · policy training · diagnostic study · computer-using agents · GRPO

The pith

Trajectory mining from GUI interactions produces readable skill clusters but does not reliably improve agent policies on new tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether explicit skill libraries for computer-using agents can be automatically generated from interaction trajectories in a way that enhances downstream performance. It implements a pipeline that segments trajectories, clusters the segments into skills, and then trains a skill-aware policy. On the source benchmark, the clusters show high purity matching known workflows, yet when used for training, the resulting policies show only slight gains in one metric and none in others. The authors conclude that current techniques for boundary detection, representation, and reward modeling fall short for cross-domain transfer. This serves as a diagnostic rather than a solution, highlighting specific bottlenecks in the mining approach.

Core claim

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics.

What carries the argument

A three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the annotations.

If this is right

  • Five of eight mined clusters achieve at least 0.95 purity against InteraSkill Workflows labels on the source benchmark.
  • GRPO training on the mined skills raises skill-step accuracy on IW from 18.5% to 20.5%.
  • Performance on the BrowseComp+ benchmark remains essentially unchanged after training.
  • The skill-aware policy underperforms simple frequency-based priors on several source-domain metrics.
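
The purity figure quoted above can be made concrete with a small sketch. It assumes the usual definition of per-cluster purity (the share of a cluster's segments that carry its most common reference label); the page does not spell out the paper's exact formula, and the function name, cluster ids, and labels below are hypothetical toy values, not data from the paper.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Return {cluster_id: purity}.

    Purity is the fraction of a cluster's members that share the
    cluster's most common reference label (assumed definition).
    """
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid].append(label)
    return {
        cid: Counter(members).most_common(1)[0][1] / len(members)
        for cid, members in by_cluster.items()
    }


# Toy, made-up assignments (NOT data from the paper).
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["login", "login", "login", "login",
          "search", "search", "export", "export"]

purity = cluster_purity(cluster_ids, labels)
# Clusters that clear the 0.95 threshold mentioned in the text.
high_purity = [cid for cid, p in purity.items() if p >= 0.95]
print(purity, high_purity)
```

Purity only measures agreement with source-benchmark labels, which is why a high value here says nothing by itself about transfer to held-out benchmarks.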

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A boundary detector that more accurately identifies skill transitions could allow the same clustering step to produce clusters that support larger policy gains.
  • Replacing the orderless segment representation with one that preserves sequence order might capture dependencies that current clusters miss.
  • Switching from an offline reward model to one that is updated during policy training could reduce the gap to frequency priors observed on source metrics.
  • Applying the pipeline to additional held-out domains beyond IW and BrowseComp+ would test whether the reported insufficiency is specific to those benchmarks.
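
To illustrate the point about orderless representations, here is a minimal sketch. It assumes "orderless" means something like a bag of actions, and it uses action bigrams as one possible order-preserving alternative; neither choice is confirmed by the paper, and the action names are hypothetical.

```python
from collections import Counter


def orderless_repr(segment):
    """Bag of actions: keeps counts, discards order (assumed reading)."""
    return Counter(segment)


def ordered_repr(segment, n=2):
    """Bag of action n-grams: retains local ordering information."""
    return Counter(
        tuple(segment[i:i + n]) for i in range(len(segment) - n + 1)
    )


# Two hypothetical segments with the same actions in a different order.
seg_a = ["click", "type", "click", "submit"]
seg_b = ["type", "click", "submit", "click"]

same_when_orderless = orderless_repr(seg_a) == orderless_repr(seg_b)
same_when_ordered = ordered_repr(seg_a) == ordered_repr(seg_b)
print(same_when_orderless, same_when_ordered)
```

Under the bag-of-actions view the two segments are indistinguishable, so a clustering step would be forced to place them together even if they correspond to different workflows.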

Load-bearing premise

High purity of mined clusters against InteraSkill Workflows labels indicates transferable skills that will improve policies on held-out benchmarks like BrowseComp+.

What would settle it

A controlled experiment that keeps all other components fixed but replaces the current boundary detector with one that achieves near-perfect segment boundaries, then measures whether GRPO training produces gains on BrowseComp+ larger than the observed zero change.

Figures

Figures reproduced from arXiv: 2606.20363 by Xiaomin Li, Yuexing Hao.

Figure 1: Study design for automated SKILL.md generation. IW is the source dataset for trajectory segmentation, skill-library construction, and Phase 3 GRPO policy training; WebArena and BrowseComp+ are the completed held-out transfer checks. Mind2Web zero-shot and WorkArena-NLP are reported only as diagnostics, not as current GRPO transfer evidence. The paper evaluates boundary quality, cluster quality, auto-gener…

Figure 2: Data-efficiency comparison for generated …

Original abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes a three-stage pipeline (trajectory segmentation, segment clustering into candidate skills, and GRPO-based skill-aware policy training) for automatically generating inspectable skill libraries from GUI interaction data. It reports that five of eight mined clusters achieve ≥0.95 purity against InteraSkill Workflows labels on the source domain, yet GRPO yields only a 2-point gain in IW skill-step accuracy (18.5% → 20.5%), no improvement on BrowseComp+, and underperforms frequency priors; the work is framed as a diagnostic study showing that current boundary detection, orderless segment representations, and offline rewards are insufficient for reliable cross-domain policy gains.

Significance. If the negative result is robust, the paper supplies a concrete falsification that high source-domain cluster purity does not imply transferable policy improvement, identifying three specific pipeline bottlenecks. This diagnostic framing is useful for the computer-using agents literature and avoids overclaiming positive transfer. The explicit comparison against both external labels and trivial baselines strengthens the evidentiary value.
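
For readers unfamiliar with the kind of trivial baseline referred to here, the sketch below shows one plausible form of a frequency prior: always predict the skill that occurs most often in the training annotations. This is an assumption about what the baseline looks like, not the paper's definition, and the skill names and sequences are made-up toy values.

```python
from collections import Counter


def fit_frequency_prior(train_skills):
    """Return the most frequent skill label in the training annotations."""
    return Counter(train_skills).most_common(1)[0][0]


def skill_step_accuracy(predicted, reference):
    """Fraction of steps whose predicted skill matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


# Toy, made-up annotations (NOT data from the paper).
train = ["search", "search", "login", "search", "export"]
test = ["search", "login", "search", "export"]

prior = fit_frequency_prior(train)
baseline_acc = skill_step_accuracy([prior] * len(test), test)
print(prior, baseline_acc)
```

A learned policy that cannot beat a constant prediction of this kind on the source domain has gained little from the mined annotations, which is the comparison the report treats as evidentially useful.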

major comments (2)
  1. §4 (Results): the central insufficiency claim rests on the GRPO vs. frequency-prior comparison and the 2-point IW gain, yet the text provides only point estimates with no error bars, run-to-run variance, or statistical significance tests; without these, it is impossible to judge whether the observed gaps are reliable enough to support the conclusion that the three pipeline components are insufficient.
  2. §3.2–3.3 (Boundary detector and segment representation): the paper identifies these as load-bearing limitations but reports no ablation that isolates their individual contributions to the transfer failure (e.g., replacing the orderless representation with an ordered one while keeping the same clusters); the diagnostic conclusion therefore remains partly qualitative.
minor comments (2)
  1. Abstract and §4: the purity and accuracy numbers are given without reference to the exact number of trajectories or episodes used, making it hard to assess sample size.
  2. Figure captions: several figures lack axis labels or legend entries that would allow a reader to verify the reported purity and accuracy values directly from the plots.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and the recommendation for minor revision. The comments identify key areas where additional rigor can strengthen the diagnostic claims regarding the pipeline's limitations. Below we provide point-by-point responses.

  1. Referee: §4 (Results): the central insufficiency claim rests on the GRPO vs. frequency-prior comparison and the 2-point IW gain, yet the text provides only point estimates with no error bars, run-to-run variance, or statistical significance tests; without these, it is impossible to judge whether the observed gaps are reliable enough to support the conclusion that the three pipeline components are insufficient.

    Authors: We agree with this assessment. The current manuscript reports only single-run point estimates for the skill-step accuracy improvements. In the revised version, we will rerun the GRPO training multiple times to compute means and standard deviations, and include statistical tests (such as Wilcoxon signed-rank tests) comparing against the frequency prior baseline. This will allow readers to better evaluate the reliability of the 2-point gain and the underperformance relative to priors.

    Revision: yes

  2. Referee: §3.2–3.3 (Boundary detector and segment representation): the paper identifies these as load-bearing limitations but reports no ablation that isolates their individual contributions to the transfer failure (e.g., replacing the orderless representation with an ordered one while keeping the same clusters); the diagnostic conclusion therefore remains partly qualitative.

    Authors: We acknowledge that the identification of specific bottlenecks is based on the overall experimental outcomes rather than isolated ablations. Conducting the suggested ablations would require substantial additional engineering and compute to implement ordered segment representations and alternative boundary detectors while controlling for other variables. As this is framed as a diagnostic study highlighting insufficiencies, we believe the current evidence suffices to motivate future work on these components. We will revise the text to more clearly state the qualitative basis of these claims and their implications.

    Revision: partial
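
The multi-run reporting discussed in the first response could look roughly like the sketch below. It computes per-method means and standard deviations and then runs an exact paired sign-flip permutation test, which is swapped in here as a stdlib-only stand-in for the Wilcoxon signed-rank test the authors mention. All accuracy values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from itertools import product
from statistics import mean, stdev


def sign_flip_p_value(diffs):
    """Exact two-sided paired sign-flip permutation test on the mean.

    Enumerates every assignment of +/- signs to the paired differences
    and counts how often the absolute mean is at least as large as the
    observed one. Feasible only for a small number of runs.
    """
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-seed accuracies (placeholders, NOT paper results).
toy_grpo = [0.21, 0.20, 0.22, 0.19, 0.21]
toy_prior = [0.23, 0.24, 0.22, 0.23, 0.25]

diffs = [g - p for g, p in zip(toy_grpo, toy_prior)]
summary = {
    "grpo": (mean(toy_grpo), stdev(toy_grpo)),
    "prior": (mean(toy_prior), stdev(toy_prior)),
}
p_value = sign_flip_p_value(diffs)
print(summary, p_value)
```

With only a handful of seeds the smallest attainable p-value is bounded below by 2 / 2^n, so the number of reruns limits how strong any significance claim can be.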

Circularity Check

0 steps flagged

No significant circularity

The paper is framed explicitly as a diagnostic study: it mines clusters from trajectories, reports high source-domain purity (at least 0.95 on five of eight clusters vs. InteraSkill Workflows labels), then shows that the resulting annotations yield only marginal GRPO gains (+2 points of IW skill-step accuracy) and no BrowseComp+ improvement while underperforming frequency priors. This negative result is supported by direct empirical comparisons to external baselines and does not rely on any derivation that reduces a claimed prediction or uniqueness result to fitted parameters, self-citations, or definitional equivalence. No load-bearing step invokes the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work is presented as an empirical diagnostic without explicit modeling assumptions or new postulated entities.


