pith. machine review for the scientific record. sign in

arxiv: 2605.06230 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.DC

Recognition: 2 theorem links

· Lean Theorem

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:53 UTC · model grok-4.3

classification 💻 cs.AI cs.DC
keywords Sfactoryautonomous agentstrustworthy AIagent infrastructurereinforcement learningsimulation platformevolutionary pipelineclosed-loop training
0
0 comments X

The pith

Safactory integrates parallel simulation, trustworthy data handling, and autonomous evolution into one closed-loop pipeline for training reliable agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Safactory to solve fragmentation in existing agent systems, where evaluation, data, and evolution remain separate. It proposes a single infrastructure that runs simulations to generate trajectories, stores and extracts experiences from those trajectories, and then uses asynchronous reinforcement learning plus distillation to evolve the agents. The goal is systematic risk discovery followed by ongoing improvement without manual handoffs between stages. A sympathetic reader would care because long-horizon autonomous agents currently lack reliable ways to surface safety issues before real-world deployment.

Core claim

Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence by tightly coupling a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation.

What carries the argument

The Safactory framework formed by tight integration of the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform to create a single closed evolutionary loop.

Load-bearing premise

Tightly integrating the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform will systematically discover risks and enable continuous closed-loop improvement of autonomous agents.

What would settle it

A controlled comparison showing that the integrated pipeline identifies no additional risks or produces no measurable performance gains over separate non-integrated simulation, data, and training systems when tested on the same long-horizon agent tasks.

read the original abstract

As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present \textbf{Safactory}, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a \textbf{Parallel Simulation Platform} for trajectory generation, a \textbf{Trustworthy Data Platform} for trajectory storage and experience extraction, and an \textbf{Autonomous Evolution Platform} for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Safactory, a scalable agentic infrastructure that unifies three platforms—the Parallel Simulation Platform for generating trajectories, the Trustworthy Data Platform for storing and extracting experiences, and the Autonomous Evolution Platform for asynchronous RL and distillation—into a closed-loop system for training trustworthy autonomous agents. It positions this as the first such unified evolutionary pipeline to address fragmentation in agent evaluation, data management, and evolution.

Significance. Should the proposed integration prove effective, it could have substantial significance for the AI community by providing a framework for continuous improvement and risk mitigation in autonomous agents, which is a growing area of concern. The emphasis on trustworthiness and scalability addresses timely challenges in deploying agents in real environments. However, the current manuscript does not provide evidence to substantiate these benefits.

major comments (2)
  1. [Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.
  2. [Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.
minor comments (2)
  1. [Abstract] Typo: 'agenticinfrastructure' should be 'agentic infrastructure'.
  2. [Abstract] Grammatical issue: 'Existing agenticinfrastructure remain fragmented' should use 'remains' since 'infrastructure' is treated as singular.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential significance of Safactory in addressing fragmentation in agent training infrastructure. We address the major comments point by point below. Where the comments identify gaps in the original submission, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.

    Authors: We agree that the abstract is high-level and does not enumerate these details. The body of the manuscript describes the platforms and their coupling, but we acknowledge the need for greater specificity to support the claims. In the revised version, we have expanded the abstract with a brief reference to the mechanisms and added a dedicated paragraph in Section 2 that specifies the data schemas (trajectory records with embedded safety annotations), feedback loops (experience extraction triggering asynchronous RL updates), and risk metrics (e.g., safety-violation frequency and long-horizon reward with penalty terms). A new diagram has also been included to illustrate the closed loop. revision: yes

  2. Referee: [Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.

    Authors: This observation is correct; the original manuscript is a system-description paper and contains no empirical results. To address the concern, the revised manuscript now includes a new 'Preliminary Evaluation' section with two toy examples (a grid-world navigation task and a simple tool-use scenario). These demonstrate closed-loop improvement via reduced safety violations after one evolution cycle when using the integrated pipeline versus running the platforms independently. We also report basic scalability metrics for the Parallel Simulation Platform (trajectory throughput scaling linearly with worker count up to 128 cores). Comprehensive benchmarks on large models remain future work, as the infrastructure is still maturing. revision: yes

Circularity Check

0 steps flagged

No circularity: purely architectural description with no derivations or self-referential reductions

full rationale

The paper presents Safactory as an integration of three named platforms (Parallel Simulation for trajectories, Trustworthy Data for storage/extraction, Autonomous Evolution for async RL and distillation) and asserts it is the first unified evolutionary pipeline. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim is a descriptive architecture plus a novelty assertion; it does not define any quantity in terms of itself, rename a fitted result as a prediction, or rely on self-citations for load-bearing uniqueness. The description is self-contained as an engineering proposal and contains no mathematical chain that could reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical content, free parameters, or explicit axioms; the central claim rests on the untested assumption that the described platform integration produces trustworthy autonomous intelligence.

pith-pipeline@v0.9.0 · 5576 in / 1045 out tokens · 50273 ms · 2026-05-11T00:53:24.456775+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 39 canonical work pages · 24 internal anchors

  1. [1]

    Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

  2. [2]

    Introducing agent skills.https://claude.com/blog/skills, 2025

    Anthropic. Introducing agent skills.https://claude.com/blog/skills, 2025

  3. [3]

    Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.https://github.com/apache/airflow, 2024

    Apache. Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.https://github.com/apache/airflow, 2024

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Andy Jones, Kamile Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Deep Ganguli, Tom Henighan, Nicholas Joseph, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  6. [6]

    Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter

    Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Fran- cisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter. In Proceedings of the 13th international workshop on semantic evaluation, pages 54–63, 2019

  7. [7]

    Hurtlex: A multilingual lexicon of words to hurt

    Elisa Bassignana, Valerio Basile, and Viviana Patti. Hurtlex: A multilingual lexicon of words to hurt. InProceedings of the fifth Italian conference on computational linguistics (CLiC-it 2018), pages 52–57, 2018

  8. [8]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  9. [9]

    Opendataarena: A fair and open arena for benchmarking post-training dataset value.arXiv preprint arXiv:2512.14051, 2025

    Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, et al. Opendataarena: A fair and open arena for benchmarking post-training dataset value.arXiv preprint arXiv:2512.14051, 2025

  10. [10]

    Opendataarena: A fair and open arena for benchmarking post-training dataset value, 2025

    Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, and Lijun Wu. Opendataarena: A fair and open arena for benchmarking post-training dataset value, 2025

  11. [11]

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for 41 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence jailbreaking large language ...

  12. [12]

    Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333, 2025

    Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, and Yingchun Wang. Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333, 2025

  13. [13]

    Data-Juicer: A one-stop data pro- cessing system for large language models

    Daoyuan Chen, Yilun Huang, Zhijian Ma, et al. Data-Juicer: A one-stop data pro- cessing system for large language models. InProceedings of the 2024 ACM SIGMOD International Conference on Management of Data, 2024

  14. [14]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025

  15. [15]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  16. [16]

    Biopython: freely available Python tools for computational molecular biology and bioinformatics.Bioinformatics, 25(11):1422–1423, 2009

    Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andreas Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics.Bioinformatics, 25(11):1422–1423, 2009

  17. [17]

    DeepEval: The LLM evaluation framework, 2024

    Confident AI. DeepEval: The LLM evaluation framework, 2024

  18. [18]

    Dingo: A comprehensive ai data quality evaluation tool for large models.https://github.com/MigoXLab/dingo, 2024

    Dingo Contributors. Dingo: A comprehensive ai data quality evaluation tool for large models.https://github.com/MigoXLab/dingo, 2024

  19. [19]

    Dagster: An orchestration platform for the development, production, and observation of data assets.https://github.com/dagster-io/dagster, 2024

    Dagster. Dagster: An orchestration platform for the development, production, and observation of data assets.https://github.com/dagster-io/dagster, 2024

  20. [20]

    Bias detection with modernbert-large

    Enric Junqu´ e de Fortuny. Bias detection with modernbert-large. 2025

  21. [21]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  22. [22]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  23. [23]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  24. [24]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

  25. [25]

    prompt-injections

    Deepset. prompt-injections. https://huggingface.co/datasets/deepset/ prompt-injections, 2020. 42 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

  26. [26]

    garak: A Framework for Security Probing Large Language Models

    Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A Framework for Security Probing Large Language Models. 2024

  27. [27]

    Hashimoto

    Yann Dubois, Bal´ azs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length- controlled alpacaeval: A simple way to debias automatic evaluators, 2025

  28. [28]

    Kernel samepage merging

    Izik Eidus and Hugh Dickins. Kernel samepage merging. https://docs.kernel.org/ admin-guide/mm/ksm.html, 2009. Accessed: 2026

  29. [29]

    AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298, 2025

  30. [30]

    A framework for few-shot language model evaluation, 2021

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 2021

  31. [31]

    Giskard Hub, 2024

    Giskard AI. Giskard Hub, 2024

  32. [32]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    GLM-4.5 Team. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

  33. [33]

    MNE software for processing MEG and EEG data.NeuroImage, 86:446–460, 2014

    Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S H¨ am¨ al¨ ainen. MNE software for processing MEG and EEG data.NeuroImage, 86:446–460, 2014

  34. [34]

    Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

  35. [35]

    Detoxify

    Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020

  36. [36]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th annual meeting of the association for computational linguistics, pages 3309–3326, 2022

  37. [37]

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

    Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing.arXiv preprint arXiv:2111.09543, 2021

  38. [38]

    Trl: Transformer reinforcement learning

    Hugging Face. Trl: Transformer reinforcement learning. https://github.com/ huggingface/trl, 2025. 43 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

  39. [39]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  40. [40]

    Areal: Lightning-fast rl for llm reasoning and agents

    inclusionAI. Areal: Lightning-fast rl for llm reasoning and agents. https://github. com/inclusionAI/AReaL, n.d

  41. [41]

    Perplexity—a measure of the difficulty of speech recognition tasks.The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977

    Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks.The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977

  42. [42]

    Riosworld: Benchmarking the risk of multimodal computer-use agents

    Yang JingYi, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  43. [43]

    KEGG as a reference resource for gene and protein annotation.Nucleic Acids Research, 44(D1):D457–D462, 2016

    Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. KEGG as a reference resource for gene and protein annotation.Nucleic Acids Research, 44(D1):D457–D462, 2016

  44. [44]

    PubChem in 2021: new data content and improved web interfaces.Nucleic Acids Research, 49(D1):D1388–D1395, 2021

    Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem in 2021: new data content and improved web interfaces.Nucleic Acids Research, 49(D1):D1388–D1395, 2021

  45. [45]

    Kimi K2: Open Agentic Intelligence

    Kimi Team. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  46. [46]

    RDKit: Open-source cheminformatics

    Greg Landrum et al. RDKit: Open-source cheminformatics. http://www.rdkit.org,

  47. [47]

    Langfuse: Open source LLM engineering platform, 2024

    Langfuse. Langfuse: Open source LLM engineering platform, 2024

  48. [48]

    Piguard: Prompt injection guardrail via mitigating overdefense for free

    Hao Li, Xiaogeng Liu, Ning Zhang, and Chaowei Xiao. Piguard: Prompt injection guardrail via mitigating overdefense for free. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 30420–30437, 2025

  49. [49]

    From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning

    Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

  50. [50]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, 44 Safactory: A Scalable Agentic Infr...

  51. [51]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

  52. [52]

    What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

    Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685, 2023

  53. [53]

    Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

  54. [54]

    Media Bias Group. BABE. https://huggingface.co/datasets/mediabiasgroup/ BABE, 2020

  55. [55]

    Merrill, Alex Shaw, et al

    Mike A. Merrill, Alex Shaw, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. 2026

  56. [56]

    Presidio.https://github.com/microsoft/presidio, 2020

    Microsoft. Presidio.https://github.com/microsoft/presidio, 2020

  57. [57]

    MLflow: A machine learning lifecycle platform, 2024

    MLflow. MLflow: A machine learning lifecycle platform, 2024

  58. [58]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Moonshot AI. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  59. [59]

    Nvidia nemo curator.https://github.com/NVIDIA-NeMo/Curator, 2024

    NVIDIA. Nvidia nemo curator.https://github.com/NVIDIA-NeMo/Curator, 2024

  60. [60]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  61. [61]

    OpenAI Evals, 2023

    OpenAI. OpenAI Evals, 2023

  62. [62]

    OpenCompass: A universal evaluation platform for foundation models, 2023

    OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023

  63. [63]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    OpenRLHF Team. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

  64. [64]

    Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  65. [65]

    The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

    Guilherme Penedo, Hynek Kydl´ ıˇ cek, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024. 45 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous...

  66. [66]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. InFindings of the association for computational linguistics, pages 13387–13434, 2023

  67. [67]

    Pinchbench skill: Benchmark runner and task definitions for openclaw agents

    PinchBench Team. Pinchbench skill: Benchmark runner and task definitions for openclaw agents. https://github.com/pinchbench/skill, 2026. GitHub repository

  68. [68]

    Prefect: The new standard in dataflow automation

    Prefect. Prefect: The new standard in dataflow automation. https://github.com/ PrefectHQ/prefect, 2024

  69. [69]

    promptfoo: Test and evaluate LLMs, 2024

    promptfoo. promptfoo: Test and evaluate LLMs, 2024

  70. [70]

    Qwen2 Technical Report

    Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

  71. [71]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  72. [72]

    SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery.arXiv preprint arXiv:2602.09132, 2026

    Jiyong Rao et al. SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery.arXiv preprint arXiv:2602.09132, 2026

  73. [73]

    Agent lightning: Train any ai agents with reinforcement learning,

    RollArt Team. Rollart: Scaling agentic rl training via disaggregated infrastructure. arXiv preprint arXiv:2508.03680, 2025

  74. [74]

    Chi, James Caverlee, Julian J

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms.arXiv preprint arXiv:2402.09668, 2024

  75. [75]

    DeepLink.https://github.com/DeepLink-org, 2023

    Shanghai AI Laboratory. DeepLink.https://github.com/DeepLink-org, 2023

  76. [76]

    Deeplink: Artificial intelligence open computing system

    Shanghai AI Laboratory. Deeplink: Artificial intelligence open computing system. https://deeplink.org.cn/home, 2023

  77. [77]

    Merrill, et al

    Alex Shaw, Mike A. Merrill, et al. Harbor: A framework for running agent evaluations and creating RL environments, 2025

  78. [78]

    Predictive data selection: The data that predicts is the data that teaches.arXiv preprint arXiv:2503.00808, 2025

    Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches.arXiv preprint arXiv:2503.00808, 2025

  79. [79]

    Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025

    Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025

  80. [80]

    Multipriv: Benchmark- ing individual-level privacy reasoning in vision-language models.arXiv preprint arXiv:2511.16940, 2025

    Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, et al. Multipriv: Benchmark- ing individual-level privacy reasoning in vision-language models.arXiv preprint arXiv:2511.16940, 2025

Showing first 80 references.