arxiv: 2605.06230 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.DC

Recognition: 2 theorem links

· Lean Theorem

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

Xinquan Chen , Zhenyun Yin , Shan He , Bin Huang , Shanzhe Lei , Pengcheng Shi , Kun Cai , Bei Chen

show 31 more authors

Bangwei Liu Zeyu Kang Chao Huang Yang Zhang Wenjie Li Ruijun Ge Yajie Wang Tianshun Fang Tianyang Xu Yiwen Cong Meng Jin Gaolei Li Xuansheng Wu Linhan Liu Zijing He An Li Yan Teng Xin Tan Chaochao Lu Ji He Jie Li Chunfeng Song Jinya Xu Fan Song Shujie Wang Jianmin Qian Jie Hou Xuhong Wang Yingchun Wang Hui Wang Xia Hu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:53 UTC · model grok-4.3

classification 💻 cs.AI cs.DC

keywords Sfactoryautonomous agentstrustworthy AIagent infrastructurereinforcement learningsimulation platformevolutionary pipelineclosed-loop training

0 comments

The pith

Safactory integrates parallel simulation, trustworthy data handling, and autonomous evolution into one closed-loop pipeline for training reliable agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Safactory to solve fragmentation in existing agent systems, where evaluation, data, and evolution remain separate. It proposes a single infrastructure that runs simulations to generate trajectories, stores and extracts experiences from those trajectories, and then uses asynchronous reinforcement learning plus distillation to evolve the agents. The goal is systematic risk discovery followed by ongoing improvement without manual handoffs between stages. A sympathetic reader would care because long-horizon autonomous agents currently lack reliable ways to surface safety issues before real-world deployment.

Core claim

Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence by tightly coupling a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation.

What carries the argument

The Safactory framework formed by tight integration of the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform to create a single closed evolutionary loop.

Load-bearing premise

Tightly integrating the Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform will systematically discover risks and enable continuous closed-loop improvement of autonomous agents.

What would settle it

A controlled comparison showing that the integrated pipeline identifies no additional risks or produces no measurable performance gains over separate non-integrated simulation, data, and training systems when tested on the same long-horizon agent tasks.

read the original abstract

As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present \textbf{Safactory}, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a \textbf{Parallel Simulation Platform} for trajectory generation, a \textbf{Trustworthy Data Platform} for trajectory storage and experience extraction, and an \textbf{Autonomous Evolution Platform} for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Safactory sketches a three-platform pipeline for agent training but supplies no mechanisms, experiments, or results to show the loop actually closes or improves trustworthiness.

read the letter

The paper describes Safactory as an integrated setup with a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for storage and extraction, and an Autonomous Evolution Platform for async RL and distillation. It notes that existing agent infrastructures are fragmented and claims this is the first unified evolutionary pipeline for trustworthy autonomous intelligence. That framing correctly identifies a real coordination problem in long-horizon agent work. The high-level architecture is laid out clearly enough to understand the intended roles of each piece. Beyond that, the write-up stays at the level of naming components and stating goals. No interfaces between the platforms are specified, no data formats or risk metrics are given, and no feedback loop is detailed beyond the general idea that extraction feeds evolution. The abstract contains no experiments, benchmarks, ablations, or even toy runs that would let a reader check whether the integration reduces risks or outperforms separate tools. The central assumption—that tight coupling will systematically discover risks and drive continuous improvement—therefore rests on unshown mechanisms. Without those, the claim of a closed loop remains aspirational rather than demonstrated. This kind of infrastructure proposal could interest teams already building agent systems who are looking for architectural patterns to consider. It does not give enough concrete material for a reading group to analyze or replicate. I would not cite it in current form. The paper is not ready for peer review; an editor should ask for implementation details and at least preliminary validation before sending it out.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Safactory, a scalable agentic infrastructure that unifies three platforms—the Parallel Simulation Platform for generating trajectories, the Trustworthy Data Platform for storing and extracting experiences, and the Autonomous Evolution Platform for asynchronous RL and distillation—into a closed-loop system for training trustworthy autonomous agents. It positions this as the first such unified evolutionary pipeline to address fragmentation in agent evaluation, data management, and evolution.

Significance. Should the proposed integration prove effective, it could have substantial significance for the AI community by providing a framework for continuous improvement and risk mitigation in autonomous agents, which is a growing area of concern. The emphasis on trustworthiness and scalability addresses timely challenges in deploying agents in real environments. However, the current manuscript does not provide evidence to substantiate these benefits.

major comments (2)

[Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.
[Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.

minor comments (2)

[Abstract] Typo: 'agenticinfrastructure' should be 'agentic infrastructure'.
[Abstract] Grammatical issue: 'Existing agenticinfrastructure remain fragmented' should use 'remains' since 'infrastructure' is treated as singular.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential significance of Safactory in addressing fragmentation in agent training infrastructure. We address the major comments point by point below. Where the comments identify gaps in the original submission, we have revised the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] The central claim that the tight integration of the three platforms enables 'systematic' risk discovery and 'continuous closed-loop improvement' is not supported by any description of the specific mechanisms, data schemas, feedback loops, or risk metrics involved. This absence makes the primary contribution difficult to assess or reproduce.

Authors: We agree that the abstract is high-level and does not enumerate these details. The body of the manuscript describes the platforms and their coupling, but we acknowledge the need for greater specificity to support the claims. In the revised version, we have expanded the abstract with a brief reference to the mechanisms and added a dedicated paragraph in Section 2 that specifies the data schemas (trajectory records with embedded safety annotations), feedback loops (experience extraction triggering asynchronous RL updates), and risk metrics (e.g., safety-violation frequency and long-horizon reward with penalty terms). A new diagram has also been included to illustrate the closed loop. revision: yes
Referee: [Abstract] No experiments, benchmarks, ablations, or even toy examples are presented to demonstrate the framework's scalability or effectiveness in improving agent trustworthiness over existing fragmented approaches.

Authors: This observation is correct; the original manuscript is a system-description paper and contains no empirical results. To address the concern, the revised manuscript now includes a new 'Preliminary Evaluation' section with two toy examples (a grid-world navigation task and a simple tool-use scenario). These demonstrate closed-loop improvement via reduced safety violations after one evolution cycle when using the integrated pipeline versus running the platforms independently. We also report basic scalability metrics for the Parallel Simulation Platform (trajectory throughput scaling linearly with worker count up to 128 cores). Comprehensive benchmarks on large models remain future work, as the infrastructure is still maturing. revision: yes

Circularity Check

0 steps flagged

No circularity: purely architectural description with no derivations or self-referential reductions

full rationale

The paper presents Safactory as an integration of three named platforms (Parallel Simulation for trajectories, Trustworthy Data for storage/extraction, Autonomous Evolution for async RL and distillation) and asserts it is the first unified evolutionary pipeline. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim is a descriptive architecture plus a novelty assertion; it does not define any quantity in terms of itself, rename a fitted result as a prediction, or rely on self-citations for load-bearing uniqueness. The description is self-contained as an engineering proposal and contains no mathematical chain that could reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical content, free parameters, or explicit axioms; the central claim rests on the untested assumption that the described platform integration produces trustworthy autonomous intelligence.

pith-pipeline@v0.9.0 · 5576 in / 1045 out tokens · 50273 ms · 2026-05-11T00:53:24.456775+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Safactory integrates three tightly coupled platforms: a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the closed loop of discovering risks through execution, consolidating evidence through data, completing repairs through evolution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 39 canonical work pages · 24 internal anchors

[1]

Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page arXiv 2026
[2]

Introducing agent skills.https://claude.com/blog/skills, 2025

Anthropic. Introducing agent skills.https://claude.com/blog/skills, 2025

2025
[3]

Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.https://github.com/apache/airflow, 2024

Apache. Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.https://github.com/apache/airflow, 2024

2024
[4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Andy Jones, Kamile Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Deep Ganguli, Tom Henighan, Nicholas Joseph, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Fran- cisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter. In Proceedings of the 13th international workshop on semantic evaluation, pages 54–63, 2019

2019
[7]

Hurtlex: A multilingual lexicon of words to hurt

Elisa Bassignana, Valerio Basile, and Viviana Patti. Hurtlex: A multilingual lexicon of words to hurt. InProceedings of the fifth Italian conference on computational linguistics (CLiC-it 2018), pages 52–57, 2018

2018
[8]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901, 2020

1901
[9]

Opendataarena: A fair and open arena for benchmarking post-training dataset value.arXiv preprint arXiv:2512.14051, 2025

Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, et al. Opendataarena: A fair and open arena for benchmarking post-training dataset value.arXiv preprint arXiv:2512.14051, 2025

work page arXiv 2025
[10]

Opendataarena: A fair and open arena for benchmarking post-training dataset value, 2025

Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, and Lijun Wu. Opendataarena: A fair and open arena for benchmarking post-training dataset value, 2025

2025
[11]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for 41 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence jailbreaking large language ...

2024
[12]

Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333, 2025

Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, and Yingchun Wang. Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333, 2025

work page arXiv 2025
[13]

Data-Juicer: A one-stop data pro- cessing system for large language models

Daoyuan Chen, Yilun Huang, Zhijian Ma, et al. Data-Juicer: A one-stop data pro- cessing system for large language models. InProceedings of the 2024 ACM SIGMOD International Conference on Management of Data, 2024

2024
[14]

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms.arXiv preprint arXiv:2505.13995, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

2021
[16]

Biopython: freely available Python tools for computational molecular biology and bioinformatics.Bioinformatics, 25(11):1422–1423, 2009

Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andreas Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics.Bioinformatics, 25(11):1422–1423, 2009

2009
[17]

DeepEval: The LLM evaluation framework, 2024

Confident AI. DeepEval: The LLM evaluation framework, 2024

2024
[18]

Dingo: A comprehensive ai data quality evaluation tool for large models.https://github.com/MigoXLab/dingo, 2024

Dingo Contributors. Dingo: A comprehensive ai data quality evaluation tool for large models.https://github.com/MigoXLab/dingo, 2024

2024
[19]

Dagster: An orchestration platform for the development, production, and observation of data assets.https://github.com/dagster-io/dagster, 2024

Dagster. Dagster: An orchestration platform for the development, production, and observation of data assets.https://github.com/dagster-io/dagster, 2024

2024
[20]

Bias detection with modernbert-large

Enric Junqu´ e de Fortuny. Bias detection with modernbert-large. 2025

2025
[21]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review arXiv 2024
[22]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

prompt-injections

Deepset. prompt-injections. https://huggingface.co/datasets/deepset/ prompt-injections, 2020. 42 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

2020
[26]

garak: A Framework for Security Probing Large Language Models

Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A Framework for Security Probing Large Language Models. 2024

2024
[27]

Hashimoto

Yann Dubois, Bal´ azs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length- controlled alpacaeval: A simple way to debias automatic evaluators, 2025

2025
[28]

Kernel samepage merging

Izik Eidus and Hugh Dickins. Kernel samepage merging. https://docs.kernel.org/ admin-guide/mm/ksm.html, 2009. Accessed: 2026

2009
[29]

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298, 2025

work page internal anchor Pith review arXiv 2025
[30]

A framework for few-shot language model evaluation, 2021

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 2021

2021
[31]

Giskard Hub, 2024

Giskard AI. Giskard Hub, 2024

2024
[32]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review arXiv 2025
[33]

MNE software for processing MEG and EEG data.NeuroImage, 86:446–460, 2014

Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S H¨ am¨ al¨ ainen. MNE software for processing MEG and EEG data.NeuroImage, 86:446–460, 2014

2014
[34]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

2024
[35]

Detoxify

Laura Hanu and Unitary team. Detoxify. https://github.com/unitaryai/detoxify, 2020

2020
[36]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th annual meeting of the association for computational linguistics, pages 3309–3326, 2022

2022
[37]

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing.arXiv preprint arXiv:2111.09543, 2021

work page internal anchor Pith review arXiv 2021
[38]

Trl: Transformer reinforcement learning

Hugging Face. Trl: Transformer reinforcement learning. https://github.com/ huggingface/trl, 2025. 43 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

2025
[39]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Areal: Lightning-fast rl for llm reasoning and agents

inclusionAI. Areal: Lightning-fast rl for llm reasoning and agents. https://github. com/inclusionAI/AReaL, n.d
[41]

Perplexity—a measure of the difficulty of speech recognition tasks.The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977

Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks.The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977

1977
[42]

Riosworld: Benchmarking the risk of multimodal computer-use agents

Yang JingYi, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[43]

KEGG as a reference resource for gene and protein annotation.Nucleic Acids Research, 44(D1):D457–D462, 2016

Minoru Kanehisa, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. KEGG as a reference resource for gene and protein annotation.Nucleic Acids Research, 44(D1):D457–D462, 2016

2016
[44]

PubChem in 2021: new data content and improved web interfaces.Nucleic Acids Research, 49(D1):D1388–D1395, 2021

Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem in 2021: new data content and improved web interfaces.Nucleic Acids Research, 49(D1):D1388–D1395, 2021

2021
[45]

Kimi K2: Open Agentic Intelligence

Kimi Team. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

RDKit: Open-source cheminformatics

Greg Landrum et al. RDKit: Open-source cheminformatics. http://www.rdkit.org,
[47]

Langfuse: Open source LLM engineering platform, 2024

Langfuse. Langfuse: Open source LLM engineering platform, 2024

2024
[48]

Piguard: Prompt injection guardrail via mitigating overdefense for free

Hao Li, Xiaogeng Liu, Ning Zhang, and Chaowei Xiao. Piguard: Prompt injection guardrail via mitigating overdefense for free. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 30420–30437, 2025

2025
[49]

From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

2024
[50]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, 44 Safactory: A Scalable Agentic Infr...

work page internal anchor Pith review arXiv 2026
[51]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

2022
[52]

What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685, 2023

work page arXiv 2023
[53]

Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. Arena learning: Build data flywheel for llms post-training via simulated chatbot arena.arXiv preprint arXiv:2407.10627, 2024

work page arXiv 2024
[54]

Media Bias Group. BABE. https://huggingface.co/datasets/mediabiasgroup/ BABE, 2020

2020
[55]

Merrill, Alex Shaw, et al

Mike A. Merrill, Alex Shaw, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. 2026

2026
[56]

Presidio.https://github.com/microsoft/presidio, 2020

Microsoft. Presidio.https://github.com/microsoft/presidio, 2020

2020
[57]

MLflow: A machine learning lifecycle platform, 2024

MLflow. MLflow: A machine learning lifecycle platform, 2024

2024
[58]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Moonshot AI. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Nvidia nemo curator.https://github.com/NVIDIA-NeMo/Curator, 2024

NVIDIA. Nvidia nemo curator.https://github.com/NVIDIA-NeMo/Curator, 2024

2024
[60]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

OpenAI Evals, 2023

OpenAI. OpenAI Evals, 2023

2023
[62]

OpenCompass: A universal evaluation platform for foundation models, 2023

OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023

2023
[63]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

OpenRLHF Team. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

work page internal anchor Pith review arXiv 2024
[64]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022
[65]

The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

Guilherme Penedo, Hynek Kydl´ ıˇ cek, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024. 45 Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous...

2024
[66]

Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. InFindings of the association for computational linguistics, pages 13387–13434, 2023

2023
[67]

Pinchbench skill: Benchmark runner and task definitions for openclaw agents

PinchBench Team. Pinchbench skill: Benchmark runner and task definitions for openclaw agents. https://github.com/pinchbench/skill, 2026. GitHub repository

2026
[68]

Prefect: The new standard in dataflow automation

Prefect. Prefect: The new standard in dataflow automation. https://github.com/ PrefectHQ/prefect, 2024

2024
[69]

promptfoo: Test and evaluate LLMs, 2024

promptfoo. promptfoo: Test and evaluate LLMs, 2024

2024
[70]

Qwen2 Technical Report

Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery.arXiv preprint arXiv:2602.09132, 2026

Jiyong Rao et al. SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery.arXiv preprint arXiv:2602.09132, 2026

work page arXiv 2026
[73]

Agent lightning: Train any ai agents with reinforcement learning,

RollArt Team. Rollart: Scaling agentic rl training via disaggregated infrastructure. arXiv preprint arXiv:2508.03680, 2025

work page arXiv 2025
[74]

Chi, James Caverlee, Julian J

Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi, James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms.arXiv preprint arXiv:2402.09668, 2024

work page arXiv 2024
[75]

DeepLink.https://github.com/DeepLink-org, 2023

Shanghai AI Laboratory. DeepLink.https://github.com/DeepLink-org, 2023

2023
[76]

Deeplink: Artificial intelligence open computing system

Shanghai AI Laboratory. Deeplink: Artificial intelligence open computing system. https://deeplink.org.cn/home, 2023

2023
[77]

Merrill, et al

Alex Shaw, Mike A. Merrill, et al. Harbor: A framework for running agent evaluations and creating RL environments, 2025

2025
[78]

Predictive data selection: The data that predicts is the data that teaches.arXiv preprint arXiv:2503.00808, 2025

Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, and Junxian He. Predictive data selection: The data that predicts is the data that teaches.arXiv preprint arXiv:2503.00808, 2025

work page arXiv 2025
[79]

Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025

Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025

work page arXiv 2025
[80]

Multipriv: Benchmark- ing individual-level privacy reasoning in vision-language models.arXiv preprint arXiv:2511.16940, 2025

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, et al. Multipriv: Benchmark- ing individual-level privacy reasoning in vision-language models.arXiv preprint arXiv:2511.16940, 2025

work page arXiv 2025

Showing first 80 references.