Federation over Text: Insight Sharing for Multi-Agent Reasoning

Dixi Yao; Manzil Zaheer; Tahseen Rabbani; Tian Li

REVIEW 4 major objections 6 minor 54 references

LLM agents solving different tasks can build a shared library of reusable metacognitive insights by exchanging only summarized reasoning traces, without sharing problems or gradients.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.5

2026-07-12 19:14 UTC pith:JDLC77MY

arxiv 2604.16778 v2 pith:JDLC77MY submitted 2026-04-18 cs.LG cs.AI

classification cs.LG cs.AI

keywords federated learning, multi-agent reasoning, LLM agents, insight library, metacognition, semantic aggregation, cross-domain transfer, reasoning traces

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Federation over Text (FoT), a framework that lets multiple independent LLM agents, each working on its own tasks, collaboratively construct a growing library of high-level reasoning insights. Agents perform local thinking and self-improvement, then send only concise, abstracted reasoning traces to a central server; the server clusters and distills those traces into cross-task, human-readable insights that are broadcast back for the next round. No raw problem instances, task instructions, or model parameters are shared, and no gradient updates or labeled supervision are required. Across the first three applications (math benchmarks, multi-domain collaboration, and real-world daily agent tasks), the resulting library raises average performance scores by about 25 percent while cutting reasoning tokens by about 4 percent. In the fourth application, research-paper insight discovery, a library built from one year's ICLR papers is reported to cover the core technical contributions of more than 80 percent of the following year's accepted papers.
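The round structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: in the paper, local reasoning and server-side clustering and distillation are carried out by LLM prompts, whereas here they are replaced by plain-Python stand-ins that group traces by a hand-assigned skill label. The names `Trace`, `local_step`, `server_aggregate`, and `run_round` are hypothetical, and the default cap of 20 insights simply mirrors the default library size noted in the ledger further down.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """An abstracted reasoning trace; it carries no raw problem text."""
    client_id: str
    skill: str    # hypothetical coarse label for the skill the trace describes
    summary: str  # short, logic-level description of the reasoning


def local_step(client_id, tasks, library):
    """Stand-in for local thinking and self-improvement.

    A real client would run its own LLM agent conditioned on `library`.
    Here each task is already reduced to a (skill, summary) pair, so the
    task text itself never leaves the client.
    """
    return [Trace(client_id, skill, summary) for skill, summary in tasks]


def server_aggregate(traces, max_insights=20):
    """Stand-in for server-side clustering and distillation."""
    clusters = defaultdict(list)
    for trace in traces:
        clusters[trace.skill].append(trace)

    # Rank skills by how many distinct clients reported them: a crude
    # proxy for "recurring across tasks". Ties break alphabetically.
    def rank(item):
        skill, members = item
        return (-len({t.client_id for t in members}), skill)

    ranked = sorted(clusters.items(), key=rank)
    return [
        f"{skill}: " + "; ".join(sorted({t.summary for t in members}))
        for skill, members in ranked[:max_insights]
    ]


def run_round(clients, library):
    """One federation round: local steps, upload, aggregate, broadcast."""
    uploads = []
    for client_id, tasks in clients.items():
        uploads.extend(local_step(client_id, tasks, library))
    return server_aggregate(uploads)


# Toy clients, invented for illustration only.
clients = {
    "math": [("case-split", "enumerate parity cases before algebra")],
    "code": [
        ("case-split", "handle empty input separately"),
        ("invariant", "state the loop invariant first"),
    ],
}
library = run_round(clients, library=[])
for insight in library:
    print(insight)
```

The server in this sketch only ever sees `Trace` objects, which is the property the protocol relies on; whether a real trace leaks task content is a separate empirical question that the paper addresses with its prompt-stealing and n-gram checks.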

Core claim

Federation over Text shows that independent LLM agents can iteratively federate at the pure semantic level—sharing only summarized reasoning traces—to produce a transferable insight library that improves both existing and future agents on related tasks, delivering measurable gains in accuracy and efficiency without any parameter aggregation or supervision signal.

What carries the argument

The insight library: a compact, server-distilled collection of explicit, cross-domain metacognitive principles obtained by clustering agents’ local reasoning traces and merging recurring skills into reusable step-by-step guidance.

Load-bearing premise

That server-side clustering and distillation prompts can reliably extract generalizable, non-noisy insights even when many of the uploaded traces come from incorrect local solutions.

What would settle it

Construct the insight library using only reasoning traces from problems the agents answered incorrectly; if the resulting library systematically lowers accuracy or true-thinking score on held-out tasks relative to isolated agents, the claim that noisy traces still yield useful insights is falsified.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Clean semantic-FL protocol for multi-agent insight sharing with solid multi-domain gains; the 80% research-coverage claim is the softest part and rests on an unstable LLM judge.

The useful core of this paper is a practical protocol: agents keep raw problems local, upload short logic-level reasoning traces, and a server clusters and distills them into a reusable insight library. That is a real, usable primitive for heterogeneous LLM agents, and it is cleaner than just dumping trajectories into RAG or hand-writing skill files.

What they do well is the breadth of the empirical case. Math (LiveMathBench suite), multi-domain (math + GPQA + LiveCodeBench + HLE), OpenClaw/PinchBench daily tasks, plus transfer across models and to held-out tasks, ablations on local self-improvement methods and server aggregation, and basic privacy checks (prompt-stealing F1 < 0.25, near-zero 4-gram overlap). Gains are consistent: roughly +25% average score with a small token reduction on the first three apps, and the library still helps when many traces come from wrong answers. The FL analogy table is honest rather than decorative, and they ship enough prompts and settings that someone can re-run the idea.

The soft spot is concentrated, not fatal. The headline research claim (a library built from year-N ICLR papers covers the core contributions of >80% of year-N+1 papers) depends entirely on Prompt 6 scored by Gemini. Cross-judge numbers swing hard: Table 13 shows the same library at about 4% under Qwen2.5-7B and roughly 22–43% under DeepSeek-R1-7B. There is no human IAA or expert audit of a random sample, only a qualitative appendix. So treat the 80% figure as an interesting LLM-as-judge signal, not as established scientific coverage. The math/multi-domain/OpenClaw results do not rest on that metric and still stand. Other ordinary caveats: commercial APIs, free parameters (library size, rounds), and the acknowledged risk that hallucinations can enter the library.
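One conventional way to quantify the judge instability described here is a chance-corrected agreement statistic such as Cohen's kappa, computed over per-pair verdicts from two judges (or a judge and a human annotator). The sketch below is illustrative only: the paper reports no such statistic, the function name `cohens_kappa` is ours, and the two verdict lists are invented toy data, not results from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their
    # own marginal rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented verdicts (1 = "insight guides the paper", 0 = "does not").
# Judge A accepts 8 of 10 pairs, judge B accepts 1 of 10.
judge_a = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
judge_b = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.3f}")
```

A kappa near zero, as in this toy case, means the two judges agree barely more than chance would predict even though each produces a confident-looking coverage rate. That is the kind of evidence a human validation study would need to rule out.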

This is for people building multi-agent systems, federated or privacy-conscious agent setups, and anyone who wants a lightweight alternative to full trajectory stores. It is an engineering systems paper, not a foundational theory result, but the protocol is new enough and the evidence is broad enough that it deserves a serious referee. I would engage with it, cite the protocol and the non-research experiments, and discount the 80% claim until it has human validation.

Referee Report

4 major / 6 minor

Summary. The paper proposes Federation over Text (FoT), a federated-learning-inspired multi-agent framework in which clients solve heterogeneous tasks with local LLM agents, share only abstracted natural-language reasoning traces (not raw instances or instructions), and a server clusters and distills those traces into a reusable cross-task insight library. Unlike gradient-based FL or single-task multi-agent pipelines, FoT operates purely at inference time with no parameter updates or required success labels. The authors evaluate four applications—mathematical problem solving (LiveMathBench suite), multi-domain collaboration (math/science/coding and HLE), real-world OpenClaw/PinchBench tasks, and ML research insight discovery from ICLR papers—reporting average performance gains of about 25% with modest token reductions on the first three suites, plus a claim that year-N libraries cover core contributions of >80% of year-(N+1) papers. Ablations cover local reasoning strategies, server aggregation methods, library size, agent participation, heterogeneous models, and transfer to unseen tasks/models.

Significance. If the main empirical claims hold, FoT is a useful systems contribution: it reframes multi-agent collaboration as semantic-level federation of metacognitive traces, with practical benefits for efficiency, cross-domain transfer, and data minimization without fine-tuning. Strengths include a broad experimental suite (multiple base models, OpenClaw tool-use setting, transfer and participation studies), modularity (plug-in local self-improvement methods and aggregation strategies), and explicit privacy-oriented checks (prompt-stealing and n-gram overlap). The FL analogy (Table 1, Algorithms 1–2) is clarifying rather than decorative. The research-insight application is ambitious and potentially high-impact if the guidance metric can be made trustworthy; even without it, the math/multi-domain/daily-task results already support a solid systems paper on reusable insight libraries for agent reasoning.

major comments (4)

Abstract and §4.4 / Table 3: the headline claim that FoT insights cover >80% of major contributions in subsequent ICLR papers is load-bearing for the research application but rests almost entirely on an LLM-as-judge (Prompt 6) whose scores are highly judge-dependent. Table 13 shows the same FoT library dropping from ~67–82% under Gemini judges to ~4% (Qwen2.5-7B) or ~22–43% (DeepSeek-R1-7B), and isolated/RAG baselines also swing wildly. No human inter-annotator agreement, expert audit of a random sample of (paper, insight) pairs, or calibration against author-stated contributions is reported (only qualitative examples in Appendix H.1). Please either (i) add a human validation study with clear agreement metrics and revise the 80% claim to what that study supports, or (ii) substantially demote/qualify the research-guidance numbers in the abstract and main text and treat the application as
Abstract / §1 and §4.1–4.3: the aggregate claim of “improves average performance scores by 25% while reducing the reasoning tokens by 4% across the first three applications” needs a precise, reproducible definition. Gains vary sharply by setting (e.g., DeepSeek math Round 3 is 0.537→0.553; Gemini math 0.846→0.928; PinchBench Gemini-2.5 Flash Lite ~20→52; high-thinking Gemini-3.1 Pro ~76→85). State whether 25% is macro-average of relative improvements, absolute points, best-round only, or a weighted mix; report per-application relative/absolute deltas with the same seeds and rounds used in the average; and clarify whether token counts include reflection/trace/server aggregation or only final solving tokens (Table 10 suggests non-trivial extra cost). Without this, the abstract number is hard to audit against the tables.
§4.2 (HLE), §4.3 (Gemini-2.5 Flash Lite), and the axiom that server prompts distill generalizable insights even from mostly incorrect traces: the paper correctly notes robustness to noisy/failed traces, but does not analyze failure modes of aggregation (when noisy traces pollute the library, how often insights are decorative vs. causal, or whether clustering collapses to domain-specific tips). A small controlled study—e.g., inject known-bad traces, measure library quality and downstream accuracy, or ablate relationship construction (Prompt 4) vs. naive concatenation under high error rates—would make the central “semantic aggregation works without labels” claim more credible. This is especially important because FoT’s novelty relative to ExpeL/ACE/skills is precisely unsupervised cross-task distillation.
§4.3 / Appendix D.3.2 and §6.1: baseline fairness and scope. ExpeL is adapted to a train/eval task split and still relies on success labels and raw trajectories; skills and RAG baselines help, but FoT’s largest gains appear on heterogeneous PinchBench tasks where experience-reuse methods are a priori weaker. Please (i) state more clearly which claims are “cross-domain insight sharing” vs. “any memory of past trajectories,” (ii) report ExpeL-style experience banks built only from abstracted traces (not raw trajectories) as a closer control, and (iii) for math/multi-domain, include a simple “share all traces without server distillation” baseline at matched token budget in the main text (Figure 5 helps but is easy to miss). This would isolate the value of Prompts 4–5 aggregation from mere multi-agent memory.
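The second comment's request for a precise definition of the 25% figure can be made concrete with the four before/after pairs quoted there. The sketch below computes two candidate aggregates over those pairs only: a macro-average of relative improvements and a mean of absolute points. It assumes the math scores are on a 0–1 scale and the other two on a 0–100 scale, uses the approximate values as quoted (several are marked "~"), and covers only four settings, so neither output should be read as a recomputation of the paper's headline number.

```python
from statistics import mean

# (baseline, with FoT, scale maximum) for the settings quoted in the comment.
# Scale maxima are an assumption made for this illustration.
quoted = {
    "DeepSeek math (Round 3)": (0.537, 0.553, 1.0),
    "Gemini math": (0.846, 0.928, 1.0),
    "PinchBench, Gemini-2.5 Flash Lite": (20.0, 52.0, 100.0),
    "Gemini-3.1 Pro (high thinking)": (76.0, 85.0, 100.0),
}


def relative_gain(base, new):
    """Improvement as a fraction of the baseline score."""
    return (new - base) / base


def absolute_points(base, new, scale_max):
    """Improvement in percentage points on a common 0-100 scale."""
    return 100.0 * (new - base) / scale_max


for name, (base, new, scale_max) in quoted.items():
    rel = 100.0 * relative_gain(base, new)
    pts = absolute_points(base, new, scale_max)
    print(f"{name}: {rel:.1f}% relative, {pts:.1f} points absolute")

macro_relative = mean(relative_gain(b, n) for b, n, _ in quoted.values())
mean_points = mean(absolute_points(b, n, s) for b, n, s in quoted.values())
print(f"macro-average relative gain: {100.0 * macro_relative:.1f}%")
print(f"mean absolute gain: {mean_points:.1f} points")
```

On these four pairs the two definitions differ by a wide margin, and the relative average is dominated by the single low-baseline setting. This is why the comment asks the authors to state which definition the abstract uses.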

minor comments (6)

Figures 2–4 and several tables use heavy redaction/placeholder styling in the manuscript text; ensure final figures have readable axis labels, error bars, and consistent round numbering (Round 1 = isolated).
Table 1 “Personalization” row is thin relative to the FL analogy; either expand with a concrete personalization experiment or mark it as future work more explicitly in §3.1.
§5 safeguarding: F1 < 0.25 and near-zero 4-gram Jaccard are encouraging; cite the exact attack protocol parameters and note residual risk that high-level skills can still leak task class (you already acknowledge non-zero similarity is expected).
§6.1 / Table 6: ACE and Evolving Prompts isolated baselines are very weak (0.37–0.38); briefly discuss hyperparameter search effort so readers do not over-read FoT’s relative lift on those methods.
Typos/style: “Federation overText” / spacing in title block; occasional “insight library,……” artifacts; unify “Round 2/3” reporting so best-round vs. fixed-round is always clear.
Related work: a short comparison to memory/skill banks (Anthropic Skills, Agents.md) is present; adding one paragraph on multi-agent experience sharing / trajectory stores would help position FoT for readers outside FL.
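For readers unfamiliar with the n-gram overlap check mentioned in the safeguarding comment, a 4-gram Jaccard similarity between a raw problem and its uploaded trace can be computed as below. This is a generic sketch: whitespace tokenization and lowercasing are assumptions, the paper's exact protocol may differ, the helper names `ngrams` and `jaccard` are ours, and the two example strings are invented.

```python
def ngrams(text, n=4):
    """Set of word n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(text_a, text_b, n=4):
    """Jaccard similarity of the two texts' n-gram sets, in [0, 1]."""
    grams_a, grams_b = ngrams(text_a, n), ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts shorter than n tokens: nothing to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented example: a private problem and an abstracted trace about it.
problem = "find the number of positive integers n such that n squared plus one is prime"
trace = "split into parity cases then test small values before generalizing"

print(jaccard(problem, trace))    # no shared 4-gram
print(jaccard(problem, problem))  # identical text
```

A score near zero shows only that no four-word span was copied verbatim. As the comment notes, a trace can still reveal the task class at a higher level of abstraction, which surface overlap does not measure.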

Circularity Check

0 steps flagged

No circular derivation chain; FoT is an empirical multi-agent systems paper whose claims rest on held-out benchmarks and subsequent-year papers, not on self-definitional or fitted-as-prediction steps.

The paper proposes Federation over Text (FoT), an iterative procedure in which local LLM agents produce reasoning traces on their own tasks and a server aggregates those traces into a shared insight library via curated prompts (Prompts 2–5, Algorithms 1–2, Figure 1). There are no equations, fitted parameters, or uniqueness theorems whose outputs are then re-presented as independent predictions. Performance claims (25 % average score lift, 4 % token reduction) are measured on standard external benchmarks (LiveMathBench, AIME, GPQA, LiveCodeBench, PinchBench, HLE) under isolated-agent and multi-round FoT conditions; the insight library is never constructed from the evaluation labels themselves. The research-insight application builds the library from year-N ICLR papers and scores coverage of year-N+1 papers with an LLM judge (Prompt 6, Table 3); Setting 2 further uses a model whose cutoff precedes the evaluation papers. Cross-model judge variance (Table 13) is a validity concern, not circularity: the metric is not forced by construction from the same inputs that produced the library. Self-citations appear only in ordinary related-work positioning and do not underwrite any load-bearing uniqueness or ansatz claim. Consequently the derivation chain contains no self-definitional loop, fitted-input-called-prediction, or self-citation-load-bearing reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 2 invented entities

Empirical multi-agent systems paper. Load-bearing premises are domain assumptions about LLM reflection and prompt-based aggregation rather than free physical constants or invented particles. Design knobs (library size, rounds, temperature) are ablated rather than secretly fitted to the headline metrics.

free parameters (2)

insight library size
Default ~20 insights chosen after ablation showing diminishing returns beyond that point; treated as a controllable hyperparameter rather than a fitted constant of nature.
number of federation rounds
Typically 2–3 rounds; performance plateaus observed and reported, not optimized post-hoc against a hidden test set.

axioms (3)

domain assumption: LLM agents can produce useful abstracted reasoning traces from their own solutions without ground-truth correctness labels.
Stated as a design premise; experiments include settings where most traces come from incorrect answers yet still improve the library.
ad hoc to paper: Prompt-based clustering and distillation on the server yields generalizable cross-task insights rather than noise or memorized answers.
Core aggregation mechanism; ablated against simple concatenation, Chain-of-Density, and context compaction.
domain assumption: LLM-as-judge can reliably decide whether an insight 'guides' the core contribution of a later paper under the four stated criteria.
Used for the research-discovery application; authors provide multi-model cross-validation and manual spot-checks.

invented entities (2)

insight library (no independent evidence)
purpose: Shared, evolving repository of high-level metacognitive principles distilled from multi-agent reasoning traces.
Central artifact of the framework; defined operationally by the aggregation prompts.
reasoning trace, logic-level summary (no independent evidence)
purpose: Privacy-preserving, reusable abstraction of an agent's problem-solving process that omits raw instances.
Defined by Prompt 3; authors measure leakage via prompt-stealing and n-gram overlap.

pith-pipeline@v1.1.0-grok45 · 59353 in / 2443 out tokens · 40396 ms · 2026-07-12T19:14:40.905356+00:00 · methodology

Original abstract

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes without sharing actual problem instances or task instructions. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. Iteratively, each client runs an LLM agent that does local thinking and self-improvement on their specific tasks independently, and shares reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, real-world daily tasks, and machine learning research insight discovery. Specifically, it improves average performance scores by 25% while reducing the reasoning tokens by 4% across the first three applications. In the research insight discovery application, FoT is able to generate insights that cover over 80% of the major contributions in the subsequent papers.

Figures

Figures reproduced from arXiv: 2604.16778 by Dixi Yao, Manzil Zaheer, Tahseen Rabbani, Tian Li.

**Figure 2.** Comparison of reasoning accuracies and efficiency (mea…

**Figure 4.** Comparisons of the true-thinking score (Zhao et al., 2025), measuring reasoning steps whose removal would lead the agent to a different answer. FoT exhibits fewer decorative reasoning steps. With the constraints, the agent no longer hallucinates the position of OH (in Round 2 of FoT), but correctly identifies the Karplus relationship (in Round 3). The problem instance, our summary of reasoning traces in …

**Figure 5.** Comparison of reasoning efficiency and accuracy, averaged …

**Figure 6.** Accuracy and number of output tokens using …

**Figure 8.** Comparison of FoT, isolated agents, and RAG in their ability to generate insights that can cover the core …

**Figure 9.** Paper guidance rate across different levels of local agent participation where each agent reads one paper. Guidance rate increases as participation increases (i.e., more input tasks). Axes: number of total agents (2260, 452, 226, 91) versus guidance rate (%); series: Oral, Spotlight, Poster, All.

Review history (2 revisions) →

Reference graph

Works this paper leans on

54 extracted references · 11 linked inside Pith

[1]

Answering questions by meta-reasoning over multiple chains of thought

Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5942–5966, 2023

2023
[2]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 17889–17904, 2024

2024
[3]

The ai scientist: Towards fully automated open-ended scientific discovery, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. arXiv:2408.06292

Pith/arXiv arXiv 2024
[4]

Agent laboratory: Using LLM agents as research assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043. Association for Computational Linguistics, 2025

2025
[5]

Researchtown: Simulator of human research community

Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, and Jiaxuan You. Researchtown: Simulator of human research community. In Proceedings of the International Conference on Machine Learning (ICML), pages 73051–73096, 2025

2025
[6]

Vibe researching as wolf coming: Can AI agents with skills replace or augment social scientists?,

Yongjun Zhang. Vibe researching as wolf coming: Can AI agents with skills replace or augment social scientists?,
[7]

A survey of frontiers in LLM reasoning: Inference scaling, learning to reason, and agentic systems, 2025

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, et al. A survey of frontiers in LLM reasoning: Inference scaling, learning to reason, and agentic systems, 2025. arXiv:2504.09037

arXiv 2025
[8]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017

2017
[9]

Chain of agents: Large language models collaborating on long-context tasks

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems (NeurIPS), 37:132208–132237, 2024

2024
[10]

Multi-agent collaboration mechanisms: A survey of LLMs, 2025

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of LLMs, 2025. arXiv:2501.06322

Pith/arXiv arXiv 2025
[11]

Multi-agent collaboration via evolving orchestration

Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, et al. Multi-agent collaboration via evolving orchestration. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2025

2025
[12]

The station: An open-world environment for AI-driven discovery, 2025

Stephen Chung and Wenyu Du. The station: An open-world environment for AI-driven discovery, 2025. arXiv:2511.06309

arXiv 2025
[13]

Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, et al. Evoscientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026. arXiv:2603.08127

arXiv 2026
[14]

Shall we team up: Exploring spontaneous cooperation of competing LLM agents

Zengqing Wu, Run Peng, Shuyuan Zheng, Qianying Liu, Xu Han, Brian I Kwon, Makoto Onizuka, Shaojie Tang, and Chuan Xiao. Shall we team up: Exploring spontaneous cooperation of competing LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5163–5186, 2024

2024
[15]

More agents is all you need

Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye, et al. More agents is all you need. Transactions on Machine Learning Research (TMLR), 2024

2024
[16]

Thought communication in multiagent collaboration

Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, and Kun Zhang. Thought communication in multiagent collaboration. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2025

2025
[17]

aiXiv: A next-generation open access ecosystem for scientific discovery generated by AI scientists, 2025

Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, et al. aiXiv: A next-generation open access ecosystem for scientific discovery generated by AI scientists, 2025. arXiv:2508.15126

arXiv 2025
[18]

Fedcot: Communication-efficient federated reasoning enhancement for large language models, 2025

Chuan Li, Qianyi Zhao, Fengran Mo, and Cen Chen. Fedcot: Communication-efficient federated reasoning enhancement for large language models, 2025. arXiv:2508.10020

Pith/arXiv arXiv 2025
[19]

Social learning: Towards collaborative learning with large language models, 2023

Amirkeivan Mohtashami, Florian Hartmann, Sian Gooding, Lukas Zilka, Matt Sharifi, et al. Social learning: Towards collaborative learning with large language models, 2023. arXiv:2312.11441

Pith/arXiv arXiv 2023
[20]

Skills (agent skills repository), n.d.

Anthropic. Skills (agent skills repository), n.d. GitHub repository
[21]

Claude scientific skills: A comprehensive collection of scientific tools for claude ai, 2026

K-Dense Inc. Claude scientific skills: A comprehensive collection of scientific tools for claude ai, 2026. skills covering databases, packages, integrations, and analysis tools

2026
[22]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

2024
[23]

Metacognitive reuse: Turning recurring LLM reasoning into concise behaviors, 2025

Aniket Didolkar, Nicolas Ballas, Sanjeev Arora, and Anirudh Goyal. Metacognitive reuse: Turning recurring LLM reasoning into concise behaviors, 2025. arXiv:2509.13237

arXiv 2025
[24]

Test-time recursive thinking: Self-improvement without external feedback, 2026

Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, and Weizhu Chen. Test-time recursive thinking: Self-improvement without external feedback, 2026. arXiv:2602.03094

arXiv 2026
[25]

Hyperagents, 2026

Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents, 2026. arXiv:2603.19461

arXiv 2026
[26]

Agentic context engineering: Evolving contexts for self-improving language models, 2025

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models, 2025. arXiv:2510.04618

Pith/arXiv arXiv 2025
[27]

Gödel agent: A self-referential agent framework for recursively self-improvement

Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, and William Yang Wang. Gödel agent: A self-referential agent framework for recursively self-improvement. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 27890–27913, 2025

2025
[28]

Evolving prompts in-context: An open-ended, self-replicating perspective

Jianyu Wang, Zhiqiang Hu, and Lidong Bing. Evolving prompts in-context: An open-ended, self-replicating perspective. In Proceedings of the International Conference on Machine Learning (ICML), pages 63036–63087, 2025

2025
[29]

Self-adapting language models

Adam Zweiger, Jyothish Pari, Han Guo, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2025

2025
[30]

Meta context engineering via agentic skill evolution, 2026

Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution, 2026. arXiv:2601.21557

arXiv 2026
[31]

Reasoning pattern alignment merging for adaptive reasoning, 2026

Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen, and Hongzhi Yin. Reasoning pattern alignment merging for adaptive reasoning, 2026. arXiv:2601.03506

arXiv 2026
[32]

OpenClaw-RL: Train any agent simply by talking, 2026

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. OpenClaw-RL: Train any agent simply by talking, 2026. arXiv:2603.10165

Pith/arXiv arXiv 2026
[33]

From sparse to dense: Gpt-4 summarization with chain of density prompting

Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, and Noémie Elhadad. From sparse to dense: Gpt-4 summarization with chain of density prompting. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 68–74, 2023

2023
[34]

Claude cookbooks: Examples and guides for building with claude, 2024

Anthropic. Claude cookbooks: Examples and guides for building with claude, 2024. Accessed: 2026-03-10

2024
[35]

American invitational mathematics examination (AIME) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2024, 2024

2024
[36]

American invitational mathematics examination (AIME) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2025, 2025

2025
[37]

Are your LLMs capable of stable reasoning?

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your LLMs capable of stable reasoning? In Findings of the Association for Computational Linguistics: ACL 2025, pages 17594–17632, 2025

2025
[38]

Can aha moments be fake? identifying true and decorative thinking steps in chain-of-thought, 2025

Jiachen Zhao, Yiyou Sun, Weiyan Shi, and Dawn Song. Can aha moments be fake? identifying true and decorative thinking steps in chain-of-thought, 2025. arXiv:2510.24941

Pith/arXiv arXiv 2025
[39]

GPQA: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof q&a benchmark. In Proceedings of the First Conference on Language Modeling (COLM), 2024

2024
[40]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

2025
[41]

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam, 2025. arXiv:2501.14249.
[42]

OpenClaw Contributors. Openclaw: Your personal open-source AI assistant, 2025. Accessed: 2026-04-21.
[43]

Kilo.ai. Pinchbench: Benchmarking LLM agents on real-world tasks, 2026.
[44]

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13370–13467, 2025.
[45]

Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively, 2025. arXiv:2509.26603.
[46]

Ruofeng Yang, Yongcan Li, and Shuai Li. Aris: Fully autonomous research via adversarial multi-agent collaboration, 2026.
[47]

Dingjie Song, Hanrong Zhang, Dawei Liu, Yixin Liu, Zongxia Li, Zhengqing Yuan, Siqi Zhang, and Lichao Sun. Dr. claw: An AI research workspace from idea to paper, 2026.
[48]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020.
[49]

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
[50]

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, et al. Claw-eval: Toward trustworthy evaluation of autonomous agents, 2026. arXiv:2604.06132.
[51]

Zeyang Sha and Yang Zhang. Prompt stealing attacks against large language models, 2024. arXiv:2402.12959.
[52]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (TACL), 12:157–173, 2024.
[53]

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. arXiv:2603.28052.
[54]

Eva Y Puspaningrum, Budi Nugroho, Ariyono Setiawan, and Nuraini Hariyanti. Detection of text similarity for indication plagiarism using winnowing algorithm based k-gram and jaccard coefficient. In Journal of Physics: Conference Series, volume 1569, page 022044. IOP Publishing, 2020.
