NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
Pith reviewed 2026-05-12 04:07 UTC · model grok-4.3
The pith
NanoResearch uses tri-level co-evolution of skills, memory, and policy to personalize AI research automation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NanoResearch is a multi-agent framework that addresses personalization gaps through tri-level co-evolution: a skill bank distills recurring operations into compact procedural rules reusable across projects, a memory module maintains user- and project-specific experience that grounds planning in each user's research history, and label-free policy learning converts free-form feedback into persistent parameter updates of the planner. These layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Experiments show this delivers substantial gains over state-of-the-art AI research systems.
What carries the argument
The tri-level co-evolution mechanism: a skill bank for reusable procedural rules, a memory module for user-specific history, and label-free policy learning for preference updates. Together these enable progressive adaptation without requiring users to formalize their preferences explicitly.
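To make the interaction between the three layers concrete, here is a minimal, hypothetical sketch of such a loop in Python. The class names, the feedback update rule, and the plan string are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the tri-level co-evolution loop: skills ground the
# plan, outcomes enrich memory, and free-form feedback realigns the planner.
from dataclasses import dataclass, field


@dataclass
class SkillBank:
    """Distills recurring operations into compact procedural rules."""
    rules: dict = field(default_factory=dict)

    def distill(self, operation, rule):
        # A real system would deduplicate and compress; here we just store.
        self.rules[operation] = rule


@dataclass
class Memory:
    """Retains user- and project-specific experience across sessions."""
    episodes: list = field(default_factory=list)

    def record(self, episode):
        self.episodes.append(episode)


@dataclass
class Planner:
    """Holds policy parameters updated label-free from feedback."""
    preference_weight: float = 0.0

    def update_from_feedback(self, feedback_score, lr=0.1):
        # Label-free update: nudge the parameter toward the feedback signal
        # without any explicit preference label from the user.
        self.preference_weight += lr * (feedback_score - self.preference_weight)


def co_evolution_cycle(bank, memory, planner, task, feedback_score):
    # One cycle: plan using current skills and policy, record the episode,
    # distill a new skill, then internalize the user's feedback.
    plan = f"{task} with {len(bank.rules)} skills, w={planner.preference_weight:.2f}"
    memory.record(plan)
    bank.distill(task, f"rule for {task}")
    planner.update_from_feedback(feedback_score)
    return plan
```

Each cycle leaves the skill bank larger, the memory richer, and the planner's parameter closer to the feedback signal, which is the reinforcement pattern the review describes.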
If this is right
- The system produces research outputs that match individual users' resource limits and methodological preferences rather than uniform defaults.
- Reusable skills distilled from past projects reduce repeated effort in new work.
- User-specific memory allows planning to draw on personal research history for more relevant decisions.
- Label-free feedback from free-form comments leads to ongoing planner adjustments without needing formal preference statements.
- Overall performance improves and costs decrease as the system runs through multiple cycles for the same user.
Where Pith is reading between the lines
- The same co-evolution structure could be tested in other multi-agent domains such as code generation or experiment design to see if personalization emerges without domain-specific redesign.
- Long-term deployment might produce measurable divergence in research style and efficiency between users with different feedback patterns.
- The approach suggests a path to evaluate personalization by tracking per-user cost-quality curves rather than aggregate benchmarks.
- If feedback internalization works, it could reduce the need for explicit user modeling in other adaptive AI systems.
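The per-user cost-quality evaluation suggested above can be expressed as a small helper. The function names and the data shape are assumptions chosen for illustration, not anything the paper specifies.

```python
# Hypothetical per-user evaluation: track (cost, quality) per cycle for each
# user and report quality per unit cost, rather than a single aggregate score.
def cost_quality_curves(history):
    """history: {user_id: [(cost, quality), ...]} in cycle order.
    Returns {user_id: [quality / cost per cycle]}."""
    return {
        user: [quality / cost for cost, quality in runs]
        for user, runs in history.items()
    }


def is_improving(curve):
    """True if efficiency is non-decreasing across cycles."""
    return all(later >= earlier for earlier, later in zip(curve, curve[1:]))
```

Personalization would show up as per-user curves that rise at different rates depending on each user's feedback pattern, which aggregate benchmarks would average away.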
Load-bearing premise
The three layers of skills, memory, and policy will reliably interact and improve each other using only implicit feedback to produce better personalized outputs over time.
What would settle it
Run NanoResearch and a non-co-evolving baseline on a sequence of similar research tasks for the same user profile and check whether output quality rises and total resource cost falls across cycles.
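A settling experiment along these lines could be scripted roughly as follows. The simulated quality functions stand in for real system runs and are pure assumptions; only the comparison logic matters.

```python
# Toy harness for the proposed test: run an adaptive system and a static
# baseline on the same task sequence and compare trends across cycles.
def run_cycles(quality_fn, n_cycles=5):
    return [quality_fn(t) for t in range(n_cycles)]


def settles_claim(adaptive, baseline):
    # The claim is supported if quality rises across cycles and the final
    # adaptive score exceeds the non-co-evolving baseline's final score.
    rising = all(b > a for a, b in zip(adaptive, adaptive[1:]))
    return rising and adaptive[-1] > baseline[-1]


# Stand-in models: a co-evolving system improves per cycle; the baseline is
# flat. Real runs would also track resource cost per cycle.
adaptive_quality = run_cycles(lambda t: 0.60 + 0.05 * t)
baseline_quality = run_cycles(lambda t: 0.60)
```

The same check applied to the baseline against itself should fail, which is what makes the comparison falsifiable.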
Original abstract
LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user's research history. A label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NanoResearch, a multi-agent framework for personalized LLM-powered research automation. It introduces tri-level co-evolution via a skill bank that distills reusable procedural rules, a memory module that retains user- and project-specific experience, and label-free policy learning that converts free-form feedback into planner parameter updates. The central claim is that these layers co-evolve productively to deliver substantial gains over state-of-the-art AI research systems while progressively refining output quality and reducing costs across successive cycles.
Significance. If validated, the work would address a genuine gap in research automation by making systems adaptable to individual researchers' preferences and histories rather than producing uniform outputs. The co-evolution architecture is a coherent conceptual contribution that could inform future adaptive agent designs. However, the absence of any reported metrics, baselines, or controls in the manuscript substantially limits its current significance and falsifiability.
major comments (2)
- [Abstract] Abstract, final sentence: the assertion of 'substantial gains over state-of-the-art AI research systems' and 'progressively refines itself to produce better research at lower cost' is load-bearing for the primary contribution yet is unsupported by any quantitative results, specific metrics, baselines, number of trials, or controls, rendering the empirical claim unverifiable.
- [Abstract] Abstract, paragraph 3: the description of the three layers co-evolving (skills produce richer memory, memory informs planning, label-free feedback internalizes preferences) is presented without algorithms, interaction protocols, or pseudocode, so the mechanism by which the components are claimed to reinforce one another cannot be evaluated for internal consistency or feasibility.
minor comments (1)
- [Abstract] Abstract: 'LLM' is used without expansion on first occurrence (though standard in the field).
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying key areas where the presentation of our contributions can be strengthened. We address each major comment below and commit to revisions that improve verifiability and clarity without altering the core framework.
Point-by-point responses
-
Referee: [Abstract] Abstract, final sentence: the assertion of 'substantial gains over state-of-the-art AI research systems' and 'progressively refines itself to produce better research at lower cost' is load-bearing for the primary contribution yet is unsupported by any quantitative results, specific metrics, baselines, number of trials, or controls, rendering the empirical claim unverifiable.
Authors: We agree that the abstract's empirical claims should be directly supported by concrete details to allow immediate verification. The manuscript's Experiments section reports comparisons against multiple baselines on repeated research tasks and documents progressive improvements in output quality and resource usage across cycles. To resolve the concern, we will revise the abstract to include explicit references to the baselines, the number of evaluation cycles, and the observed trends in quality and cost metrics, while ensuring the full quantitative results remain prominently detailed in the main text. revision: yes
-
Referee: [Abstract] Abstract, paragraph 3: the description of the three layers co-evolving (skills produce richer memory, memory informs planning, label-free feedback internalizes preferences) is presented without algorithms, interaction protocols, or pseudocode, so the mechanism by which the components are claimed to reinforce one another cannot be evaluated for internal consistency or feasibility.
Authors: The tri-level co-evolution process, including the skill distillation procedure, memory update and retrieval rules, and the label-free policy update from free-form feedback, is specified with equations and interaction flow in Section 3 of the manuscript. We acknowledge that the abstract provides only a high-level summary. In the revision we will add a compact algorithmic outline of the co-evolution loop and include pseudocode as a new figure or appendix entry so that the reinforcement mechanisms can be directly inspected for consistency and feasibility. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper proposes an architectural framework for tri-level co-evolution of skills, memory, and policy in a multi-agent research automation system. No mathematical derivations, equations, fitted parameters, or first-principles predictions are described in the abstract or claimed structure. Claims of performance gains rest on empirical experiments rather than any self-referential reduction of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can reliably distill recurring operations into reusable procedural rules and internalize implicit preferences from free-form feedback
invented entities (3)
- Skill bank: no independent evidence
- Memory module: no independent evidence
- Label-free policy learning: no independent evidence