GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Abhishek H S; Aswanth Krishnan; Pavan C Shekar

arxiv: 2606.14470 · v2 · pith:EBNPNK2Wnew · submitted 2026-06-12 · 💻 cs.AI · cs.CL· cs.LG

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar , Abhishek H S , Aswanth Krishnan This is my paper

Pith reviewed 2026-06-27 04:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords gitagent memorychain-of-thoughtversion controlself-consistencyreasoning benchmarksretrieval

0 comments

The pith

Memory from past problems boosts agent accuracy only when new problems closely match stored examples, while git storage adds auditability without changing results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GitOfThoughts to store reasoning trees in a git repository so that thoughts become commits, scores become notes, and histories can be replayed, diffed, or merged. It then runs controlled tests of five memory stores on two benchmarks with multiple model sizes and repeats to check whether access to past reasoning actually raises accuracy on fresh problems. The central finding is that retrieval adds nothing unless the new problem exceeds roughly 0.8 cosine similarity with something already stored; below that threshold the model simply locates a near-copy rather than learning a reusable method. Self-consistency through multiple samples remains the only intervention that reliably helps on dissimilar tasks. Git therefore matches the accuracy of simpler stores while supplying version-control features at no performance cost.

Core claim

GitOfThoughts represents an agent's reasoning tree as a git repository in which each scored thought is a commit, outcomes are tags, and retrieval reduces to git log over the agent's own history. Across two benchmarks, two model sizes, and repeated trials, none of the five memory stores (including the git version) improved accuracy on new problems unless those problems were nearly identical to stored ones; even a 4.5-times-larger model extracted no reusable methods, only better near-copy detection. The sole consistent gain on novel problems came from self-consistency sampling.

What carries the argument

Git repository as memory store for reasoning trees, turning thoughts into commits and enabling standard version-control operations on the agent's history.

If this is right

Self-consistency sampling improves accuracy on new problems while memory retrieval does not.
Even substantially larger models still cannot derive reusable methods from dissimilar examples.
All five memory stores produce statistically equivalent accuracy on dissimilar problems.
Git storage supplies replay, diff, and merge capabilities without reducing accuracy relative to other stores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Current models appear to solve reasoning tasks by surface matching rather than by internalizing general procedures.
Version-controlled reasoning histories could support merging or auditing across multiple independent agents.
Efforts to improve reasoning might shift from retrieval augmentation toward techniques that force extraction of methods across low-similarity cases.

Load-bearing premise

The benchmarks contain enough genuinely novel problems that are not near-duplicates of the stored memory, and the five memory implementations differ only in storage mechanism.

What would settle it

An experiment in which a model extracts and correctly applies a reasoning method from a worked example whose cosine similarity to the target problem is below 0.8.

Figures

Figures reproduced from arXiv: 2606.14470 by Abhishek H S, Aswanth Krishnan, Pavan C Shekar.

**Figure 1.** Figure 1: The reasoning tree is a git repository. Each scored thought is a commit with author, timestamp, and content-hash metadata; scores are git notes; validation outcomes are tags (success_*, failed_*); pruned attempts remain in history rather than vanishing. The winning path merges to main, an answer-free lesson is distilled to a long-lived memory branch, and retrieval is git log (--grep, -S, tag filters) over … view at source ↗

**Figure 2.** Figure 2: What actually moves accuracy (MATH-500, n=500). Per-arm ∆ vs. no-memory with paired bootstrap 95% CIs. Only self-consistency (green) clears zero; every memory substrate and the static-few-shot control sit within noise. does ( [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: The copyability threshold. Memory helps only above cosine ≈0.8 (shaded), where the retrieved case is a near-duplicate. The method band (cosine 0.72: same method, different numbers, non-copyable answer) is null at both scales: a 4.5× larger model steepens the step (doubling the near-duplicate payoff) without unlocking method transfer. 7B plateaued at 65%), while same-subject and unrelated tiers stay null. S… view at source ↗

read the original abstract

Large language model reasoning leaves no trace once it is done. The steps of a chain of thought disappear when the context window closes, a pruned search branch is just gone, and memory buffers cannot be diffed, merged, or audited. Code, infrastructure, and experiments are all version-controlled. Reasoning is not. GitOfThoughts stores an agent's reasoning tree as a git repository. Every scored thought becomes a commit, scores become notes, outcomes become tags, and retrieval is just git log over the agent's own history. We use this to test something simple. Does giving an agent memory from past problems actually make it more accurate? We tried five memory stores (none, a markdown file, a vector database, a graph, and git) across two benchmarks, two model sizes, and several pre-registered repeat experiments. The answer, on new problems, is no, including one promising early result that did not hold up when we repeated it. Memory only helps once the problem being solved is nearly identical to something already in memory (cosine similarity above about 0.8); below that, it does nothing. In other words, the model is finding the answer rather than learning the method. Even a model 4.5x larger still cannot pull a reusable method out of a worked example; it just gets better at spotting near-copies. The only thing that reliably helped on new problems was generating several answers and picking the most common one (self-consistency). So the case for using git as the memory store is not that it retrieves better. It is that it gives auditability, history, and the ability to merge two agents' memories, at no cost to accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Git for reasoning trees is a useful engineering move, but the memory-transfer claim depends on unshown equivalence across the five stores.

read the letter

The paper's core move is turning an agent's chain-of-thought into a git repo so every scored step is a commit, scores are notes, and retrieval is just git log. That part is straightforward and gives replay, diff, and merge for free. They then ran the same agent with five memory backends (none, markdown, vector DB, graph, git) on two benchmarks, two model sizes, and pre-registered repeats. The headline result is that memory only raises accuracy when the new problem sits above ~0.8 cosine similarity to something already stored; below that it does nothing. Self-consistency was the only thing that helped on genuinely new items. Even the 4.5x larger model just got better at spotting near-copies rather than extracting reusable methods.

What lands is the empirical threshold and the honest note that one early positive result failed to replicate. The git implementation itself adds auditability without an accuracy penalty, which is a clean practical point.

The soft spot is the assumption that the five stores differed only in their storage layer. The abstract does not show that retrieval prompts, top-k logic, context formatting, or scoring were held identical; any drift there would make the "memory does nothing" result hard to interpret. The 0.8 cutoff also needs the exact embedding model and a check that the held-out problems really lack method-level overlap even if surface forms differ. Those details matter for the central claim.

This is worth a serious referee for people working on agent memory and traceability. The empirical limit on transfer is worth checking even if the git wrapper is mostly an implementation detail. I would send it to review.

Referee Report

3 major / 2 minor

Summary. The paper introduces GitOfThoughts, which represents an LLM agent's reasoning tree as a git repository so that thoughts become commits, scores become notes, and retrieval reduces to git log. It reports experiments across five memory stores (none, markdown, vector DB, graph, git), two benchmarks, two model sizes, and pre-registered repeats, finding that memory improves accuracy only when test problems have cosine similarity >0.8 to stored items; below that threshold memory confers no benefit, even for a 4.5 imes larger model. Self-consistency is the only intervention that reliably helps on novel problems. The authors conclude that models locate near-copies rather than extract reusable methods and that git's value is auditability and mergeability rather than retrieval performance.

Significance. If the empirical pattern holds, the result challenges the widespread assumption that retrieved memory improves reasoning on out-of-distribution problems and supplies concrete evidence that current LLMs do not reliably abstract methods from single worked examples. The pre-registered multi-condition design and explicit comparison of storage back-ends constitute a methodological strength that grounds the negative finding about memory transfer.

major comments (3)

[Methods] Methods section: the claim that the five memory stores differ only in storage/retrieval backend requires explicit confirmation that prompting templates, top-k selection logic, context-injection format, and scoring procedures were held identical; without this, any divergence confounds the central result that memory confers no benefit below cosine 0.8.
[Results] Results on similarity threshold: the 0.8 cosine cutoff is reported without naming the embedding model, the precise similarity formula, or verification that held-out problems contain no methodologically analogous but surface-dissimilar items; this detail is load-bearing for the interpretation that models extract no reusable methods.
[Abstract] Abstract and experimental reporting: pre-registered repeats are mentioned, yet the manuscript supplies neither full methods, raw data, error bars, nor implementation artifacts; this absence prevents independent verification of the non-replicating early result and of the equivalence assumption.

minor comments (2)

[Figures] Figure captions should state the exact embedding model and similarity metric used for the cosine analysis.
[Tables] Notation for the five memory conditions is introduced inconsistently between text and tables; standardize the labels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate planned revisions to improve clarity and verifiability.

read point-by-point responses

Referee: [Methods] Methods section: the claim that the five memory stores differ only in storage/retrieval backend requires explicit confirmation that prompting templates, top-k selection logic, context-injection format, and scoring procedures were held identical; without this, any divergence confounds the central result that memory confers no benefit below cosine 0.8.

Authors: The experimental protocol held all elements other than the storage/retrieval backend constant: identical prompting templates, top-k=5 selection, context-injection format, and scoring procedures were used across conditions. We will insert an explicit confirmation paragraph in the Methods section documenting these controls. revision: yes
Referee: [Results] Results on similarity threshold: the 0.8 cosine cutoff is reported without naming the embedding model, the precise similarity formula, or verification that held-out problems contain no methodologically analogous but surface-dissimilar items; this detail is load-bearing for the interpretation that models extract no reusable methods.

Authors: We will name the embedding model (sentence-transformers/all-MiniLM-L6-v2), state that cosine similarity was computed on L2-normalized vectors, and add a note on manual verification that held-out problems contained no methodologically analogous but lexically dissimilar items. These details will be placed in the Results section. revision: yes
Referee: [Abstract] Abstract and experimental reporting: pre-registered repeats are mentioned, yet the manuscript supplies neither full methods, raw data, error bars, nor implementation artifacts; this absence prevents independent verification of the non-replicating early result and of the equivalence assumption.

Authors: We will expand the Methods section with additional pre-registration and procedural details and add error bars to all figures. The full code, prompts, and datasets will be released in a public repository upon acceptance to support verification; the current manuscript length constraints limit inclusion of raw data tables. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison with no derivations or self-referential reductions

full rationale

The paper presents an empirical study comparing five memory-store implementations (none, markdown, vector DB, graph, git) on two benchmarks across model sizes and pre-registered repeats. The central finding—that memory improves accuracy only for cosine similarity >0.8 and otherwise provides no benefit—is reported as a direct observation from the experiments, not derived from any equations, fitted parameters renamed as predictions, or self-citation chains. No mathematical derivation chain exists; the work contains no first-principles results, uniqueness theorems, or ansatzes that could reduce to inputs by construction. The assumptions flagged in the skeptic note (equivalence of stores and novelty of test problems) are methodological limitations but do not constitute circularity under the defined patterns, as there is no load-bearing step that equates a claimed result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; the work relies on standard assumptions about LLM prompting and benchmark validity rather than new axioms or entities with independent evidence.

axioms (1)

domain assumption LLM reasoning steps can be generated, scored, and stored as discrete commits without altering core model behavior.
Implicit in the git storage approach described in the abstract.

invented entities (1)

GitOfThoughts memory store no independent evidence
purpose: Version-controlled storage of agent reasoning trees for replay, diff, and merge.
New system introduced to enable the memory experiments.

pith-pipeline@v0.9.1-grok · 5854 in / 1389 out tokens · 41090 ms · 2026-06-27T04:46:14.800710+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 11 canonical work pages · 10 internal anchors

[1]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenen- baum, and Igor Mordatch. Improving factuality and rea- soning in language models through multiagent debate. arXiv:2305.14325, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceed- ings of EMNLP, 2023

2023
[4]

Hatchet: A distributed, durable task queue (soft- ware)

Hatchet. Hatchet: A distributed, durable task queue (soft- ware). https://github.com/hatchet-dev/hatchet, 2024

2024
[5]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

2021
[6]

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

Abhishek HS, Pavan C. Shekar, Arpit Jain, and Aswanth Krishnan. ReTreVal: Reasoning tree with validation – a hybrid framework for enhanced LLM multi-step reasoning. arXiv:2601.02880, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao 9 Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InPro- ceedings of SOSP, 2023

2023
[8]

SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brah- man, Shiyu Huang, Chandra Bhagavatula, Prithviraj Am- manabrolu, Yejin Choi, and Xiang Ren. SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks. InNeurIPS, 2023

2023
[9]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InNeurIPS, 2023

2023
[10]

CLIN: A continually learn- ing language agent for rapid task adaptation and generaliza- tion

Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Pe- ter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. CLIN: A continually learn- ing language agent for rapid task adaptation and generaliza- tion. arXiv:2310.10134, 2023

work page arXiv 2023
[11]

Re- thinking the role of demonstrations: What makes in-context learning work? InProceedings of EMNLP, 2022

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Re- thinking the role of demonstrations: What makes in-context learning work? InProceedings of EMNLP, 2022

2022
[12]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonza- lez. MemGPT: Towards LLMs as operating systems. arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of UIST, 2023

2023
[14]

Sentence-BERT: Sen- tence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sen- tence embeddings using siamese BERT-networks. InPro- ceedings of EMNLP, 2019

2019
[15]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof Q&A benchmark. arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

2023
[17]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more ef- fective than scaling model parameters. arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandku- mar. V oyager: An open-ended embodied agent with large language models. arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

ScienceWorld: Is your agent smarter than a 5th grader? InProceedings of EMNLP, 2022

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? InProceedings of EMNLP, 2022

2022
[20]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023

2023
[21]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022

2022
[22]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv:2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report. arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InNeurIPS, 2023

2023
[25]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023

2023
[26]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Accelerating the machine learning lifecycle with MLflow.IEEE Data Engineering Bulletin, 41(4), 2018

Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Gh- odsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, and Corey Zumar. Accelerating the machine learning lifecycle with MLflow.IEEE Data Engineering Bulletin, 41(4), 2018

2018
[28]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. InProceedings of AAAI, 2024

2024
[29]

MemoryBank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. InProceedings of AAAI, 2024

2024
[30]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Hao- han Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. InICML, 2024. 10

2024

[1] [1]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenen- baum, and Igor Mordatch. Improving factuality and rea- soning in language models through multiagent debate. arXiv:2305.14325, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceed- ings of EMNLP, 2023

2023

[4] [4]

Hatchet: A distributed, durable task queue (soft- ware)

Hatchet. Hatchet: A distributed, durable task queue (soft- ware). https://github.com/hatchet-dev/hatchet, 2024

2024

[5] [5]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021

2021

[6] [6]

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

Abhishek HS, Pavan C. Shekar, Arpit Jain, and Aswanth Krishnan. ReTreVal: Reasoning tree with validation – a hybrid framework for enhanced LLM multi-step reasoning. arXiv:2601.02880, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao 9 Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InPro- ceedings of SOSP, 2023

2023

[8] [8]

SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brah- man, Shiyu Huang, Chandra Bhagavatula, Prithviraj Am- manabrolu, Yejin Choi, and Xiang Ren. SwiftSage: A generative agent with fast and slow thinking for complex interactive tasks. InNeurIPS, 2023

2023

[9] [9]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InNeurIPS, 2023

2023

[10] [10]

CLIN: A continually learn- ing language agent for rapid task adaptation and generaliza- tion

Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Pe- ter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. CLIN: A continually learn- ing language agent for rapid task adaptation and generaliza- tion. arXiv:2310.10134, 2023

work page arXiv 2023

[11] [11]

Re- thinking the role of demonstrations: What makes in-context learning work? InProceedings of EMNLP, 2022

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Re- thinking the role of demonstrations: What makes in-context learning work? InProceedings of EMNLP, 2022

2022

[12] [12]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonza- lez. MemGPT: Towards LLMs as operating systems. arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of UIST, 2023

2023

[14] [14]

Sentence-BERT: Sen- tence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sen- tence embeddings using siamese BERT-networks. InPro- ceedings of EMNLP, 2019

2019

[15] [15]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof Q&A benchmark. arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

2023

[17] [17]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more ef- fective than scaling model parameters. arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandku- mar. V oyager: An open-ended embodied agent with large language models. arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

ScienceWorld: Is your agent smarter than a 5th grader? InProceedings of EMNLP, 2022

Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? InProceedings of EMNLP, 2022

2022

[20] [20]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023

2023

[21] [21]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022

2022

[22] [22]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv:2502.12110, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report. arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InNeurIPS, 2023

2023

[25] [25]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023

2023

[26] [26]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Accelerating the machine learning lifecycle with MLflow.IEEE Data Engineering Bulletin, 41(4), 2018

Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Gh- odsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, and Corey Zumar. Accelerating the machine learning lifecycle with MLflow.IEEE Data Engineering Bulletin, 41(4), 2018

2018

[28] [28]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. InProceedings of AAAI, 2024

2024

[29] [29]

MemoryBank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. InProceedings of AAAI, 2024

2024

[30] [30]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Hao- han Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. InICML, 2024. 10

2024