Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

Alex Kwon

arXiv: 2606.25449 · v1 · pith:6G2QJB3Unew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-25 21:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords brittle memory · reclaim evaluation · source-first compression · memory in language models · lossy memory · correctability · dialogue systems

The pith

A language model's lossy memory produces confident wrong answers where an empty memory would abstain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a model retains a drifted conclusion but drops the source data that produced it, the model treats the stale value as correct and refuses correction, whereas an empty memory leads it to abstain. This pattern holds in the same direction across seven models with no reversals. Reclaim evaluation tests the effect by compressing an interaction at a fixed budget, then measuring whether a correction restores the known answer, scored exactly against ground truth. A source-first compression policy that keeps the original data and drops only the re-derivable conclusion restores correctability at the same budget. The problem compounds in memory loops because one dropped-source error becomes uncorrectable and contaminates later steps.

Core claim

Brittle memory is the consistent behavioral pattern in which keeping a wrong conclusion after dropping its source is worse than keeping nothing at all; across seven models the direction never reverses. Reclaim evaluation measures this by applying fixed-budget compression to a drifted interaction, then testing whether a correction recovers the known answer, with exact scoring against ground truth and no judge. Correctability depends on whether the answer-determining source survives compression rather than on model capability. A one-line source-first policy restores correctability where the source is compact and identifiable, and this replicates across three deployed memory systems and on real dialogue (MultiWOZ).

What carries the argument

Reclaim evaluation: a fixed-budget compression followed by a correction test that scores recovery of a known answer against ground truth without a judge.
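The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released harness: `compress_empty`, `compress_lossy`, `compress_source_first`, `toy_model`, the character budget, and the ledger-sum task are all hypothetical stand-ins. The toy model is a deterministic function that hard-codes the reported behaviour (recompute if the source is present, repeat the stale value if only the conclusion is present, abstain if memory is empty), so it exercises the scoring path and demonstrates nothing about real models.

```python
from typing import Callable, List, Optional, Tuple

# (source items, drifted total carried over from session 1) -- hypothetical task
Case = Tuple[List[int], int]

def exact_match(prediction: Optional[str], truth: str) -> bool:
    """Judge-free scoring: an abstention (None) or any mismatch is a miss."""
    return prediction is not None and prediction.strip() == truth.strip()

def compress_empty(source: List[int], drifted: int, budget: int) -> str:
    return ""  # nothing crosses into session 2

def compress_lossy(source: List[int], drifted: int, budget: int) -> str:
    note = f"total={drifted}"  # keeps the wrong conclusion, drops the source
    assert len(note) <= budget
    return note

def compress_source_first(source: List[int], drifted: int, budget: int) -> str:
    note = "items=" + ",".join(map(str, source))  # keeps the source only
    assert len(note) <= budget, "sketch assumes the source fits the budget"
    return note

def toy_model(memory: str, correction: str) -> Optional[str]:
    """Stand-in for a model; the correction text is unused by this toy."""
    if memory.startswith("items="):
        return str(sum(int(x) for x in memory[len("items="):].split(",")))
    if memory.startswith("total="):
        return memory[len("total="):]  # confident stale answer
    return None  # empty memory: abstain

def reclaim_rate(compress: Callable[[List[int], int, int], str],
                 cases: List[Case], budget: int) -> float:
    """Fraction of cases where the post-correction answer equals ground truth."""
    hits = 0
    for source, drifted in cases:
        memory = compress(source, drifted, budget)
        answer = toy_model(memory, "Please recheck the total.")
        hits += exact_match(answer, str(sum(source)))
    return hits / len(cases)

# Made-up cases for illustration only.
CASES: List[Case] = [([3, 4, 5], 13), ([10, 20], 31)]
```

Calling `reclaim_rate(compress_source_first, CASES, 16)` and `reclaim_rate(compress_lossy, CASES, 16)` on the toy cases separates the two policies. The empty condition also scores zero on exact match, but it does so by abstaining rather than by returning a wrong value, so a fuller harness would likely track abstentions separately.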

If this is right

Chained memory loops turn one dropped-source error into an uncorrectable corruption that grows across downstream steps.
Source-first compression holds error span to a bounded budget horizon while length-matched random compression does not.
Past the point where the source no longer fits the budget, the source-first fix fails unless the note explicitly records that it is incomplete.
The effect appears in real dialogue data and in three different deployed memory systems.
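The compounding claim can be made concrete with a toy running-total chain. The sketch below is an illustration, not the paper's experiment: `run_chain`, the +1 drift, and the ledger task are assumptions, and the budget horizon is not modelled (the source is assumed to fit at every step).

```python
from typing import List

def run_chain(items: List[int], error_step: int, keep_source: bool) -> List[bool]:
    """Return, per step, whether the reported total matches ground truth.

    Toy assumption: at `error_step` the conclusion drifts by +1.
    keep_source=True : memory carries the items; each step recomputes the total.
    keep_source=False: memory carries only the running total (the conclusion).
    """
    correct: List[bool] = []
    carried_total = 0
    kept_items: List[int] = []
    for step, item in enumerate(items):
        if keep_source:
            kept_items.append(item)
            reported = sum(kept_items)      # recomputed from the source
            if step == error_step:
                reported += 1               # the drift happens once...
        else:
            carried_total += item           # builds on the previous conclusion
            if step == error_step:
                carried_total += 1          # ...or is baked into the memory
            reported = carried_total
        correct.append(reported == sum(items[: step + 1]))
    return correct

lossy = run_chain([2, 3, 5, 7, 11], error_step=1, keep_source=False)
source_first = run_chain([2, 3, 5, 7, 11], error_step=1, keep_source=True)
print(lossy)         # [True, False, False, False, False]
print(source_first)  # [True, False, True, True, True]
```

In this toy, the conclusion-only chain stays wrong from the drift onward, while the source-carrying chain is wrong for one step and then recovers.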

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same source-survival requirement would apply to any retrieval-augmented system that must later revise earlier steps.
A completeness flag in memory notes would let downstream steps detect and handle truncated sources before they propagate.
Exact-match scoring against a pre-known answer could be adapted to other tasks where the correct final state is defined in advance.
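One way a completeness flag could look in practice is sketched below. Everything here is hypothetical: the `MemoryNote` class, the `complete` field, the word-count budget, and the "label amount" item format are illustrative choices, not structures described in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MemoryNote:
    items: List[str]
    complete: bool  # hypothetical flag: True only if no source item was dropped

def write_note(source: List[str], budget: int) -> MemoryNote:
    """Keep source items in order until the word budget is exhausted."""
    kept: List[str] = []
    used = 0
    for item in source:
        cost = len(item.split())
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return MemoryNote(items=kept, complete=len(kept) == len(source))

def answer_or_abstain(note: MemoryNote) -> Optional[int]:
    """Recompute only from a complete source; otherwise abstain (None)."""
    if not note.complete:
        return None
    # Each item is assumed to end with an integer amount, e.g. "rent 900".
    return sum(int(item.split()[-1]) for item in note.items)

ledger = ["rent 900", "food 250", "bus 60"]
print(answer_or_abstain(write_note(ledger, budget=6)))  # 1210
print(answer_or_abstain(write_note(ledger, budget=4)))  # None
```

The design choice is that truncation becomes visible to the consumer: a downstream step can abstain or request the full source instead of silently summing a partial ledger.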

Load-bearing premise

The test treats the known answer as unambiguous ground truth and assumes the compression-plus-correction setup isolates source loss without other confounding effects from model capability or prompt wording.

What would settle it

A single model and task combination in which a lossy memory that retains the wrong conclusion but drops the source produces a higher rate of correctable answers than an empty memory under the same compression budget.
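Stated as a check, the falsifier is a scan for any model and task pair where the lossy condition strictly beats the empty one. The helper below is a hypothetical sketch; `find_reversals` and the labels are invented for illustration, and the numbers in the example are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple

def find_reversals(rates: Dict[str, Tuple[float, float]]) -> List[str]:
    """`rates` maps a model/task label to (lossy_rate, empty_rate), both measured
    under the same compression budget. Returns the labels where lossy memory
    strictly beats empty memory, i.e. candidate counterexamples."""
    return sorted(label for label, (lossy, empty) in rates.items() if lossy > empty)

# Placeholder values only.
example = {"model-a/task-1": (0.10, 0.50), "model-b/task-1": (0.60, 0.40)}
print(find_reversals(example))  # ['model-b/task-1']
```

An empty result across all tested pairs is consistent with the claim; a single non-empty entry would be the counterexample described above.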

Figures

Figures reproduced from arXiv: 2606.25449 by Alex Kwon.

**Figure 1.** Compression decides whether an error stays fixable. A model drifts in session 1; only a compressed memory crosses into session 2, at a fixed budget. Under lossy compression the memory keeps the salient wrong conclusion and discards the source, so a later correction has nothing to recompute from, and the model does not abstain, it confidently returns the stale wrong value. Under source-first compression the…

**Figure 2.** The boundary of the source-first law. Directed Reclaim Rate vs. ledger size N at two fixed memory budgets B, n=24/point, 95% bootstrap CI. source-first (solid, llama-3.1-8b) holds while the N-item source fits B, then drops to the budget-matched lossy-padded floor (dashed) the instant any item must be dropped. The cliff moves right with the budget (N=5→14 as B doubles), so the lever is whether the answer-de…

**Figure 3.** Noise crowds the source out of a fixed budget. Directed Reclaim Rate vs. decoy count added to a four-item source at a fixed budget, n=24/point, 95% CI. Naive (positional) source-first (red) decays to the lossy floor as decoys eat the budget; relevance-aware denoised source-first (green) holds flat. The frontier confirm (dotted) coincides with the 8B model: the noise cliff is capability-invariant, because a…
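The captions report 95% bootstrap confidence intervals at n=24 per point. The resampling variant is not stated in the captions, so the sketch below assumes a plain percentile bootstrap over binary reclaim outcomes; `bootstrap_ci` and the example outcomes are hypothetical.

```python
import random
from typing import List, Tuple

def bootstrap_ci(outcomes: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of 0/1 outcomes (assumed variant)."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Made-up outcomes: 18 reclaimed out of 24 trials.
outcomes = [1] * 18 + [0] * 6
low, high = bootstrap_ci(outcomes)
print(f"rate={sum(outcomes) / len(outcomes):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only 24 binary trials per point the interval is wide, which is worth keeping in mind when reading how sharp the cliffs in the figures appear.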

Original abstract

A language model's memory can be worse than having no memory at all. Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an empty memory and it abstains. Across seven models this direction never reverses, a clean kill condition that none breaks. We call this brittle memory: behavioral, not the near-immediate information bound beneath it; only its magnitude is disposition- and task-dependent, not its direction. We measure it with reclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge. Correctability is bottlenecked by whether the answer-determining source survives, not by capability. A one-line source-first policy (keep the recomputable source, drop the re-derivable conclusion) restores correctability at equal budget where that source is compact and identifiable; a length-matched control rules out added text as the cause. The hand-built oracle reaches 1.00; a one-prompt deployable version reclaims 0.49-0.88. The stake compounds: chained through a memory loop, a single dropped-source error corrupts a growing span of downstream steps and stays uncorrectable, while source-first holds to a bounded budget horizon. The wall and fix replicate across three deployed memory systems and on real dialogue (MultiWOZ), and past the budget where the source no longer fits, the fix fails silently unless the note records completeness. This is a controlled study of a mechanism, not a benchmark: judge-free exact scoring, matched-budget controls, and validators built to come out false. We release the harness, conditions, and validators.
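The length-matched control mentioned in the abstract can be pictured as padding the lossy note until it is as long as the source-first note, so that any difference in reclaim rate cannot be attributed to extra text. How the paper builds this control is not described here; the sketch below is an assumption, with `pad_to_length`, the `[pad]` filler, and whitespace word counts standing in for whatever tokenisation the harness uses.

```python
def pad_to_length(note: str, target_words: int, filler: str = "[pad]") -> str:
    """Pad `note` with neutral filler until it has exactly `target_words` words."""
    words = note.split()
    if len(words) > target_words:
        raise ValueError("note already exceeds the target length")
    return " ".join(words + [filler] * (target_words - len(words)))

# Hypothetical notes for the two conditions.
source_first_note = "items 3 4 5 recompute total"
lossy_note = "total is 13"

padded = pad_to_length(lossy_note, len(source_first_note.split()))
print(padded)  # total is 13 [pad] [pad] [pad]
```

The padded note carries the same word count as the source-first note but none of its information, which is the property a length-matched control needs.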

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Lossy memory hurts more than empty memory across their tests, with a source-first policy that helps at matched budget, but the correction step may not fully isolate source survival from model-specific prompt effects.

The paper's main point is that lossy memory in LLMs can produce confidently wrong answers where an empty memory would lead to abstention, and this directional pattern held across the seven models they checked. They label it brittle memory and measure it through reclaim evaluation: fixed-budget compression of a drifted interaction followed by a correction test scored by exact match to a known answer.

What is new is the consistent finding on direction plus the source-first policy that keeps the recomputable source and drops the conclusion. At equal budget this improves recovery rates, with their one-prompt version hitting 0.49-0.88 and the oracle at 1.00. They add length-matched controls, test on MultiWOZ and three deployed systems, and release the harness with validators built to fail. The chained-error argument for long-running loops is also laid out clearly.

The work is solid on the controls and the judge-free exact scoring, which sidesteps some usual evaluation noise. The release of code and conditions makes it straightforward to check.

The soft spot is whether the correction prompt truly isolates the source-loss effect. Different models could respond unevenly to that prompt even when the source is retained, which might produce the uniform direction as an artifact of the shared harness rather than a general behavioral rule. The assumption that the known answer is unambiguous ground truth also needs scrutiny in less scripted tasks.

This is for researchers building memory systems for agents and ongoing dialogue. It is a focused mechanism study with reproducible elements, so it deserves a serious referee to examine the methods and prompt details.

Referee Report

2 major / 2 minor

Summary. The paper claims that lossy memory in LLMs (retaining a drifted conclusion while dropping its source) produces confident incorrect answers, whereas empty memory leads to abstention. This 'brittle memory' effect holds in the same direction across seven models. Reclaim evaluation measures it via fixed-budget compression of drifted interactions followed by a correction test, scored by exact match to known ground truth with no judge. A source-first policy (keep recomputable source, drop re-derivable conclusion) restores correctability at matched budget; an oracle reaches 1.00 and a one-prompt version reaches 0.49-0.88. The effect and fix replicate on three deployed memory systems and MultiWOZ; a length-matched control rules out added text. The harness, conditions, and validators are released.

Significance. If the central claim holds, the work identifies a concrete mechanism by which partial memory can actively degrade performance below the no-memory baseline in memory-augmented systems, with a simple, deployable mitigation. The release of the evaluation harness, conditions, and validators is a clear strength, enabling direct reproduction and extension. The judge-free exact-match scoring and matched-budget controls further support falsifiability of the proposed mechanism.

major comments (2)

[Abstract] The headline claim that 'this direction never reverses' across seven models is load-bearing for the brittle-memory thesis, yet the design does not report a control that holds source presence fixed while varying only the correction prompt or model-family prompt handling; without this, uniformity could reflect harness artifacts rather than a general behavioral property.
[Abstract] Reclaim evaluation description: the fixed-budget compression plus correction test is asserted to isolate whether the answer-determining source survives, but no quantitative check is described that the correction prompt's effectiveness is independent of model capability or disposition when source presence is controlled; this assumption underpins the 'correctability is bottlenecked by source survival' conclusion.

minor comments (2)

[Abstract] The seven models and three deployed memory systems are not named; listing them would improve reproducibility.
[Abstract] 'Brittle memory' is defined behaviorally, but the abstract could usefully contrast it with existing terms such as 'hallucination' or 'catastrophic forgetting' to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract claims. We address each point below and will revise the manuscript to incorporate additional controls and clarifications as indicated.

Referee: [Abstract] The headline claim that 'this direction never reverses' across seven models is load-bearing for the brittle-memory thesis, yet the design does not report a control that holds source presence fixed while varying only the correction prompt or model-family prompt handling; without this, uniformity could reflect harness artifacts rather than a general behavioral property.

Authors: We agree that an explicit control holding source presence fixed while varying only the correction prompt (or model-family prompt handling) would more rigorously exclude the possibility of harness artifacts. The current experiments vary models across families but apply a fixed correction prompt. While the consistency of direction across architecturally diverse models provides supporting evidence that the effect is not prompt- or harness-specific, we acknowledge the gap. In revision we will add a dedicated control experiment (and corresponding discussion) that fixes source presence and systematically varies prompt elements and model families. Revision: yes.

Referee: [Abstract] Reclaim evaluation description: the fixed-budget compression plus correction test is asserted to isolate whether the answer-determining source survives, but no quantitative check is described that the correction prompt's effectiveness is independent of model capability or disposition when source presence is controlled; this assumption underpins the 'correctability is bottlenecked by source survival' conclusion.

Authors: We agree that a direct quantitative check of correction-prompt effectiveness (with source presence controlled) would strengthen the isolation of source survival as the bottleneck. The manuscript reports that an oracle providing the source reaches 1.00 and that the source-first policy improves correctability at matched budget, but does not include the requested cross-model measurement with source fixed. In the revision we will add this quantitative check, reporting correction success rates across the seven models when source presence is assured. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical measurements with released controls

The paper reports direct experimental measurements of model behavior under fixed-budget compression and correction prompts, using exact-match scoring against a known ground-truth answer, length-matched controls, and a released harness. No equations, fitted parameters, or derivations are presented that reduce by construction to their own inputs. Claims such as 'direction never reverses across seven models' are stated as observed outcomes of the evaluation protocol rather than predictions derived from prior fits or self-citations. The design is self-contained against external benchmarks and does not rely on load-bearing self-citation chains or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The evaluation relies on an implicit assumption that ground-truth answers exist and are stable, but none are formalized.
