CALMem : Application-Layer Dual Memory for Conversational AI

Rajan Padmanabhan; Rajendra Narayan Jena; Sankar Arumugam

arxiv: 2605.20724 · v1 · pith:FMICKMJ4new · submitted 2026-05-20 · 💻 cs.IR

CALMem : Application-Layer Dual Memory for Conversational AI

Rajendra Narayan Jena , Rajan Padmanabhan , Sankar Arumugam This is my paper

Pith reviewed 2026-05-21 02:54 UTC · model grok-4.3

classification 💻 cs.IR

keywords conversational AILLM context managementmemory architectureepisodic memorysemantic memoryretrieval augmented generationapplication layer

0 comments

The pith

CALMem layers dual memories on top of any LLM to deliver virtually unlimited effective conversation context without model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose conversational continuity once their fixed context window fills or a session ends, because compaction throws away history and resets erase everything. The paper shows how an application-layer system can keep that history available by maintaining two separate memory stores: one that embeds recent turns as vectors for quick lookup and another that holds agent-written facts in structured form. A special injection step called MOIM then decides how much of this stored material to bring back each turn, scaling the amount according to remaining token space and pulling in even previously compacted turns from the same session. This approach works with unmodified models from any provider and adds no cost when switched off, so the same assistant can handle far longer interactions while staying compatible with existing infrastructure.

Core claim

The paper claims that an application-layer dual memory architecture, built from episodic memory using sliding-window vector embeddings of conversation history plus semantic memory of agent-writable structured facts, together with a token-budget-adaptive MOIM injection mechanism, supplies LLM-based conversational assistants with virtually unbounded effective context while remaining fully compatible with unmodified models and preserving access to intra-session compacted turns.

What carries the argument

The MOIM adaptive injection mechanism that retrieves relevant entries from both episodic and semantic memory layers and inserts them into the prompt, scaling injection depth inversely with current context pressure.

If this is right

Conversations can continue across separate sessions without complete memory loss or manual re-summarization.
Existing LLMs gain longer effective context length without any weight changes, fine-tuning, or provider-specific features.
Turns that were compacted inside the current session remain retrievable rather than permanently discarded.
The architecture works with any LLM provider and reverts to the original model behavior with zero added cost when disabled.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Persistent user-specific facts stored in the semantic layer could support assistants that remember preferences across weeks or months.
The same retrieval approach might be extended to cross-session search so that earlier conversations become reusable knowledge.
Teams could test alternative embedding models or injection heuristics on top of this layer without touching the base LLM.

Load-bearing premise

Vector embeddings of conversation history and the MOIM injection step will reliably fetch and insert the right past context without adding noise, hallucinations, or breaks in conversational flow.

What would settle it

Extended testing on long multi-session dialogues that shows repeated injection of irrelevant or incorrect past material producing incoherent or hallucinated replies.

Figures

Figures reproduced from arXiv: 2605.20724 by Rajan Padmanabhan, Rajendra Narayan Jena, Sankar Arumugam.

**Figure 2.** Figure 2: Indexing pipeline. The message is committed [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Sliding-window chunking (size=1000, over [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: MOIM token budget tiers. Injection depth [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Large language models (LLMs) operate within fixed context windows that fundamentally limit conversational continuity. When context fills, compaction discards history irreversibly; when sessions end, all memory resets to zero. Existing solutions-larger context windows, retrieval-augmented generation for knowledge bases, and memory-augmented architectures such as MemGPT-either require model modification, impose provider lock-in, or do not address the compaction continuity problem. We present CALMem (Conversational Application-Layer Memory), an application-layer dual memory architecture that gives LLM-based conversational assistants virtually unbounded effective context without any modification to the underlying model. CALMem combines two complementary memory subsystems: an episodic memory layer built on sliding-window vector embeddings of conversation history, and a semantic memory layer of agent-writable structured facts. A token-budget-adaptive injection mechanism, called the MOIM (Message of Injected Memory), automatically retrieves and injects relevant past context each turn, scaling injection depth inversely with context pressure. A key contribution is intra-session retrieval: compacted away turns from the current session remain searchable, closing a gap unaddressed by prior work. The system is implemented as a pure application layer in a production Rust codebase, is provider-agnostic, and degrades to original LLM behaviour with zero overhead when disabled. We describe the architecture, design decisions, and performance characteristics, and analyse the trade-offs that guided each implementation choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CALMem, an application-layer dual-memory architecture for LLM-based conversational agents. It combines an episodic memory subsystem (sliding-window vector embeddings of conversation history) with a semantic memory layer (agent-writable structured facts) and introduces the MOIM token-budget-adaptive injection mechanism to retrieve and insert relevant past context, including intra-session compacted turns. The system is claimed to deliver virtually unbounded effective context without any changes to the underlying model, is implemented as a provider-agnostic Rust library, and degrades gracefully to baseline LLM behavior when disabled. The manuscript describes the architecture, design decisions, performance characteristics, and trade-offs.

Significance. If the retrieval and injection components function reliably, the work offers a practical, non-intrusive engineering solution to context-window and compaction limitations in production conversational systems. The emphasis on intra-session retrieval and the pure application-layer implementation are notable strengths that distinguish it from model-modification or provider-specific approaches. However, the absence of any quantitative evaluation of retrieval accuracy, injection coherence, or continuity preservation substantially weakens the ability to assess whether the central claim holds in practice.

major comments (2)

[§5] §5 (Performance characteristics and trade-off analysis): The manuscript states that it describes performance characteristics and analyses trade-offs, yet reports no precision/recall figures, ablation results, or coherence measurements for the episodic-memory retrieval of compacted intra-session turns or for MOIM injection under varying token budgets. Because the central claim of reliable conversational continuity rests on these mechanisms surfacing relevant context without noise or loss, the lack of empirical validation is load-bearing.
[§3.2] §3.2 (Episodic memory layer): The description of sliding-window vector embeddings for intra-session retrieval does not address embedding drift after compaction or the similarity threshold and ranking method used by MOIM. Without these details or supporting measurements, it is impossible to evaluate whether the architecture actually closes the compaction-continuity gap claimed in the abstract.

minor comments (2)

[Abstract] The abstract and introduction repeatedly use the phrase 'virtually unbounded effective context' without defining what effective context length is achieved or how it is measured; a short clarifying sentence would improve precision.
Figure captions for the dual-memory and MOIM flow diagrams (if present) should explicitly label the token-budget scaling behavior and the intra-session retrieval path.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional technical details and preliminary quantitative results where the comments identify gaps.

read point-by-point responses

Referee: [§5] §5 (Performance characteristics and trade-off analysis): The manuscript states that it describes performance characteristics and analyses trade-offs, yet reports no precision/recall figures, ablation results, or coherence measurements for the episodic-memory retrieval of compacted intra-session turns or for MOIM injection under varying token budgets. Because the central claim of reliable conversational continuity rests on these mechanisms surfacing relevant context without noise or loss, the lack of empirical validation is load-bearing.

Authors: We agree that the absence of quantitative metrics for retrieval accuracy and injection coherence limits the strength of the central claims. The manuscript's §5 focuses on architectural performance characteristics (latency, memory overhead, graceful degradation) and qualitative trade-offs observed in the Rust implementation. To address the referee's concern, we have added a new subsection to §5 reporting preliminary internal evaluation results: precision/recall for episodic retrieval on compacted intra-session turns (using a set of 50 multi-turn dialogue traces), coherence ratings for MOIM-injected context under varying token budgets, and a limited ablation on the adaptive injection component. These results are presented with explicit caveats regarding their preliminary nature and the need for larger-scale user studies. We believe this revision directly responds to the load-bearing issue while preserving the paper's primary contribution as an application-layer engineering solution. revision: yes
Referee: [§3.2] §3.2 (Episodic memory layer): The description of sliding-window vector embeddings for intra-session retrieval does not address embedding drift after compaction or the similarity threshold and ranking method used by MOIM. Without these details or supporting measurements, it is impossible to evaluate whether the architecture actually closes the compaction-continuity gap claimed in the abstract.

Authors: We thank the referee for identifying this lack of implementation specificity. The original §3.2 provided a high-level overview of the sliding-window embedding approach. In the revised version we have expanded the section to include: (i) a discussion of embedding drift after compaction and our mitigation via periodic re-embedding of compacted turns; (ii) the concrete similarity threshold (cosine similarity of 0.75) and the MOIM ranking procedure (semantic similarity combined with recency bias and remaining token budget); and (iii) supporting measurements from our development test harness showing retrieval hit rates before and after compaction. These additions enable readers to assess how the intra-session retrieval mechanism addresses the compaction-continuity gap. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering architecture without derivation chain

full rationale

The paper presents CALMem as a pure application-layer architecture combining episodic memory via sliding-window vector embeddings and semantic memory with MOIM injection. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. Claims of virtually unbounded context are framed as outcomes of the described design choices and implementation trade-offs rather than results reduced to inputs by construction. No self-citations load-bear any uniqueness theorem or ansatz, and the work remains self-contained as an engineering description with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5786 in / 1158 out tokens · 56526 ms · 2026-05-21T02:54:30.498787+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CALMem combines two complementary memory subsystems: an episodic memory layer built on sliding-window vector embeddings of conversation history, and a semantic memory layer of agent-writable structured facts. A token-budget-adaptive injection mechanism, called the MOIM...
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

intra-session retrieval: compacted-away turns from the current session remain searchable

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 11 internal anchors

[1]

Longformer: The Long-Document Transformer

Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever

work page internal anchor Pith review Pith/arXiv arXiv 2004
[2]

Generating Long Sequences with Sparse Transformers

Generating long se- quences with sparse transformers.arXiv preprint arXiv:1904.10509. Howard Chen et al

work page internal anchor Pith review Pith/arXiv arXiv 1904
[3]

Gordon V

Walking down the mem- ory maze: Beyond context limit through interactive reading.arXiv preprint arXiv:2310.05029. Gordon V . Cormack, Charles L. A. Clarke, and Stefan Buettcher

work page arXiv
[4]

From local to global: A Graph RAG approach to query-focused summariza- tion.arXiv preprint arXiv:2404.16130. Qdrant

work page internal anchor Pith review Pith/arXiv arXiv
[5]

J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y

HippoRAG: Neu- robiologically inspired long-term memory for large language models.arXiv preprint arXiv:2405.14831. Vladimir Karpukhin et al

work page arXiv
[6]

MTEB: Massive Text Embedding Benchmark

MTEB: Mas- sive text embedding benchmark.arXiv preprint arXiv:2210.07316. Rodrigo Nogueira and Kyunghyun Cho

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Pas- sage re-ranking with BERT.arXiv preprint arXiv:1901.04085. OpenAI

work page internal anchor Pith review Pith/arXiv arXiv 1901
[8]

MemGPT: Towards LLMs as Operating Systems

MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560. Joon Sung Park et al

work page internal anchor Pith review Pith/arXiv arXiv
[9]

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR: Recursive abstrac- tive processing for tree-organized retrieval.arXiv preprint arXiv:2401.18059. Theodore R. Sumers et al

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Cognitive Architectures for Language Agents

Cognitive ar- chitectures for language agents.arXiv preprint arXiv:2309.02427. Nandan Thakur et al

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Voyager: An Open-Ended Embodied Agent with Large Language Models

V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291. Jeff Wu et al

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Recursively Summarizing Books with Human Feedback

Recursively summarizing books with human feedback.arXiv preprint arXiv:2109.10862. Hao Xu et al

work page internal anchor Pith review Pith/arXiv arXiv
[13]

MemoryBank: Enhancing Large Language Models with Long-Term Memory

MemoryBank: Enhanc- ing large language models with long-term memory. arXiv preprint arXiv:2305.10250

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Longformer: The Long-Document Transformer

Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever

work page internal anchor Pith review Pith/arXiv arXiv 2004

[2] [2]

Generating Long Sequences with Sparse Transformers

Generating long se- quences with sparse transformers.arXiv preprint arXiv:1904.10509. Howard Chen et al

work page internal anchor Pith review Pith/arXiv arXiv 1904

[3] [3]

Gordon V

Walking down the mem- ory maze: Beyond context limit through interactive reading.arXiv preprint arXiv:2310.05029. Gordon V . Cormack, Charles L. A. Clarke, and Stefan Buettcher

work page arXiv

[4] [4]

From local to global: A Graph RAG approach to query-focused summariza- tion.arXiv preprint arXiv:2404.16130. Qdrant

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y

HippoRAG: Neu- robiologically inspired long-term memory for large language models.arXiv preprint arXiv:2405.14831. Vladimir Karpukhin et al

work page arXiv

[6] [6]

MTEB: Massive Text Embedding Benchmark

MTEB: Mas- sive text embedding benchmark.arXiv preprint arXiv:2210.07316. Rodrigo Nogueira and Kyunghyun Cho

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Pas- sage re-ranking with BERT.arXiv preprint arXiv:1901.04085. OpenAI

work page internal anchor Pith review Pith/arXiv arXiv 1901

[8] [8]

MemGPT: Towards LLMs as Operating Systems

MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560. Joon Sung Park et al

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR: Recursive abstrac- tive processing for tree-organized retrieval.arXiv preprint arXiv:2401.18059. Theodore R. Sumers et al

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Cognitive Architectures for Language Agents

Cognitive ar- chitectures for language agents.arXiv preprint arXiv:2309.02427. Nandan Thakur et al

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Voyager: An Open-Ended Embodied Agent with Large Language Models

V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291. Jeff Wu et al

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Recursively Summarizing Books with Human Feedback

Recursively summarizing books with human feedback.arXiv preprint arXiv:2109.10862. Hao Xu et al

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

MemoryBank: Enhancing Large Language Models with Long-Term Memory

MemoryBank: Enhanc- ing large language models with long-term memory. arXiv preprint arXiv:2305.10250

work page internal anchor Pith review Pith/arXiv arXiv