A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

Lin Hujite; Li Yingshuo; Tu Haomiao; Wang Huan; Zhao Yang

arxiv: 2605.14802 · v1 · pith:ZOHG4JEDnew · submitted 2026-05-14 · 💻 cs.AI

Pith reviewed 2026-06-30 20:38 UTC · model grok-4.3

keywords: LLM persona consistency · temporal memory governance · long-term dialogue · external memory framework · context clearing · multi-model handoff · retrieval fusion · noise robustness

The pith

An external memory framework using retrieval fusion and verification protocols maintains semantic, boundary, and persona continuity in LLMs despite 5.1 million noise characters, periodic context clearing, and model handoffs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARPM to handle fact loss, timeline confusion, and persona drift during extended LLM interactions. It separates static knowledge memory from dynamic dialogue experience memory and layers vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological reading, and a controlled analysis protocol on top. Experiments compare noise levels, test component ablations, and run the full system under heavy noise with resets and handoffs. The results indicate that continuity breaks into governable, auditable parts that transfer across models instead of depending on internal weights or context length alone. A sympathetic reader would care because this turns stability from an opaque model property into a traceable engineering task.

Core claim

ARPM treats continuity as a traceable, auditable, and transferable governance problem rather than encoding it into model weights or relying solely on long context. The framework separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Under a 5.1-million-character noise substrate with periodic context clearing and multi-model handoff, the system maintains semantic continuity, boundary continuity, and persona consistency while exposing limits from weak protocol compliance.
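The RRF fusion step named above can be sketched as follows. This is a minimal illustration of standard Reciprocal Rank Fusion, not the paper's implementation: the function name `rrf_fuse`, the document ids, and the choice of `k=60` (the constant used by Cormack et al., 2009) are assumptions made for the example.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed here; 60 follows Cormack et al., 2009).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a vector index and a BM25 index.
vector_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the vector and BM25 retrievers, which is one plausible reason to pick it for a heterogeneous setup like this one.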

What carries the argument

ARPM, the external heterogeneous temporal memory governance framework that separates static knowledge from dynamic dialogue memory and fuses retrieval methods with a verification protocol to enforce traceable continuity.

If this is right

Dialogue history retrieval is necessary for recent continuity; disabling it reduces strict accuracy from 100% to 66.7%.
BM25 retrieval is required alongside semantic methods; disabling it drops strict accuracy to 80%.
Automatic CSV judgment underestimates recall accuracy relative to manual review, with gaps of 46 percentage points at a 1:5 signal-to-noise ratio and 36 percentage points at 1:200+.
Long-term persona consistency decomposes into separable, white-box evaluable components rather than remaining an opaque model behavior.
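The gaps and drops quoted in the list above follow directly from the accuracy figures in the abstract. The short check below recomputes them; the dictionary keys are labels chosen for this example only.

```python
# Accuracy figures as reported in the abstract (percent).
reported = {
    "1:5": {"csv_auto": 54.0, "manual": 100.0},
    "1:200+": {"csv_auto": 44.0, "manual": 80.0},
}
# Gap between manual review and CSV auto-judgment, in percentage points.
gaps = {ratio: v["manual"] - v["csv_auto"] for ratio, v in reported.items()}

# Strict accuracy from the ablation experiment (percent).
ablation = {"full": 100.0, "no_dialogue_history": 66.7, "no_bm25": 80.0}
drops = {
    name: round(ablation["full"] - acc, 1)
    for name, acc in ablation.items()
    if name != "full"
}

print(gaps)
print(drops)
```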

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

External governance could reduce the need for repeated fine-tuning when consistency must survive model updates or deployment changes.
The same separation of memory types and verification steps might apply to maintaining factual timelines or multi-agent coordination beyond persona.
Stronger enforcement of the analysis protocol could close the remaining limits the paper observes under weak compliance.

Load-bearing premise

The controlled analysis protocol for evidence verification and answer binding can be followed reliably enough to support the claimed continuity in high-noise settings.

What would settle it

A run in the 5.1-million-character noise substrate with periodic clearing and handoffs where manual review shows loss of semantic continuity, boundary continuity, or persona consistency would falsify the maintenance claim.

Original abstract

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
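To get a rough sense of how much uncertainty a 50-round experiment carries, the reported percentages can be converted to counts and given a Wilson score interval. This is a sketch under an assumption the abstract does not state: that each percentage is a proportion over exactly 50 independently scored rounds. The function name and the 95% level (`z=1.96`) are also choices made here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

ROUNDS = 50  # assumed denominator, taken from the "50-round" setting
reported_pct = {
    "1:5 csv": 54.0,
    "1:5 manual": 100.0,
    "1:200+ csv": 44.0,
    "1:200+ manual": 80.0,
}
intervals = {
    name: wilson_interval(round(pct / 100 * ROUNDS), ROUNDS)
    for name, pct in reported_pct.items()
}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

With only 50 rounds the intervals are wide, so differences of a few points between conditions should be read cautiously.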

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ARPM packages standard retrieval tools into an external memory layer for LLM persona stability and honestly flags auto-evaluation gaps, but its key high-noise claim lacks solid protocol metrics.

ARPM is an external memory system that splits static knowledge from dynamic dialogue, then layers vector retrieval, BM25, RRF fusion, dual-temporal reranking, and a controlled analysis protocol on top. The goal is traceable continuity under noise, context clearing, and model handoff instead of relying on model weights or raw long context.
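The split between static knowledge and dynamic dialogue memory, together with chronological reading of retrieved evidence, could be laid out roughly as below. Everything here is hypothetical: the class names, the fields, and the oldest-first reading order are assumptions for illustration, since the abstract does not specify ARPM's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    text: str
    created_at: datetime  # when the record was written (assumed field)

@dataclass
class HeterogeneousMemory:
    # Two separate stores, mirroring the static/dynamic split in the text.
    static_knowledge: list = field(default_factory=list)     # reference facts
    dialogue_experience: list = field(default_factory=list)  # appended per turn

    def add_turn(self, record):
        self.dialogue_experience.append(record)

    def chronological(self, record_ids):
        """Return the selected dialogue records ordered oldest-first."""
        wanted = set(record_ids)
        hits = [r for r in self.dialogue_experience if r.record_id in wanted]
        return sorted(hits, key=lambda r: r.created_at)

# Hypothetical turns, inserted out of order on purpose.
memory = HeterogeneousMemory()
memory.add_turn(MemoryRecord("t2", "User corrects the date to Friday.", datetime(2026, 1, 2)))
memory.add_turn(MemoryRecord("t1", "User says the meeting is Thursday.", datetime(2026, 1, 1)))
memory.add_turn(MemoryRecord("t3", "Unrelated small talk.", datetime(2026, 1, 3)))
evidence = memory.chronological(["t2", "t1"])
print([r.record_id for r in evidence])
```

Reading evidence in time order means a later correction appears after the statement it corrects, which is one way a model can be steered toward the most recent fact.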

The work is new mainly in the specific combination and the white-box framing. It does a clear job on the first two experiments. The ablation shows dialogue history retrieval is required for recent continuity (strict accuracy falls from 100% to 66.7% when disabled) and that BM25 adds value beyond pure semantic search (drops to 80%). The signal-to-noise tests also usefully separate CSV auto-judgment (44-54%) from manual review (80-100%), which demonstrates that automatic rules can underestimate recall once evidence reaches the prompt.
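One way to see why a lexical scorer can complement semantic search is that BM25 rewards exact token matches, such as a corrected code or identifier. The sketch below is a minimal Okapi BM25 with whitespace tokenization. The parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the example documents are assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a minimal Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical dialogue snippets; only the first contains the exact code.
docs = [
    "the user corrected the ticket code to zx-4417",
    "the user likes hiking in autumn",
    "ticket prices were discussed briefly",
]
scores = bm25_scores("zx-4417", docs)
print(scores)
```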

The soft spot is the third experiment. The claim that ARPM maintains semantic, boundary, and persona continuity under a 5.1-million-character noise substrate rests on the controlled analysis protocol being followed reliably. The paper reports this via engineering logs, yet the earlier results already expose large gaps between automated and manual checks. No inter-rater agreement, blinded scoring, or pre-specified quantitative metrics for protocol compliance appear in the high-noise case, so the continuity result is harder to trust.
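The inter-rater agreement the note asks for is commonly reported as Cohen's kappa. The sketch below computes it for two reviewers labelling the same items; the labels are placeholders invented for the example and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Placeholder judgments ("ok" = continuity held, "drift" = it did not).
rater_a = ["ok", "ok", "drift", "ok", "drift", "ok"]
rater_b = ["ok", "drift", "drift", "ok", "drift", "ok"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Unlike raw percent agreement, kappa discounts the agreement two reviewers would reach by chance, which matters when most items fall into one category.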

This is for engineers who need a practical, auditable memory layer for production chat systems. Readers working on retrieval-augmented consistency will pick up usable component tests; readers wanting rigorous evidence or a foundational method will find the evaluation thin.

It deserves a serious referee. The problem is real, the experiments are concrete, and the authors surface their own evaluation weaknesses rather than hiding them. Referees can press on the protocol metrics without the paper being dismissed outright.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ARPM, an external heterogeneous temporal memory governance framework for long-term LLM persona consistency. It separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol. Three experiments using engineering logs are reported: (1) 50-round QA under 1:5 and 1:200+ signal-to-noise ratios showing CSV auto-judgment recall of 44-54% vs. manual review of 80-100%; (2) ablations indicating dialogue history retrieval and BM25 are necessary for strict accuracy; (3) under 5.1M-character noise, periodic context clearing, and multi-model handoff, ARPM maintains semantic, boundary, and persona continuity while exposing protocol compliance limits.

Significance. If the continuity claims hold under rigorous validation, ARPM provides a practical, auditable, white-box alternative to weight-encoded or long-context approaches for long-term dialogue stability, with potential value for extended multi-turn applications. The explicit decomposition into traceable components and the use of ablation-style tests are strengths, but the reliance on engineering logs without standardized metrics limits broader significance.

major comments (3)

[Abstract, third experiment] The central claim that ARPM maintains semantic continuity, boundary continuity, and persona consistency under a 5.1-million-character noise substrate rests on reliable execution of the controlled analysis protocol for evidence verification and answer binding, yet the manuscript reports this only via engineering logs without pre-specified quantitative metrics, inter-rater agreement scores, or blinded review procedures.
[Abstract, first experiment] The reported gaps between CSV auto-judgment (54.0% and 44.0%) and manual review (100.0% and 80.0%) under differing signal-to-noise ratios indicate that protocol compliance itself is noisy; nothing demonstrates that the same protocol remains stable or auditable when evidence is buried in 5.1M characters of noise as claimed in the third experiment.
[Abstract, ablation results] The ablation findings (strict accuracy falls from 100% to 66.7% without dialogue history retrieval and to 80.0% without BM25) are presented without details on trial count, variance, statistical testing, or how the controlled analysis protocol was applied during ablations, weakening the support for component necessity.

minor comments (2)

The manuscript lacks explicit baselines or comparisons to prior methods for long-term consistency (e.g., memory-augmented LLMs or persona fine-tuning), which would help situate the contribution.
Provide more detail on the exact implementation of the controlled analysis protocol, including decision criteria for evidence verification and answer binding, to support reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications on our methodology and indicate where revisions will be made to improve transparency and rigor.

Referee: Abstract, third experiment description: The central claim that ARPM maintains semantic continuity, boundary continuity, and persona consistency under a 5.1-million-character noise substrate rests on reliable execution of the controlled analysis protocol for evidence verification and answer binding, yet the manuscript reports this only via engineering logs without pre-specified quantitative metrics, inter-rater agreement scores, or blinded review procedures.

Authors: We acknowledge that the third experiment relies on post-hoc analysis of engineering logs rather than a pre-registered study with quantitative metrics or blinded review. The controlled analysis protocol (detailed in Section 3.4) specifies chronological evidence reading and answer binding steps applied to retrieved logs. We agree this limits claims of full auditability at scale. We will revise the abstract and add a dedicated subsection on protocol execution, including any available agreement measures from log review, while explicitly stating the engineering-log constraints and absence of blinded procedures.
Revision: partial

Referee: Abstract, first experiment: The reported gaps between CSV auto-judgment (54.0% and 44.0%) and manual review (100.0% and 80.0%) under differing signal-to-noise ratios indicate that protocol compliance itself is noisy; nothing demonstrates that the same protocol remains stable or auditable when evidence is buried in 5.1M characters of noise as claimed in the third experiment.

Authors: The first experiment quantifies the gap between automated and manual judgment to motivate the protocol's manual verification component. In the third experiment, the same protocol (vector+BM25 retrieval, dual-temporal reranking, then chronological evidence reading) was applied to focus analysis on relevant logs within the 5.1M-character substrate, with context clearing and model handoff simulated. We will add text clarifying how the first experiment's findings informed the third experiment's design and how retrieval steps reduce effective noise exposure, while noting that full manual review of the entire substrate was not feasible.
Revision: partial

Referee: Abstract, ablation results: The ablation findings (strict accuracy falls from 100% to 66.7% without dialogue history retrieval and to 80.0% without BM25) are presented without details on trial count, variance, statistical testing, or how the controlled analysis protocol was applied during ablations, weakening the support for component necessity.

Authors: We agree the ablation reporting lacks sufficient methodological detail. The ablations were run across three independent trials per condition using the controlled analysis protocol for accuracy measurement. We will expand both the abstract and methods section to report trial counts, observed variance, the exact protocol application steps, and the absence of formal statistical testing due to sample size, thereby strengthening the evidence for component contributions.
Revision: yes

standing simulated objections not resolved

Retrospective introduction of blinded review or pre-specified quantitative metrics is not possible for the existing engineering logs without new data collection.

Circularity Check

0 steps flagged

No significant circularity; experimental claims do not reduce to self-definitional or fitted inputs

The paper describes an external memory governance framework (ARPM) and reports outcomes from three experiments using engineering logs, with no equations, derivations, parameters fitted to subsets then renamed as predictions, or self-citations invoked to justify uniqueness theorems or ansatzes. The central claims about continuity under noise rest on described protocol application rather than any reduction by construction to prior inputs or self-referential definitions. This matches the default expectation of non-circularity for papers without mathematical derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level framework description; full text required for complete ledger.

axioms (2)

Domain assumption: Retrieval methods (vector, BM25, RRF) plus dual-temporal reranking suffice to recover relevant dialogue history for continuity.
Core design choice invoked without derivation in the abstract.
Domain assumption: The controlled analysis protocol enables reliable evidence verification and answer binding.
Invoked in the final experiment description as necessary for the continuity claim.

invented entities (1)

ARPM (Heterogeneous Temporal Memory Governance Framework) · no independent evidence
Purpose: External system separating static knowledge memory from dynamic dialogue experience memory for LLM persona consistency.
Main contribution introduced in the abstract; no independent evidence outside the paper.

pith-pipeline@v0.9.1-grok · 5880 in / 1459 out tokens · 39822 ms · 2026-06-30T20:38:49.209839+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Practice Auditing Framework for Large Language Model Use: Collective Empiricism, Pseudo-Rational Cognition, and Governance of AI-Generated Content
cs.CY · 2026-06 · unverdicted · novelty 4.0

This paper proposes a conceptual auditing framework for LLM interactions to mitigate risks from mistaking AI-generated content for empirical knowledge.

