Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

Eunseo Jung; Jaehyeok Kim; Jiachuan Wang; Ka Chun Cheung; Lei Chen; Simon See; Yeongseo Jung; Yongqi Zhang

arxiv: 2606.12411 · v1 · pith:7PTOXIPInew · submitted 2026-06-10 · 💻 cs.CL · cs.LG

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

Pith reviewed 2026-06-27 09:49 UTC · model grok-4.3

keywords: context compression · incremental compression · multi-turn dialogue · dialogue memory · conversational agents · context management · long-context modeling

The pith

Context-Driven Incremental Compression keeps dialogue models stable by revising per-thread states in a shared memory at each turn.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conversational models face rising compute and accuracy costs as dialogue history lengthens, since full context must be re-encoded each turn. Truncation or static summarization loses fidelity while prior compressors cannot share or correct information across turns. The paper introduces Context-Driven Incremental Compression, which decomposes conversations into interleaved threads and maintains revisable compression states inside one compact dialogue memory. A lightweight retrieve-revise-write-back loop updates those states turn by turn, and an adapted truncated backpropagation-through-time trains cross-turn links without full-history gradients. On long-form benchmarks the method holds both latency and perplexity steady across hundreds of turns.
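The cost argument can be made concrete with a back-of-the-envelope count. The sketch below assumes every turn adds the same number of tokens and that a compressed memory has a fixed size; `tokens_per_turn` and `memory_tokens` are hypothetical parameters chosen for illustration, not values from the paper, and the count ignores the superlinear cost of attention itself.

```python
def full_history_tokens(num_turns: int, tokens_per_turn: int) -> int:
    """Total tokens encoded when turn t re-encodes all t turns seen so far."""
    return sum(t * tokens_per_turn for t in range(1, num_turns + 1))


def bounded_memory_tokens(num_turns: int, tokens_per_turn: int, memory_tokens: int) -> int:
    """Total tokens encoded when each turn reads a fixed-size memory plus the new turn."""
    return num_turns * (memory_tokens + tokens_per_turn)


if __name__ == "__main__":
    # Illustrative numbers only: 400 turns, 50 tokens per turn, 200 memory tokens.
    print(full_history_tokens(400, 50))         # grows quadratically with turn count
    print(bounded_memory_tokens(400, 50, 200))  # grows linearly with turn count
```

Under these assumptions the full-history total grows quadratically in the number of turns, while the bounded-memory total grows linearly, which is the gap the paper's approach targets.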

Core claim

C-DIC treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single compact dialogue memory. At each turn a retrieve-revise-write-back loop shares information across turns and updates stale memories, while an adaptation of truncated backpropagation-through-time learns cross-turn dependencies without full-history backpropagation.
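The retrieve-revise-write-back loop can be sketched with a toy memory. This is an illustration of the control flow only, not the paper's implementation: the bag-of-words `embed`, the cosine scoring, the count-merging "revision", and the `new_thread_threshold` rule for opening a new thread are all assumptions standing in for the learned compressor and retrieval described in the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DialogueMemory:
    """Holds one revisable state per thread (hypothetical structure)."""

    def __init__(self, new_thread_threshold: float = 0.2):
        self.threads = []  # one Counter per thread
        self.new_thread_threshold = new_thread_threshold

    def step(self, turn: str) -> int:
        """Run one retrieve -> revise -> write-back cycle; return the thread index used."""
        query = embed(turn)
        # Retrieve: score every stored thread state against the new turn.
        scores = [cosine(query, state) for state in self.threads]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < self.new_thread_threshold:
            # Assumed behaviour: open a new thread when nothing matches well.
            self.threads.append(query)
            return len(self.threads) - 1
        # Revise: fold the new turn into the retrieved state.
        revised = self.threads[best] + query
        # Write back: overwrite the stale state, so the number of threads is unchanged.
        self.threads[best] = revised
        return best


if __name__ == "__main__":
    memory = DialogueMemory()
    for turn in ["planning a dinner party menu",
                 "promote my writing services on instagram",
                 "another dinner party next week"]:
        print(memory.step(turn), len(memory.threads))
```

The point of the sketch is that the third turn updates the existing dinner-party thread in place rather than appending to a growing history; in the paper the states are learned compressed representations, not word counts.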

What carries the argument

The per-thread compression states held in a compact dialogue memory and updated by a retrieve-revise-write-back loop.

If this is right

Inference latency remains constant rather than scaling with total history length.
Perplexity stays flat across hundreds of dialogue turns instead of degrading.
Cross-turn dependencies can be learned with truncated backpropagation-through-time rather than full-history gradients.
The same memory mechanism yields higher scores than baselines on long-form dialogue benchmarks.
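To illustrate what truncating backpropagation-through-time means, here is a minimal sketch on a scalar linear recurrence, with the gradient computed by hand. It shows plain TBPTT only; the recurrence, the squared-error loss, and the `window` parameter are assumptions for illustration and do not reproduce the paper's multi-turn adaptation.

```python
def tbptt_grad(xs, ys, w, window):
    """Gradient of sum_t 0.5*(h_t - y_t)^2 w.r.t. w, with h_t = w*h_{t-1} + x_t.

    The carried state is treated as a constant at every window boundary,
    so no gradient flows further back than `window` steps.
    """
    h = 0.0      # recurrent state, carried across windows
    dh_dw = 0.0  # sensitivity of h to w inside the current window
    grad = 0.0
    for t, (x, y) in enumerate(zip(xs, ys)):
        if t % window == 0:
            dh_dw = 0.0  # "detach": forget how the incoming state depended on w
        dh_dw = h + w * dh_dw  # derivative of w*h_prev + x, using h_prev
        h = w * h + x
        grad += (h - y) * dh_dw
    return grad


if __name__ == "__main__":
    xs, ys = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
    print(tbptt_grad(xs, ys, w=0.5, window=3))  # window covers the sequence: full gradient
    print(tbptt_grad(xs, ys, w=0.5, window=1))  # truncated: an approximation of the above
```

When the window covers the whole sequence the result equals the full-history gradient; shorter windows trade gradient accuracy for bounded memory and compute per update.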

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The thread-based revision pattern could apply to any incremental generation task where context accumulates, such as long-document editing or multi-step planning.
Explicit decomposition into revisable threads may reduce error propagation in other memory-augmented models that currently rely on monolithic summaries.
Stable long-horizon behavior without periodic resets would let deployed dialogue systems maintain quality in extended sessions without external summarization steps.

Load-bearing premise

Conversations can be decomposed into interleaved contextual threads whose compression states can be revised incrementally without information loss or compounding errors across turns.

What would settle it

An experiment on a long dialogue benchmark in which C-DIC perplexity or latency begins to rise after roughly 200 turns would falsify the stability claim.
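One way such a check could be run, sketched under assumptions: compute per-turn perplexity from token-level negative log-likelihoods, then fit a least-squares slope to the turns after a chosen cutoff. The function names, the cutoff, and any threshold for calling a slope "rising" are hypothetical choices, not part of the paper's protocol.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def late_slope(values, start_turn):
    """Least-squares slope of `values` against turn index, from `start_turn` onward."""
    ys = values[start_turn:]
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points after start_turn")
    xs = range(start_turn, start_turn + n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Synthetic series for illustration: flat for 200 turns, then drifting upward.
    series = [10.0] * 200 + [10.0 + 0.5 * i for i in range(100)]
    print(late_slope(series, 200))
```

A slope near zero after the cutoff is consistent with the stability claim; a clearly positive slope in perplexity or latency would count against it. The same function applies to a per-turn latency series.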

Figures

Figures reproduced from arXiv: 2606.12411 by Eunseo Jung, Jaehyeok Kim, Jiachuan Wang, Ka Chun Cheung, Lei Chen, Simon See, Yeongseo Jung, Yongqi Zhang.

**Figure 1.** Long-horizon reference tracking. After 196 intervening turns, the user asks how many dinner parties they attended in the past month. Baselines fail to recover earlier mentions, while C-DIC retrieves thread-relevant memories and answers correctly.

**Figure 2.** Static compression collapses under multi-turn conversation; C-DIC remains stable. Perplexity (↓) for ICAE (zero-shot), ICAE (MSC-tuned), and our method. (a) Static baselines rise sharply after 3-4 consecutive compressions. (b) Moving from single-turn (one-shot) to multi-turn evaluation, perplexity for static models explodes by at least ∼1900% while C-DIC decreases by 70%.
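The percentages in the Figure 2 caption are easier to compare as multiplicative factors. The helper below is a small illustrative conversion (the function name is hypothetical); it uses only the two percentages quoted in the caption.

```python
def ratio_from_percent_change(percent_change: float) -> float:
    """Convert a percent change into a multiplicative factor on the original value."""
    return 1.0 + percent_change / 100.0


if __name__ == "__main__":
    print(ratio_from_percent_change(1900))  # an increase of 1900% is a 20x multiplier
    print(ratio_from_percent_change(-70))   # a decrease of 70% leaves about 0.3x
```

Read this way, the static baselines end up with at least about twenty times their single-turn perplexity, while C-DIC ends up at roughly three tenths of its own.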

**Figure 3.** Overview of retrieval-conditioned incremental compression and ra-TBPTT. Left: We maintain memories M that store a set of compressed thread states that evolve turn by turn. The memories are initialized by compressing the instruction into a thread state. Each turn then follows three steps. (1) Query: given qt, we score the existing thread states and retrieve the relevant states; if the best matching thread i…

**Figure 4.** Latency vs. dialogue length. Total wall-clock time (s) to process a single dialogue when the maximum number of turns is capped at {10, 20, 30, 40, 428}. Bars compare Ours, Full prompting, ICAE (append), and ICAE (one-shot). The hatched segment denotes compression time and the solid segment denotes generation time; "OOM" denotes exceeding the memory budget under the corresponding turn cap. Evaluations use R…

**Figure 5.** Latency vs. performance across turn caps. Dual-axis plot with PPL (left) and total time per dialog in seconds (right) versus the maximum number of retained turns (50-400). Evaluations use the REALTALK all-sessions setting with histories truncated to the most recent turns.

**Figure 6.** Qualitative example #1 (LongMemEval). Multi-session dialogue in chronological order (previous sessions → current session). Dialogue snippet: Session 1∼4 ... (14 turns) ... Session 5 ... (1 turn) ... Turn #16, S1: I love these ideas! I'm definitely going to consider the Global Street Food theme. By the way, I've also had a great experience with a BBQ theme, like the one we had at Mike's place two week…

**Figure 7.** Qualitative example #2 (LongMemEval). Multi-session dialogue in chronological order (previous sessions → current session). Dialogue snippet: Session 1∼11 ... (56 turns) ... Session 12 ... (1 turn) ... Turn #58, S1: I'm also thinking of exploring other platforms like Instagram and Twitter to promote my writing services. Do you have any tips on how to get started with those platforms, especially since …

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

C-DIC proposes per-thread revisable compression states and a retrieve-revise-write-back loop for long dialogues, but the abstract supplies no numbers to check the stability claims.

The main thing to know is that this paper describes a thread-based incremental compression method for multi-turn dialogues, using revisable per-thread states in a compact memory plus an adapted TBPTT, but the abstract gives no quantitative results to evaluate whether it actually delivers stable latency and perplexity over hundreds of turns.

The work is new in its specific combination of treating conversations as interleaved threads, storing compression states that can be revised, and running a lightweight retrieve-revise-write-back loop at each turn to share information across turns. It also adapts truncated backpropagation-through-time to the multi-turn case so cross-turn dependencies can be learned without full-history gradients. The paper does a clear job framing the practical problem of redundant attention costs and compounding errors from naive truncation or summarization, and the per-thread memory idea is a direct attempt to address cross-turn sharing that prior compressors lack.

The soft spot is the missing evidence. The abstract asserts superior performance and stability over long horizons yet reports no metrics, baselines, or protocol, so the central claim cannot be checked. The assumption that conversations decompose cleanly into revisable threads without irreversible loss or accumulating errors is load-bearing, and without any bound on reconstruction fidelity or error growth as turns increase, it is impossible to tell if the observed stability is general or tied to the tested regime. If thread identification turns out to be heuristic or the revision step lossy, the benefit could be narrower than stated.

This is for researchers focused on efficient long-context dialogue systems. A reader working on context management for conversational agents could extract useful framing and a concrete alternative to existing baselines once the experiments are visible.

It deserves a serious referee to assess the full implementation and results.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Context-Driven Incremental Compression (C-DIC) to address growing attention costs in long multi-turn dialogues. It models conversations as interleaved contextual threads whose compression states are stored in a compact dialogue memory and updated via a retrieve-revise-write-back loop; an adaptation of truncated backpropagation-through-time (TBPTT) is used to learn cross-turn dependencies. The central claim is that this yields superior efficiency and robustness relative to prior compressors, with the key empirical result being stable inference latency and perplexity over hundreds of dialogue turns on long-form benchmarks.

Significance. If the stability result holds without compounding reconstruction error, the method would provide a concrete mechanism for maintaining information across very long dialogues while keeping memory and latency bounded, addressing a practical bottleneck in current dialogue systems.

major comments (2)

[Abstract] The claim that C-DIC 'shows stable inference latency and perplexity over hundreds of dialogue turns' is presented without any quantitative results, baselines, metrics, tables, or experimental protocol, so the central empirical assertion cannot be evaluated.
[Abstract] No quantitative bound, plot, or analysis of reconstruction fidelity or error growth rate versus turn count is supplied, leaving the load-bearing assumption that the revise loop incurs 'neither irreversible loss nor accumulating errors' untested.

minor comments (1)

[Abstract] The abstract introduces 'Dialogue memory storing per-thread compression states' and 'TBPTT adaptation' without a brief definition or reference to the section where they are formalized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for stronger quantitative grounding in the abstract. We address each major comment below and outline targeted revisions.

Referee: [Abstract] The claim that C-DIC 'shows stable inference latency and perplexity over hundreds of dialogue turns' is presented without any quantitative results, baselines, metrics, tables, or experimental protocol, so the central empirical assertion cannot be evaluated.

Authors: The abstract is a high-level summary; the full quantitative results (measured latency and perplexity values across turn counts, baseline comparisons, metrics, and the experimental protocol) are reported in Section 4 with tables and figures. We will revise the abstract to include a small number of key quantitative highlights (e.g., specific stability ranges) to make the central claim more self-contained while respecting length constraints. revision: partial

Referee: [Abstract] No quantitative bound, plot, or analysis of reconstruction fidelity or error growth rate versus turn count is supplied, leaving the load-bearing assumption that the revise loop incurs 'neither irreversible loss nor accumulating errors' untested.

Authors: We agree that an explicit analysis of reconstruction fidelity and error growth versus turn count would strengthen support for the revise loop. While empirical stability of perplexity serves as indirect validation, we will add a dedicated plot and quantitative bounds on error accumulation in the revised manuscript to directly test the claim of no compounding errors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no self-referential derivations

The paper introduces C-DIC as an algorithmic approach to incremental context compression via retrieve-revise-write-back and TBPTT adaptation, evaluated empirically on dialogue benchmarks. No equations, fitted parameters, or first-principles derivations are presented that reduce reported stability or performance to a tautological fit or self-definition. Claims rest on experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The derivation chain is self-contained as a proposed engineering solution whose validity is tested externally to its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that conversations admit a thread decomposition whose states can be revised without loss, plus the effectiveness of the retrieve-revise-write-back loop; no free parameters or invented physical entities are mentioned.

axioms (1)

Domain assumption: A conversation can be treated as interleaved contextual threads whose compression states remain revisable across turns.
Rationale: This decomposition is required for the per-thread memory and cross-turn sharing to function as described.

invented entities (1)

Dialogue memory storing per-thread compression states (no independent evidence)
Purpose: Compact storage enabling retrieve-revise-write-back without full history re-encoding.
Status: New data structure introduced by the method; no independent evidence outside the paper is provided.

Reference graph

Works this paper leans on

34 extracted references · 5 canonical work pages

[1] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards
[2] Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.
[3] Deep learning. 2016.
[4] LLMs Get Lost In Multi-Turn Conversation. 2025.
[5] In-context Autoencoder for Context Compression in a Large Language Model. 2024.
[6] Adapting Language Models to Compress Contexts. 2023.
[7] Attention Is All You Need. 2023.
[8] Efficient Transformers: A Survey. 2022.
[9] Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. 2025.
[10] A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. 2024.
[11] Xu, Jing and Szlam, Arthur and Weston, Jason. Beyond Goldfish Memory: Long-Term Open-Domain Conversation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.356
[12] CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. 2023.
[13] Ravaut, Mathieu and Sun, Aixin and Chen, Nancy and Joty, Shafiq. On Context Utilization in Summarization with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.153
[14] MemGPT: Towards LLMs as Operating Systems. 2024.
[15] REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation. 2025.
[16] Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135
[17] Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.
[18] The Power of Scale for Parameter-Efficient Prompt Tuning. 2021.
[19] Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022. doi:10.18653/v1/2022.acl-short.8
[20] Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. 2022.
[21] Learning to Compress Prompts with Gist Tokens. 2024.
[22] Gemini: A Family of Highly Capable Multimodal Models. 2025.
[23] Introducing ChatGPT.
[24] Werbos, P.J. Backpropagation through time: what it does and how to do it.
[25] Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. 2019.
[26] Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[27] Compressive Transformers for Long-Range Sequence Modelling. 2019.
[28] Recurrent Memory Transformer. 2022.
[29] Automated Annotation with Generative AI Requires Validation. 2023.
[30] Calderon, Nitay and Reichart, Roi and Dror, Rotem. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.782
[31] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. 2025.
[32] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. 2023.
[33] InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory. 2024.
[34] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.
