EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Arnav Kundu; Han-Byul Kim; Minsik Cho; Minsoo Kim; Richa Dixit

arxiv: 2509.17396 · v4 · pith:CXMDNPXOnew · submitted 2025-09-22 · 💻 cs.CL


Pith reviewed 2026-05-21 21:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords: KV cache compression · long conversational QA · episodic memory · resource-constrained inference · training-free methods · multi-turn dialogue

The pith

EpiCache clusters long conversation history into episodes and evicts KV cache per episode to bound memory use while retaining near-full accuracy on multi-turn QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EpiCache as a training-free way to manage the exploding KV cache that appears when LLMs keep long dialogue histories. It first splits incoming context into blocks during prefill so peak memory stays fixed, then groups past turns into topic-coherent episodes and drops less relevant keys and values inside each episode. On three long-conversation benchmarks the method delivers up to 30 percent higher accuracy than prior compression schemes, reaches almost the same score as an uncompressed cache at 4-6 times smaller size, and cuts both latency and peak memory. A sympathetic reader would see this as a practical step toward running personalized, multi-hour assistants on phones or laptops that cannot hold millions of tokens in cache.
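The block-wise prefill idea can be illustrated with a minimal sketch. It assumes each token already has an importance score; the `scores` list, the function name `block_prefill`, and the top-k eviction rule are hypothetical stand-ins, since this summary does not state the paper's scoring rule. The point shown is only that evicting after every block keeps the peak cache size at or below the budget plus one block.

```python
from typing import List, Tuple

def block_prefill(scores: List[float], block_size: int, budget: int) -> Tuple[List[int], int]:
    """Process tokens block by block, evicting down to `budget` after each block.

    `scores` is a hypothetical per-token importance score (an assumption for
    illustration). Returns the retained token indices and the peak cache size.
    """
    cache: List[int] = []  # indices of tokens whose KV entries are retained
    peak = 0
    for start in range(0, len(scores), block_size):
        end = min(start + block_size, len(scores))
        cache.extend(range(start, end))   # prefill one block on top of the compressed cache
        peak = max(peak, len(cache))      # the peak occurs right before eviction
        if len(cache) > budget:
            # keep the highest-scoring tokens, then restore positional order
            top = sorted(cache, key=lambda i: scores[i], reverse=True)[:budget]
            cache = sorted(top)
    return cache, peak

scores = [0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.95, 0.05, 0.6]
kept, peak = block_prefill(scores, block_size=3, budget=4)
print(kept, peak)  # [0, 3, 5, 7] 7
```

With post-prefill eviction the peak would equal the full input length (10 here); with block prefill it never exceeds `budget + block_size`.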

Core claim

EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across LongMemEval, Realtalk, and LoCoMo it improves accuracy by up to 30 percent, reaches near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x respectively.

What carries the argument

Episodic KV compression: the step that first clusters conversation turns into coherent episodes and then applies eviction separately inside each episode so that topic-relevant context is retained under a fixed memory budget.
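A rough sketch of the clustering and query-matching steps follows. It assumes conversation segments have already been embedded as vectors and uses plain k-means with fixed initial centroids; the clustering algorithm, the toy 2-D embeddings, and the helper names (`kmeans`, `match_episode`) are assumptions for illustration, not the paper's implementation.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]

def dist2(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors: List[Vec]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest(point: Vec, centroids: List[Vec]) -> int:
    return min(range(len(centroids)), key=lambda k: dist2(point, centroids[k]))

def kmeans(points: List[Vec], init: List[Vec], iters: int = 10) -> Tuple[List[int], List[Vec]]:
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids: List[Vec] = [list(c) for c in init]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = mean(members)
    return labels, centroids

def match_episode(query: Vec, centroids: List[Vec]) -> int:
    """Pick the episode whose centroid is closest to the query embedding."""
    return nearest(query, centroids)

# Toy segment embeddings: two topics, interleaved in time.
segments = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9), (0.05, 0.05)]
labels, centroids = kmeans(segments, init=[segments[0], segments[2]])
episode = match_episode((0.8, 0.7), centroids)
print(labels, episode)  # [0, 0, 1, 1, 0] 1
```

In a full system each episode would own a separately compressed KV cache, and `episode` would select which one to load for answering the query.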

If this is right

Accuracy on long conversational QA rises by as much as 30 percent relative to earlier cache-compression baselines.
Near full-cache performance is retained at compression ratios of 4-6x.
End-to-end latency and peak memory drop by up to 2.4x and 3.7x, respectively, under the same accuracy target.
The approach stays training-free and therefore works on existing models without additional fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same episode-clustering idea could be applied to other long-context workloads such as long-document summarization or multi-document retrieval.
Combining episodic eviction with hardware-aware quantization might produce further memory reductions on mobile chips.
If episode boundaries are detected more accurately, the method could support conversations that stretch over many days without manual resets.

Load-bearing premise

That conversation history can be clustered into coherent episodes whose internal context remains sufficient to avoid errors once some keys and values are dropped.

What would settle it

A multi-turn test set in which accuracy falls sharply below the full-cache baseline precisely on questions that require information from an earlier episode after the eviction step has run.
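Such a test could be scored with a per-category accuracy gap, sketched below. The record format and the category labels (`same_episode`, `cross_episode`) are hypothetical; the sketch only shows how to compare full-cache and compressed-cache correctness on the two kinds of question.

```python
from typing import Dict, List, Tuple

# (category, correct with full cache, correct with compressed cache)
Record = Tuple[str, bool, bool]

def accuracy_gap(records: List[Record]) -> Dict[str, float]:
    """Return full-cache accuracy minus compressed accuracy for each category."""
    totals: Dict[str, Tuple[int, int, int]] = {}
    for category, full_ok, compressed_ok in records:
        n, f, c = totals.get(category, (0, 0, 0))
        totals[category] = (n + 1, f + int(full_ok), c + int(compressed_ok))
    return {cat: (f - c) / n for cat, (n, f, c) in totals.items()}

records = [
    ("same_episode", True, True),
    ("same_episode", True, True),
    ("cross_episode", True, False),
    ("cross_episode", True, True),
]
gaps = accuracy_gap(records)
print(gaps)  # {'same_episode': 0.0, 'cross_episode': 0.5}
```

A gap that is large for cross-episode questions but near zero for same-episode questions would be the failure signature described above.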

Figures

Figures reproduced from arXiv: 2509.17396 by Arnav Kundu, Han-Byul Kim, Minsik Cho, Minsoo Kim, Richa Dixit.

**Figure 1.** KV Cache Management Analysis. (a) Post prefill eviction: eviction after full-context prefill, reducing KV size at decoding but causing unbounded memory usage. (b) Block prefill eviction: input processed in 3-token blocks with patched prompts for scoring, then evicted to 1 token. (c) Top: Peak GPU memory vs. input length on LLaMA-3.2-3B with A100. Bottom: LongConvQA accuracy of KV compression methods under…

**Figure 2.** A controlled experiment that assumes oracle access to the future user query; inserting it as the patched prompt yields the highest accuracy (Exact-Question). Since the dialogue history H in Equation (1) consists of question-answer turns, it offers an opportunity to approximate the future query with semantically related turns. To test this idea, we embed both user queries q1, ..., qN…

**Figure 3.** EpiCache Overview. (a) Offline segmentation and embedding of the conversation, followed by clustering into topical episodes. (b) Building episodic KV caches under a fixed GPU memory usage based on representative segments of each cluster. (c) An incoming query is embedded, matched to the closest episode, and the corresponding cache is retrieved for answer generation.

**Figure 4.** Layer-wise Sensitivity Analysis and KV Budget Allocation. (a) Key states cosine similarity across normalized layer positions. (b) KL divergence is measured between block prefill (M=4K) and full KV answer predictions, with uniform allocation as the baseline. Per-sample KL divergence shifts are shown when applying three allocation strategies (sensitivity-aware, PyramidKV, and retrieval head profiling) on the…

**Figure 5.** LongConvQA evaluation results (Realtalk, LoCoMo, and LongMemEval) with a fixed KV cache budget M across four LLMs. The number of episodes (clusters) is fixed to E=4 in all experiments. The average full KV lengths of the three benchmarks are 26K, 21K, and 20K.

4.1 Setup. Models and Benchmarks. We evaluate on four pretrained LLMs: LLaMA-3.2-3B, LLaMA-3.1-8B (Grattafiori et al., 2024), Q…

**Figure 6.** Memory Scalability up to 100K Context. Conversation histories between user and LLM-based assistant scaled to 100K tokens across four LLMs with LongMemEval. Comparison of InfiniPot and KVzip (M=6K) with EPICACHE (4 episodes, M=6K–24K).

**Figure 7.** Efficiency Analysis in Multi-Turn Conversation. (a) Per-turn decoding latency and peak GPU memory for full KV (100K) and EPICACHE (E=4) with LLaMA-3.2-3B. Query Embed and Match: query encoding and centroid matching. KVs Retrieve: loading episodic cache from CPU to GPU memory. (b) Cumulative episode switches in Realtalk with E=4, showing how often episodes change across a multi-turn conversation. EPICACHE …

Original abstract

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model's memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.
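To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size as a function of context length. The layer count, KV-head count, head dimension, and 2-byte (fp16) storage in `CONFIG` are illustrative assumptions, not values taken from the paper; the 100K and 6K token counts echo the context length and budget mentioned in the Figure 6 caption.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape, chosen only for illustration.
CONFIG = dict(layers=28, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **CONFIG)
full = kv_cache_bytes(100_000, **CONFIG)    # uncompressed 100K-token history
budgeted = kv_cache_bytes(6_000, **CONFIG)  # fixed budget of 6K tokens

print(per_token)                      # 114688 bytes per token
print(f"{full / 2**30:.2f} GiB")      # 10.68 GiB
print(f"{budgeted / 2**30:.2f} GiB")  # 0.64 GiB
```

Because the size is linear in `tokens`, a fixed token budget turns an unbounded footprint into a constant one regardless of how long the conversation runs.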

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EpiCache clusters conversation history into episodes for per-episode KV eviction, which bounds memory better than query-dependent methods but risks losing cross-episode facts in multi-turn queries.


The core idea is straightforward: split long dialogues into coherent episodes, do block-wise prefill to cap peak memory, then evict KV entries episode by episode instead of based on the current query. This avoids the narrow focus that breaks multi-turn coherence in prior compression work. The paper shows this on LongMemEval, Realtalk, and LoCoMo, with accuracy gains up to 30 percent, near full-cache performance at 4-6x compression, and solid drops in latency and peak memory. Those numbers are the practical payoff for resource-constrained settings, and the training-free design makes it easy to try on existing models. The episodic grouping is the clearest addition over earlier block or token-level eviction schemes. The stress-test worry about cross-episode context is real and worth checking in the full text; if later questions reference facts from an earlier evicted episode, the cache cannot recover them, and the reported accuracy would suffer. The abstract gives no specifics on how episodes are detected, what exact eviction rule is used inside each episode, or whether the baselines include strong recent compressors. Without those details or statistical tests, the gains could shrink under different hyperparameter choices. This is aimed at engineers deploying long-context chat on phones or edge devices rather than theorists. It is a useful incremental step with clear motivation and relevant benchmarks, so it should go to peer review for the authors to add the missing implementation details and robustness checks.

Referee Report

2 major / 2 minor

Summary. The paper introduces EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. It bounds cache growth via block-wise prefill and preserves topic-relevant context by clustering conversation history into coherent episodes followed by episode-specific KV cache eviction. On LongMemEval, Realtalk, and LoCoMo benchmarks, it reports accuracy gains of up to 30%, near full-cache accuracy at 4-6x compression, and reductions in latency (up to 2.4x) and peak memory (up to 3.7x).

Significance. If the reported gains hold after addressing details on cross-episode handling and baseline controls, EpiCache could offer a practical, training-free solution for resource-constrained long-context LLMs, directly tackling peak memory spikes and multi-turn failure modes that plague query-dependent eviction methods.

major comments (2)

[§3] §3 (Method), episodic clustering and eviction description: The central claim that episode-specific eviction preserves all topic-relevant tokens for future multi-turn queries rests on the assumption that coherent episodes capture inter-episode dependencies and gradual topic drift. No explicit mechanism (e.g., cross-episode token retention or drift detection) is described, and the abstract's motivation about avoiding query-dependent failure cases is not backed by targeted experiments on queries referencing earlier evicted episodes. This directly bears on the reported accuracy and 'near full-cache' claims under compression.
[§4] §4 (Experiments), baseline and statistical details: The abstract and results claim up to 30% accuracy improvement and 4-6x compression with near full-cache performance, but provide no specifics on exact eviction criteria, episode boundary tuning, baseline implementations, or statistical significance testing. Post-hoc adjustments to clustering parameters could inflate the gains, weakening the load-bearing empirical support for the framework's superiority.

minor comments (2)

[Abstract] Notation for compression ratios and memory metrics should be defined consistently in the first use (e.g., distinguish peak vs. average memory).
[§4] Figure captions for latency/memory plots should include exact model sizes and hardware used to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to improve the manuscript.


Referee: [§3] §3 (Method), episodic clustering and eviction description: The central claim that episode-specific eviction preserves all topic-relevant tokens for future multi-turn queries rests on the assumption that coherent episodes capture inter-episode dependencies and gradual topic drift. No explicit mechanism (e.g., cross-episode token retention or drift detection) is described, and the abstract's motivation about avoiding query-dependent failure cases is not backed by targeted experiments on queries referencing earlier evicted episodes. This directly bears on the reported accuracy and 'near full-cache' claims under compression.

Authors: We agree that §3 would benefit from greater clarity on inter-episode handling. EpiCache forms episodes via semantic similarity clustering of consecutive turns, which by design groups contextually related content to reduce cross-episode dependencies; block-wise prefill further limits peak memory without requiring full-history retention. We acknowledge the absence of explicit drift detection or dedicated experiments on queries that reference earlier episodes. In revision we will expand §3 with a formal description of the clustering objective and add targeted experiments evaluating accuracy on such cross-episode queries to substantiate the near-full-cache claims. revision: yes
Referee: [§4] §4 (Experiments), baseline and statistical details: The abstract and results claim up to 30% accuracy improvement and 4-6x compression with near full-cache performance, but provide no specifics on exact eviction criteria, episode boundary tuning, baseline implementations, or statistical significance testing. Post-hoc adjustments to clustering parameters could inflate the gains, weakening the load-bearing empirical support for the framework's superiority.

Authors: We accept that reproducibility requires these details. The revised manuscript will specify: (i) eviction criteria inside each episode (recency-weighted attention scores with a fixed threshold), (ii) episode boundary detection (cosine similarity threshold of 0.75 selected on a held-out validation split), (iii) baseline re-implementations (exact hyperparameters from the original papers), and (iv) statistical tests (paired t-tests across five random seeds with reported p-values). Clustering parameters were fixed prior to final evaluation; we will add an appendix table documenting the validation procedure to rule out post-hoc inflation. revision: yes
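The boundary rule mentioned in the simulated rebuttal (a cosine-similarity threshold of 0.75 between consecutive turns) can be sketched as follows. This is an illustration of that one sentence only: the toy turn embeddings and the function names are hypothetical, and nothing here is confirmed as the authors' actual procedure.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_by_threshold(embeddings: List[Sequence[float]],
                         threshold: float = 0.75) -> List[List[int]]:
    """Start a new episode whenever consecutive turns fall below the threshold."""
    if not embeddings:
        return []
    episodes = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            episodes.append([i])    # topic shift: open a new episode
        else:
            episodes[-1].append(i)  # same topic: extend the current episode
    return episodes

# Toy turn embeddings: topic A, topic A, topic B, topic B, back to topic A.
turns = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.05)]
episodes = segment_by_threshold(turns)
print(episodes)  # [[0, 1], [2, 3], [4]]
```

Note that a purely consecutive rule places the final turn in its own episode even though it returns to the first topic, which is one reason a clustering step over all segments may group turns differently.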

Circularity Check

0 steps flagged

No significant circularity in EpiCache's method description or claims


The paper describes a training-free framework using block-wise prefill and episodic clustering for KV cache eviction, with performance claims (accuracy gains up to 30%, near full-cache results under compression, latency/memory reductions) grounded directly in experiments on LongMemEval, Realtalk, and LoCoMo benchmarks. No equations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or ansatzes appear in the provided abstract or method outline. The derivation chain consists of straightforward algorithmic steps validated externally, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about LLM cache semantics and the value of topic coherence; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Conversation history can be partitioned into coherent episodes that share topic relevance for future queries. This is central to the episodic compression strategy described in the abstract.

pith-pipeline@v0.9.0 · 5763 in / 1212 out tokens · 28105 ms · 2026-05-21T21:50:53.477286+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 13 internal anchors

[1] GQA: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895--4901, Singapor... doi:10.18653/v1/2023.emnlp-main.298

[2] Introducing the next generation of Claude
Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024

[3] k-means++: the advantages of careful seeding
David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 9780898716245

[4] Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

[5] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2025. URL https://arxiv.org/abs/2406.02069

[6] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

[7] FINCH: Prompt-guided key-value cache compression for large language models
Giulio Corallo and Paolo Papotti. FINCH: Prompt-guided key-value cache compression for large language models. Transactions of the Association for Computational Linguistics, 12:1517--1532, 2024. doi:10.1162/tacl_a_00716. URL https://aclanthology.org/2024.tacl-1.83/

[8] FlashAttention-2: Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

[9] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550, 2024

[10] Learning towards conversational AI: A survey
Tingchen Fu, Shen Gao, Xueliang Zhao, Ji-Rong Wen, and Rui Yan. Learning towards conversational AI: A survey. AI Open, 3:14--28, 2022. ISSN 2666-6510. doi:10.1016/j.aiopen.2022.02.001. URL https://www.sciencedirect.com/science/article/pii/S2666651022000079

[11] Discourse segmentation of multi-party conversation
Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 562--569, Sapporo, Japan, July 2003. Association for Computational Linguistics. doi:10.3115/1075096.1075167. URL https://aclanthology...

[12] The Llama 3 Herd of Models
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[13] A^2ATS: Retrieval-based KV cache reduction via windowed rotary position embedding and query-aware vector quantization
Junhui He, Junna Xing, Nan Wang, Rui Xu, Shangyu Wu, Peng Zhou, Qiang Liu, Chun Jason Xue, and Qingan Li. A^2ATS: Retrieval-based KV cache reduction via windowed rotary position embedding and query-aware vector quantization. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computatio... doi:10.18653/v1/2025.findings-acl.644

[14] KVQuant: Towards 10 million context length LLM inference with KV cache quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024

[15] Squeezed attention: Accelerating long context length LLM inference
Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. Squeezed attention: Accelerating long context length LLM inference. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meet... doi:10.18653/v1/2025.acl-long.1568

[16] Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxi...

[17] Topic segmentation and labeling in asynchronous conversations
S. Joty, G. Carenini, and R. T. Ng. Topic segmentation and labeling in asynchronous conversations. Journal of Artificial Intelligence Research, 47:521–573, July 2013. ISSN 1076-9757. doi:10.1613/jair.3940. URL http://dx.doi.org/10.1613/jair.3940

[18] KVzip: Query-agnostic KV cache compression with context reconstruction
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction, 2025. URL https://arxiv.org/abs/2505.23416

[19] InfiniPot: Infinite context processing on memory-constrained LLMs
Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. InfiniPot: Infinite context processing on memory-constrained LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16046--16060, Miami, Florida, USA, November 2024. Association for Comput... doi:10.18653/v1/2024.emnlp-main.897

[20] Booksum: A collection of datasets for long-form narrative summarization
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization, 2022. URL https://arxiv.org/abs/2105.08209

[21] Realtalk: A 21-day real-world dataset for long-term conversation
Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. Realtalk: A 21-day real-world dataset for long-term conversation. arXiv preprint arXiv:2502.13270, 2025

[22] SnapKV: LLM knows what you are looking for before generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=poE54GOq2l

[23] ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression
Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression, 2024a. URL https://arxiv.org/abs/2412.03213

[24] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024b

[25] Evaluating Very Long-Term Conversational Memory of LLM Agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851--13870, Bangkok... doi:10.18653/v1/2024.acl-long.747

[26] The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence, 2025. Accessed: 2025-01-25

[27] KV-cache compression leaderboard
NVIDIA. KV-cache compression leaderboard. https://huggingface.co/spaces/nvidia/kvpress-leaderboard, 2025. Accessed: 2025-09-01

[28] GPT-4 Technical Report
OpenAI. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

[29] KeyDiff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments
Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Chris Lott. KeyDiff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments, 2025. URL https://arxiv.org/abs/2504.15364

[30] Qwen2.5 Technical Report
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[31] Revisiting clustering for efficient unsupervised dialogue structure induction
Maarten Raedt, Fréderic Godin, Chris Develder, and Thomas Demeester. Revisiting clustering for efficient unsupervised dialogue structure induction. Applied Intelligence, 54:1--28, 04 2024. doi:10.1007/s10489-024-05455-5

[32] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024
[33]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084

work page internal anchor Pith review Pith/arXiv arXiv 2019
[34]

o rn Gamb \

Gregor Sieber and Brigitte Krenn. Episodic memory for companion dialogue. In Yorick Wilks, Bj \"o rn Gamb \"a ck, and Morena Danieli (eds.), Proceedings of the 2010 Workshop on Companionable Dialogue Systems, pp.\ 1--6, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-2701/

work page 2010
[35]

QUEST : Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST : Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=KzACYw0MTV

work page 2024
[36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Longmemeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=pZiyCaVuti

work page 2025
[38]

Retrieval head mechanistically explains long-context factuality

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, 2025 b . URL https://openreview.net/forum?id=EytBpUGB1Z

work page 2025
[39]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

work page 2024
[40]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.ne...

[43]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19724--19731, Mar. 2024. doi:10.1609/aaai.v38i17.29946. URL https://ojs.aaai.org/index.php/AAAI/article/view/29946


[1]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895--4901, Singapor...

doi:10.18653/v1/2023.emnlp-main.298

[2]

Introducing the next generation of Claude

Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024

[3]

k-means++: the advantages of careful seeding

David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027--1035, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 9780898716245

[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

[5]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling, 2025. URL https://arxiv.org/abs/2406.02069

[6]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

[7]

FINCH: Prompt-guided key-value cache compression for large language models

Giulio Corallo and Paolo Papotti. FINCH: Prompt-guided key-value cache compression for large language models. Transactions of the Association for Computational Linguistics, 12: 1517--1532, 2024. doi:10.1162/tacl_a_00716. URL https://aclanthology.org/2024.tacl-1.83/

[8]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

[9]

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

[10]

Learning towards conversational ai: A survey

Tingchen Fu, Shen Gao, Xueliang Zhao, Ji-Rong Wen, and Rui Yan. Learning towards conversational ai: A survey. AI Open, 3: 14--28, 2022. ISSN 2666-6510. doi:10.1016/j.aiopen.2022.02.001. URL https://www.sciencedirect.com/science/article/pii/S2666651022000079

[11]

Discourse segmentation of multi-party conversation

Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 562--569, Sapporo, Japan, July 2003. Association for Computational Linguistics. doi:10.3115/1075096.1075167. URL https://aclanthology...

[12]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[13]

A^2ATS: Retrieval-based KV cache reduction via windowed rotary position embedding and query-aware vector quantization

Junhui He, Junna Xing, Nan Wang, Rui Xu, Shangyu Wu, Peng Zhou, Qiang Liu, Chun Jason Xue, and Qingan Li. A^2ATS: Retrieval-based KV cache reduction via windowed rotary position embedding and query-aware vector quantization. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computatio...

doi:10.18653/v1/2025.findings-acl.644

[14]

Kvquant: Towards 10 million context length llm inference with kv cache quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024

[15]

Squeezed attention: Accelerating long context length LLM inference

Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. Squeezed attention: Accelerating long context length LLM inference. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meet...

doi:10.18653/v1/2025.acl-long.1568

[16]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

[17]

S. Joty, G. Carenini, and R. T. Ng. Topic segmentation and labeling in asynchronous conversations. Journal of Artificial Intelligence Research, 47: 521--573, July 2013. ISSN 1076-9757. doi:10.1613/jair.3940. URL http://dx.doi.org/10.1613/jair.3940

[18]

Kvzip: Query-agnostic kv cache compression with context reconstruction

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction, 2025. URL https://arxiv.org/abs/2505.23416

[19]

InfiniPot: Infinite context processing on memory-constrained LLMs

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. InfiniPot: Infinite context processing on memory-constrained LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16046--16060, Miami, Florida, USA, November 2024. Association for Comput...

doi:10.18653/v1/2024.emnlp-main.897

[20]

Booksum: A collection of datasets for long-form narrative summarization

Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization, 2022. URL https://arxiv.org/abs/2105.08209

[21]

Realtalk: A 21-day real-world dataset for long-term conversation

Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. Realtalk: A 21-day real-world dataset for long-term conversation. arXiv preprint arXiv:2502.13270, 2025

[22]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=poE54GOq2l

[23]

Clusterkv: Manipulating llm kv cache in semantic space for recallable compression

Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. Clusterkv: Manipulating llm kv cache in semantic space for recallable compression, 2024a. URL https://arxiv.org/abs/2412.03213

[24]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024b

[25]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851--13870, Bangkok...

doi:10.18653/v1/2024.acl-long.747

[26]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation

Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence, 2025. Accessed: 2025-01-25

[27]

Kv-cache compression leaderboard

NVIDIA. Kv-cache compression leaderboard. https://huggingface.co/spaces/nvidia/kvpress-leaderboard, 2025. Accessed: 2025-09-01

[28]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

[29]

Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments

Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Chris Lott. Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments, 2025. URL https://arxiv.org/abs/2504.15364

[30]

Qwen2.5 Technical Report

Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[31]

Revisiting clustering for efficient unsupervised dialogue structure induction

Maarten Raedt, Fréderic Godin, Chris Develder, and Thomas Demeester. Revisiting clustering for efficient unsupervised dialogue structure induction. Applied Intelligence, 54: 1--28, Apr. 2024. doi:10.1007/s10489-024-05455-5

[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

[33]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. URL https://arxiv.org/abs/1908.10084

[34]

Episodic memory for companion dialogue

Gregor Sieber and Brigitte Krenn. Episodic memory for companion dialogue. In Yorick Wilks, Björn Gambäck, and Morena Danieli (eds.), Proceedings of the 2010 Workshop on Companionable Dialogue Systems, pp. 1--6, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://aclanthology.org/W10-2701/

[35]

QUEST: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=KzACYw0MTV

[36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971
