LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

Chongjun Tu; Jiakang Yuan; Peng Ye; Tao Chen; Yichen Jiang

arxiv: 2601.11913 · v2 · submitted 2026-01-17 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-16 13:18 UTC · model grok-4.3


keywords: multi-agent systems · long-context understanding · LSTM · large language models · question answering · hallucination mitigation · gated memory


The pith

A multi-agent system modeled on LSTM gates processes long contexts by controlling information flow across segments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LSTM-MAS, a chained multi-agent architecture that copies the gated memory structure of LSTM networks to handle extended text inputs in large language models. Each segment passes through four specialized agents that respectively ingest new content, drop redundancies, detect errors, and regulate what gets retained or passed forward. The design targets the error buildup and hallucination spread that plague existing multi-agent long-context methods. If the mapping works, the system would deliver reliable comprehension over documents far longer than standard context windows without the usual compute penalties of attention scaling.

Core claim

LSTM-MAS organizes agents in a chained architecture where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent for globally regulating information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation.
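For orientation, the gates named in this analogy are those of the now-standard LSTM cell. The equations below are the usual textbook formulation (including the forget gate), not something reproduced from the paper; the weight and bias symbols $W$, $U$, $b$ are the conventional names.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate content)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive cell-state update $c_t$ is the path usually identified with the constant error carousel. Under the paper's mapping, $x_t$ would correspond to a text segment and $c_t$ to the memory handed along the chain, with the worker, filter, judge, and manager standing in for $i_t$, $f_t$, the carousel, and $o_t$ respectively.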

What carries the argument

Chained architecture of worker, filter, judge, and manager agents that emulate LSTM input gate, forget gate, constant error carousel, and output gate for selective memory retention.
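The control flow this implies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: every name (`Node`, `run_chain`, the `toy_*` functions) is hypothetical, each agent is reduced to a plain text-to-text callable where a real system would wrap an LLM call, and the manager is placed at the end of the chain, which is one reading of the Figure 3 caption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures: in a real system each callable would wrap an LLM call.
Worker = Callable[[str, str, str], str]  # (question, segment, memory) -> notes
Filter = Callable[[str, str], str]       # (notes, memory) -> notes minus repeats
Judge = Callable[[str, str], bool]       # (notes, segment) -> are the notes supported?
Manager = Callable[[str, str], str]      # (question, memory) -> final answer


@dataclass
class Node:
    worker: Worker
    filter: Filter
    judge: Judge


def run_chain(question: str, segments: List[str], node: Node, manager: Manager) -> str:
    memory = ""  # plays the role of the cell state carried along the chain
    for segment in segments:
        notes = node.worker(question, segment, memory)  # ingest new content
        notes = node.filter(notes, memory)              # drop what memory already holds
        if notes and node.judge(notes, segment):        # block unsupported notes
            memory = (memory + "\n" + notes).strip()
    return manager(question, memory)                    # decide what is emitted


# Toy stand-ins so the sketch runs without a model.
def toy_worker(question: str, segment: str, memory: str) -> str:
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    sentences = [s.strip() for s in segment.split(".") if s.strip()]
    return "\n".join(s for s in sentences if keywords & {w.lower() for w in s.split()})


def toy_filter(notes: str, memory: str) -> str:
    seen = set(memory.splitlines())
    return "\n".join(line for line in notes.splitlines() if line not in seen)


def toy_judge(notes: str, segment: str) -> bool:
    # Accept notes only if every line appears verbatim in the segment.
    return all(line in segment for line in notes.splitlines())


def toy_manager(question: str, memory: str) -> str:
    return memory


segments = ["Ada was born in London. She liked tea.",
            "Ada was born in London. Ada wrote programs."]
answer = run_chain("Where was Ada born?", segments,
                   Node(toy_worker, toy_filter, toy_judge), toy_manager)
print(answer)
```

In this toy run the second segment's repeated sentence is removed by the filter, so memory holds the fact once. Whether LLM-backed filter and judge agents behave this selectively is exactly the open question raised under the load-bearing premise.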

If this is right

Delivers relative gains of 10.80% to 122.19% over CoA, the previous best multi-agent baseline, on NarrativeQA, Qasper, HotpotQA, 2WikiMQA, and MuSiQue.
Reduces accumulation of errors when text is broken into sequential segments.
Avoids the extra compute cost of expanding a single model's context window.
Allows selective retention of long-range dependencies while discarding irrelevant material.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gated-division approach could transfer to long-document summarization or multi-turn dialogue without retraining the underlying models.
Explicit agent roles may yield more traceable decision logs than end-to-end attention over long sequences.
If the LSTM analogy holds, other recurrent or state-space architectures could supply further templates for multi-agent coordination.

Load-bearing premise

Mapping the four LSTM gate functions to four distinct agent roles will produce actual gated control that blocks error and hallucination propagation between successive text segments.

What would settle it

On a held-out long-context QA benchmark, LSTM-MAS shows no gain over CoA, or its rate of factual inconsistencies rises measurably with the number of processed segments.
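One way to make the second half of this criterion concrete is to regress the inconsistency rate on the number of segments processed and look at the sign and size of the slope. The sketch below assumes such rates have already been measured; the numbers are made up for illustration and the function name `ols_slope` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of ys regressed on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


# Made-up numbers for illustration only: segments processed vs. share of
# answers containing a factual inconsistency.
segments_processed = [2, 4, 8, 16]
inconsistency_rate = [0.05, 0.06, 0.09, 0.15]
print(ols_slope(segments_processed, inconsistency_rate))
```

A positive slope on its own is not decisive; a real test would also need enough runs per segment count to put an interval around it.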

Figures

Figures reproduced from arXiv: 2601.11913 by Chongjun Tu, Jiakang Yuan, Peng Ye, Tao Chen, Yichen Jiang.

**Figure 1.** Analogical comparison between three multi-agent systems (MASs) and traditional neural network structures. Specifically, the upper left panel draws …

**Figure 2.** Test Results of Qwen2.5-0.5B (left) and Qwen2.5-1.5B (right) as base models on vanilla, LightRAG, CoA, and LSTM-MAS. LSTM-MAS achieved …

**Figure 3.** Overview of LSTM-MAS, which is organized in a chain structure. Each node includes a worker agent, a filter agent and a judge agent. At the end …

**Figure 4.** Needle-in-Haystack Test Results with Qwen2.5 (left) and ERNIE 3.5 (right) models as Filter. This indicates that the ERNIE series models are more …

**Figure 5.** A case study of CoA (left) and LSTM-MAS (right). This demonstrates that the filter and judger in LSTM-MAS can effectively recover errors in the …

Original abstract

Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single-LLM-based methods primarily reduce the context window or optimize the attention mechanism, but they often encounter additional computational costs or constrained expanded context length. While multi-agent-based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short-Term Memory (LSTM) architecture to design a Multi-Agent System called LSTM-MAS, emulating LSTM's hierarchical information flow and gated memory mechanisms for long-context understanding. Specifically, LSTM-MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent for globally regulates information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi-agent approach, CoA, our model achieves improvements of 97.97%, 65.75%, 122.19%, 39.61% and 10.80% on Narrative QA, Qasper, HotpotQA, 2WikiMQA and MuSiQue, respectively.
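The percentages in the abstract are relative improvements over the baseline score. A small sketch of that arithmetic shows why they are hard to interpret without absolute scores. The scores used here are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline`: 100 * (new - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (new - baseline) / baseline


# Hypothetical scores, NOT taken from the paper: two very different absolute
# gaps that produce the same relative improvement.
low = relative_improvement(baseline=10.0, new=20.0)    # 10-point absolute gain
high = relative_improvement(baseline=40.0, new=80.0)   # 40-point absolute gain
print(low, high)
```

Both calls return the same percentage, so a reported gain near 100% is compatible with either a weak baseline that was doubled or a strong one that was doubled.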

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LSTM-MAS claims very large gains over CoA on long-context QA by chaining four agent roles modeled on LSTM gates, but the abstract supplies no methods, ablations, or stats to show the mapping actually drives the results.


The paper's main move is to organize a multi-agent system into a fixed chain (worker for segment reading, filter for dropping repeats, judge for catching mistakes, manager for global decisions), each tied to one LSTM gate. It reports big lifts over the prior multi-agent baseline CoA: roughly 98% on NarrativeQA, 66% on Qasper, 122% on HotpotQA, and smaller but still positive numbers on the other two sets. That architecture is the clearest new piece; the specific gate-to-role mapping and the claim that it limits error carry-over across segments do not appear in the cited prior work.

The motivation is straightforward and the problem it targets (hallucination buildup in long multi-agent traces) is real. The write-up keeps the analogy tight enough that a reader can see how the pieces are supposed to interact.

The weak part is the evidence. The abstract gives no protocol, no token or call budgets matched to CoA, no variance numbers, and no ablation that removes the filter or judge to test whether the gate logic is doing the work. Without those, the size of the gains makes it hard to rule out extra inference steps or prompt differences as the real cause. The central assumption, that the filter and judge will behave like selective retention rather than just passing summaries, remains unshown.

This is for groups already working on modular agent pipelines for document QA or multi-hop reasoning. Someone looking for a concrete way to add memory control to agents could borrow the role definitions even if the numbers need re-checking. I would send it to peer review so the methods section can be examined and the missing controls requested; the structure is clear enough to be worth that step.

Referee Report

3 major / 2 minor

Summary. The paper proposes LSTM-MAS, a multi-agent system for long-context understanding that emulates LSTM gated memory mechanisms via a chained architecture of worker, filter, judge, and manager agents (mapping to input gate, forget gate, constant error carousel, and output gate). The design aims to enable controlled cross-segment information transfer that prevents error accumulation and hallucination propagation. The central empirical claim is that LSTM-MAS outperforms the prior best multi-agent baseline CoA by 97.97%, 65.75%, 122.19%, 39.61%, and 10.80% on NarrativeQA, Qasper, HotpotQA, 2WikiMQA, and MuSiQue, respectively.

Significance. If the performance deltas can be shown to arise from the proposed information-flow controls rather than from unaccounted differences in compute or prompting, the work would supply a concrete, modular template for multi-agent long-context systems that explicitly models retention, error detection, and global regulation; this could be useful for tasks requiring synthesis across long documents where standard retrieval or summarization pipelines still suffer from drift.

major comments (3)

[Abstract] Abstract: the reported percentage improvements (97.97% on NarrativeQA etc.) are presented without any baseline absolute scores, standard deviations, number of runs, or description of the evaluation protocol, so the central performance claim cannot be evaluated for statistical reliability or sensitivity to prompt variations.
[Methods / Architecture] Architecture description: the filter agent's redundancy-reduction rule and the judge agent's error-detection signal are described only at the level of the LSTM analogy; without pseudocode, decision thresholds, or state-update equations it is impossible to verify that these components implement selective retention and correction rather than generic summarization or re-prompting.
[Experiments] Experiments: no ablation studies isolate the contribution of the LSTM-gate mapping, and no token- or call-budget comparison versus CoA is supplied; therefore the large deltas cannot be attributed to the claimed controlled information transfer rather than to higher total LLM usage or longer effective context per query.
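The budget comparison requested in the third major comment could be gathered with a thin metering wrapper around whatever function issues model calls. The sketch below is an assumption-laden illustration: `Budget` and `metered` are hypothetical names, the wrapped `call_llm` is any prompt-to-reply callable, and whitespace word counts stand in for real tokenizer counts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    calls: int = 0
    tokens_in: int = 0
    tokens_out: int = 0


def metered(call_llm: Callable[[str], str], budget: Budget) -> Callable[[str], str]:
    """Wrap a prompt->reply function so every call is counted in `budget`."""
    def wrapper(prompt: str) -> str:
        reply = call_llm(prompt)
        budget.calls += 1
        # Crude proxy: whitespace-separated words instead of tokenizer tokens.
        budget.tokens_in += len(prompt.split())
        budget.tokens_out += len(reply.split())
        return reply
    return wrapper


# Stand-in model so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "ok fine"


budget = Budget()
ask = metered(fake_llm, budget)
ask("summarize segment one")
ask("judge the summary")
print(budget)
```

Running both LSTM-MAS and CoA through the same wrapper would give call and token totals per question, which is the comparison needed to separate architectural gains from extra inference.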

minor comments (2)

[Abstract] Abstract: the phrase 'for globally regulates information propagation' is ungrammatical; after 'for', 'regulates' should be 'regulating'.
[Introduction / Methods] Notation: the mapping of LSTM components to agents is introduced without a summary table, making it harder to track the correspondence across the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the current manuscript requires additional details to substantiate the performance claims, clarify the architecture, and isolate the contributions of the proposed components. We will revise the paper accordingly to address all major comments.


Referee: [Abstract] Abstract: the reported percentage improvements (97.97% on NarrativeQA etc.) are presented without any baseline absolute scores, standard deviations, number of runs, or description of the evaluation protocol, so the central performance claim cannot be evaluated for statistical reliability or sensitivity to prompt variations.

Authors: We agree that absolute baseline scores, standard deviations, run counts, and evaluation protocol details are essential for assessing the reliability of the reported improvements. In the revised manuscript, we will expand the abstract and add a dedicated results table that includes absolute F1/accuracy scores for both LSTM-MAS and CoA, along with standard deviations from multiple runs (we performed 3 independent runs per dataset) and a brief description of the evaluation protocol, including prompt templates and dataset splits used.

revision: yes

Referee: [Methods / Architecture] Architecture description: the filter agent's redundancy-reduction rule and the judge agent's error-detection signal are described only at the level of the LSTM analogy; without pseudocode, decision thresholds, or state-update equations it is impossible to verify that these components implement selective retention and correction rather than generic summarization or re-prompting.

Authors: We acknowledge that the current description relies primarily on the LSTM analogy without sufficient implementation specifics. In the revision, we will add pseudocode for the filter and judge agents, explicit decision thresholds (e.g., cosine similarity threshold of 0.85 for redundancy filtering), and state-update equations that formalize how information is selectively retained or corrected. These additions will clarify the mechanisms for controlled retention and error detection beyond generic summarization.

revision: yes

Referee: [Experiments] Experiments: no ablation studies isolate the contribution of the LSTM-gate mapping, and no token- or call-budget comparison versus CoA is supplied; therefore the large deltas cannot be attributed to the claimed controlled information transfer rather than to higher total LLM usage or longer effective context per query.

Authors: We agree that ablations and resource comparisons are required to attribute performance gains specifically to the LSTM-inspired gating rather than differences in compute. The revised version will include ablation studies (e.g., variants without the judge agent or filter agent) to isolate the contribution of the gate mappings. We will also report total token usage and LLM call counts for LSTM-MAS versus CoA across all datasets to enable direct comparison of efficiency.

revision: yes
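The simulated rebuttal names a cosine-similarity threshold of 0.85 for redundancy filtering but does not say what is embedded or how the rule is applied. The sketch below shows one possible reading under explicit assumptions: bag-of-words count vectors stand in for a real embedding model, and a note is dropped when its similarity to any already-kept note reaches the threshold. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Assumption: word counts as a stand-in for a learned text embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def drop_redundant(notes: List[str], threshold: float = 0.85) -> List[str]:
    """Keep a note only if it is below `threshold` similarity to every kept note."""
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for note in notes:
        vec = embed(note)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(note)
            kept_vecs.append(vec)
    return kept


notes = ["the cat sat on the mat", "the cat sat on the mat", "dogs bark loudly"]
print(drop_redundant(notes))
```

A threshold rule like this is order-dependent (the first of two near-duplicates wins), which is one of the details a revised methods section would need to pin down.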

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces LSTM-MAS as a novel multi-agent architecture explicitly constructed by analogy to LSTM gates (worker=input, filter=forget, judge=constant error carousel, manager=output). No equations, fitted parameters, or self-citations are shown that reduce the reported empirical gains (e.g., 97.97% on NarrativeQA) to quantities already present in the model definition or prior self-work. The performance deltas are presented as evaluation outcomes on standard benchmarks, not as predictions forced by construction. The derivation chain is self-contained as an empirical architecture proposal without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested mapping from LSTM gates to agent roles and on the assumption that this mapping solves error accumulation; no free parameters are stated in the abstract.

axioms (1)

domain assumption: LSTM gated memory mechanisms can be directly emulated by specialized LLM agents to control long-range information flow.
Invoked as the core design principle in the abstract.

invented entities (1)

LSTM-MAS chained architecture with worker, filter, judge, and manager agents (no independent evidence)
purpose: To emulate LSTM gates for selective long-term dependency modeling and error control
Newly introduced construct whose effectiveness is asserted but not independently verified outside the reported experiments.

pith-pipeline@v0.9.0 · 5575 in / 1255 out tokens · 41309 ms · 2026-05-16T13:18:06.410435+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory
cs.AI · 2026-04 · unverdicted · novelty 5.0

FSFM is a biologically-inspired selective forgetting framework for LLM agents that claims to boost access efficiency by 8.49%, content quality by 29.2% signal-to-noise, and eliminate security risks entirely through a ...

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] J. Achiam et al., “GPT-4 technical report,” March 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
[2] J. Bai et al., “Qwen technical report,” September 2023. [Online]. Available: https://arxiv.org/abs/2309.16609
[3] Z. Cai et al., “Internlm2 technical report,” March 2024
[4] A. Zeng et al., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” June 2024
[5] A. Grattafiori et al., “The Llama 3 herd of models,” November 2024. [Online]. Available: https://arxiv.org/abs/2407.21783
[6] D. Li*, R. Shao* et al., “How long can open-source llms truly promise on context length?” June 2023. [Online]. Available: https://lmsys.org/blog/2023-06-29-longchat
[7] A. Li et al., “Minimax-01: Scaling foundation models with lightning attention,” January 2025. [Online]. Available: https://arxiv.org/abs/2501.08313
[8] L. Zheng et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” June 2023
[9] Z. Yang et al., “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” in Proc. Conf. Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November 2018, pp. 2369–2380
[10] X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa, “Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps,” in Proc. Conf. Association for Computational Linguistics, Barcelona, Spain, December 2020, pp. 6609–6625
[11] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Musique: Multihop questions via single-hop question composition,” Proc. Conf. Association for Computational Linguistics, vol. 10, pp. 539–554, May 2022
[12] M. Zhong et al., “QMSum: A new benchmark for query-based multi-domain meeting summarization,” in Proc. Conf. Association for Computational Linguistics, June 2021, pp. 5905–5921
[13] L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization,” in Proc. Conf. Association for Computational Linguistics, June 2021, pp. 1419–1436
[14] Y. Zhang et al., “An exploratory study on long dialogue summarization: What works and what’s next,” in Proc. Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, November 2021, pp. 4426–4433
[15] Y. An, H. Xue, X. Zhao, N. Xu, P. Fang, and X. Geng, “Leveraging bilateral correlations for multi-label few-shot learning,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 4, pp. 6816–6828, April 2025
[16] Y. Zhang, M. Gong, J. Li, K. Feng, and M. Zhang, “Few-shot learning with enhancements to data augmentation and feature extraction,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 4, pp. 6655–6668, June 2025
[17] M. R. Sudarshan and C. Kumbhar, “Comparative analysis of zero-shot, few-shot, and one-shot learning in natural language processing,” in 2024 International Conference on Augmented Reality, Intelligent Systems, and Industrial Automation (ARIIA), December 2024, pp. 1–6
[19] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Inf. Process. Syst., vol. 33, pp. 9459–9474, May 2021
[20] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extending context window of large language models via positional interpolation,” June 2025. [Online]. Available: https://arxiv.org/abs/2312.11805
[21] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, November 2024
[22] T. Munkhdalai, M. Faruqui, and S. Gopal, “Leave no context behind: Efficient infinite context transformers with infini-attention,” August
[23] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. [Online]. Available: https://arxiv.org/abs/2404.07143
[24] X. He, J. Liu, and Y. Duan, “2-d transformer: Extending large language models to long-context with few memory,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 8, pp. 15 294–15 308, March 2025
[25] J. Zhao et al., “LONGAGENT: Achieving question answering for 128k-token-long documents through multi-agent collaboration,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024, pp. 16 310–16 324
[26] Y. Zhang, R. Sun, Y. Chen, T. Pfister, R. Zhang, and S. Arik, “Chain of agents: Large language models collaborating on long-context tasks,” Advances in Neural Inf. Process. Syst., vol. 37, pp. 132 208–132 237, June 2024
[27] M. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine. IEEE Press, 1990, pp. 112–127
[28] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997
[30] C. Xiao et al., “Infllm: Training-free long-context extrapolation for llms with an efficient context memory,” May 2024. [Online]. Available: https://arxiv.org/abs/2402.04617
[31] P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou, “Long context compression with activation beacon,” Proc. Int. Conf. Learn. Representations, October 2024
[32] A. Mohtashami and M. Jaggi, “Landmark attention: Random-access infinite context length for transformers,” Advances in Neural Inf. Process. Syst., November 2023
[33] K. Xiong, X. Ding, Y. Cao, T. Liu, and B. Qin, “Examining inter-consistency of large language models collaboration: An in-depth analysis via debate,” in Proc. Conf. Empirical Methods in Natural Language Processing, Singapore, December 2023, pp. 7572–7590
[34] L. Wang et al., “Text embeddings by weakly-supervised contrastive pre-training,” December 2024. [Online]. Available: https://arxiv.org/abs/2212.03533
[35] S. Lin et al., “How to train your dragon: Diverse augmentation towards generalizable dense retrieval,” in Proc. Conf. Empirical Methods in Natural Language Processing, Singapore, December 2023, pp. 6385–6400
[36] H. Yang, K. Lin, T. Chen, and C. Chen, “Cross-batch reference learning for deep retrieval,” vol. 31, no. 9, pp. 3145–3158, September 2020
[37] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, September 2023
[38] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” Proc. Int. Conf. Learn. Representations, August 2021
[39] G. Chen, X. Li, Z. Meng, S. Liang, and L. Bing, “Clex: Continuous length extrapolation for large language models,” Proc. Int. Conf. Learn. Representations, October 2023
[40] H. Chen, R. Pasunuru, J. Weston, and A. Celikyilmaz, “Walking down the memory maze: Beyond context limit through interactive reading,” October 2023. [Online]. Available: https://arxiv.org/abs/2310.05029
[41] M. Wu, J. Xu, and L. Wang, “TransAgents: Build your translation company with language agents,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024, pp. 131–141
[42] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 28, no. 10, pp. 2222–2232, July 2017
[43] Y. Bai et al., “LongBench: A bilingual, multitask benchmark for long context understanding,” in Proc. Conf. Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, August 2024, pp. 3119–3137
[44] T. Kočiský et al., “The NarrativeQA reading comprehension challenge,” Proc. Conf. Association for Computational Linguistics, vol. 6, pp. 317–328, 2018
[45] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner, “A dataset of information-seeking questions and answers anchored in research papers,” in Proc. Conf. Association for Computational Linguistics, Online, June 2021, pp. 4599–4610
[46] B. Gliwa, I. Mochol, M. Biesek, and A. Wawer, “SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization,” in Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, November 2019, pp. 70–79
[47] A. Yang et al., “Qwen2.5 technical report,” December 2025. [Online]. Available: https://arxiv.org/abs/2412.15115
[48] M. Abdin, J. Aneja, and H. Awadalla, “Phi-3 technical report: A highly capable language model locally on your phone,” April 2024. [Online]. Available: https://arxiv.org/abs/2404.14219
[49] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “ERNIE: Enhanced language representation with informative entities,” in Proc. Conf. Association for Computational Linguistics, Florence, Italy, July 2019, pp. 1441–1451
[50] Z. Guo, L. Xia, Y. Yu, A. Tu, and C. Huang, “LightRAG: Simple and fast retrieval-augmented generation,” in Proc. Conf. Empirical Methods in Natural Language Processing, Suzhou, China, November 2025, pp. 10 746–10 761

[1] [1]

GPT-4 Technical Report

J. Achiamet al., “Gpt-4 technical report,” March 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Qwen Technical Report

J. Baiet al., “Qwen technical report,” September 2023. [Online]. Available: https://arxiv.org/abs/2309.16609

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Internlm2 technical report,

Z. Cai and others., “Internlm2 technical report,” March 2024

work page 2024

[4] [4]

Chatglm: A family of large language models from glm-130b to glm-4 all tools,

A. Zeng and other., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” June 2024

work page 2024

[5] [5]

The Llama 3 Herd of Models

A. Grattafioriet al., “The llama 3 herd of models,” November 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

How long can open-source llms truly promise on context length?

D. Li*, R. Shao*et al., “How long can open-source llms truly promise on context length?” June 2023. [Online]. Available: https://lmsys.org/blog/2023-06-29-longchat

work page 2023

[7] [7]

Minimax-01: Scaling foundation models with lightning attention,

A. Liet al., “Minimax-01: Scaling foundation models with lightning attention,” January 2025. [Online]. Available: https://arxiv.org/abs/2501. 08313

work page 2025

[8] [8]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zhenget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” June 2023

work page 2023

[9] [9]

Hotpotqa: A dataset for diverse, explainable multi- hop question answering,

Z. Yanget al., “Hotpotqa: A dataset for diverse, explainable multi- hop question answering,” inProc. Conf. Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November 2018, pp. 2369–2380

[10] [10]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa, “Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps,” in Proc. Conf. Association for Computational Linguistics, Barcelona, Spain, December 2020, pp. 6609–6625

[11] [11]

Musique: Multihop questions via single-hop question composition

H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Musique: Multihop questions via single-hop question composition,” Proc. Conf. Association for Computational Linguistics, vol. 10, pp. 539–554, May 2022

[12] [12]

QMSum: A new benchmark for query-based multi-domain meeting summarization

M. Zhong et al., “QMSum: A new benchmark for query-based multi-domain meeting summarization,” in Proc. Conf. Association for Computational Linguistics, June 2021, pp. 5905–5921

[13] [13]

Efficient attentions for long document summarization

L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang, “Efficient attentions for long document summarization,” in Proc. Conf. Association for Computational Linguistics, June 2021, pp. 1419–1436

[14] [14]

An exploratory study on long dialogue summarization: What works and what’s next

Y. Zhang et al., “An exploratory study on long dialogue summarization: What works and what’s next,” in Proc. Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, November 2021, pp. 4426–4433

[15] [15]

Leveraging bilateral correlations for multi-label few-shot learning

Y. An, H. Xue, X. Zhao, N. Xu, P. Fang, and X. Geng, “Leveraging bilateral correlations for multi-label few-shot learning,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 4, pp. 6816–6828, April 2025

[16] [16]

Few-shot learning with enhancements to data augmentation and feature extraction

Y. Zhang, M. Gong, J. Li, K. Feng, and M. Zhang, “Few-shot learning with enhancements to data augmentation and feature extraction,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 4, pp. 6655–6668, June 2025

[17] [17]

Comparative analysis of zero-shot, few-shot, and one-shot learning in natural language processing

M. R. Sudarshan and C. Kumbhar, “Comparative analysis of zero-shot, few-shot, and one-shot learning in natural language processing,” in 2024 International Conference on Augmented Reality, Intelligent Systems, and Industrial Automation (ARIIA), December 2024, pp. 1–6

[18] [19]

Retrieval-augmented generation for knowledge-intensive nlp tasks

P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Inf. Process. Syst., vol. 33, pp. 9459–9474, May 2021

[19] [20]

Gemini: A Family of Highly Capable Multimodal Models

S. Chen, S. Wong, L. Chen, and Y. Tian, “Extending context window of large language models via positional interpolation,” June 2025. [Online]. Available: https://arxiv.org/abs/2312.11805

[20] [21]

Roformer: Enhanced transformer with rotary position embedding

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, November 2024

[21] [22]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

T. Munkhdalai, M. Faruqui, and S. Gopal, “Leave no context behind: Efficient infinite context transformers with infini-attention,” August 2024. [Online]. Available: https://arxiv.org/abs/2404.07143

[23] [24]

2-d transformer: Extending large language models to long-context with few memory

X. He, J. Liu, and Y. Duan, “2-d transformer: Extending large language models to long-context with few memory,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 36, no. 8, pp. 15 294–15 308, March 2025

[24] [25]

LONGAGENT: Achieving question answering for 128k-token-long documents through multi-agent collaboration

J. Zhao et al., “LONGAGENT: Achieving question answering for 128k-token-long documents through multi-agent collaboration,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024, pp. 16 310–16 324

[25] [26]

Chain of agents: Large language models collaborating on long-context tasks

Y. Zhang, R. Sun, Y. Chen, T. Pfister, R. Zhang, and S. Arik, “Chain of agents: Large language models collaborating on long-context tasks,” Advances in Neural Inf. Process. Syst., vol. 37, pp. 132 208–132 237, June 2024

[26] [27]

Attractor dynamics and parallelism in a connectionist sequential machine

M. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine. IEEE Press, 1990, pp. 112–127

[27] [28]

Finding structure in time

J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990

[28] [29]

Long short-term memory

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

[29] [30]

Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory

C. Xiao et al., “Infllm: Training-free long-context extrapolation for llms with an efficient context memory,” May 2024. [Online]. Available: https://arxiv.org/abs/2402.04617

[30] [31]

Long context compression with activation beacon

P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou, “Long context compression with activation beacon,” Proc. Int. Conf. Learn. Representations, October 2024

[31] [32]

Landmark attention: Random-access infinite context length for transformers

A. Mohtashami and M. Jaggi, “Landmark attention: Random-access infinite context length for transformers,” Advances in Neural Inf. Process. Syst., November 2023

[32] [33]

Examining inter-consistency of large language models collaboration: An in-depth analysis via debate

K. Xiong, X. Ding, Y. Cao, T. Liu, and B. Qin, “Examining inter-consistency of large language models collaboration: An in-depth analysis via debate,” in Proc. Conf. Empirical Methods in Natural Language Processing, Singapore, December 2023, pp. 7572–7590

[33] [34]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

L. Wang et al., “Text embeddings by weakly-supervised contrastive pre-training,” December 2024. [Online]. Available: https://arxiv.org/abs/2212.03533

[34] [35]

How to train your dragon: Diverse augmentation towards generalizable dense retrieval

S. Lin et al., “How to train your dragon: Diverse augmentation towards generalizable dense retrieval,” in Proc. Conf. Empirical Methods in Natural Language Processing, Singapore, December 2023, pp. 6385–6400

[35] [36]

Cross-batch reference learning for deep retrieval

H. Yang, K. Lin, T. Chen, and C. Chen, “Cross-batch reference learning for deep retrieval,” vol. 31, no. 9, pp. 3145–3158, September 2020

[36] [37]

Exploring the limits of transfer learning with a unified text-to-text transformer

C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, September 2023

[37] [38]

Train short, test long: Attention with linear biases enables input length extrapolation

O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” Proc. Int. Conf. Learn. Representations, August 2021

[38] [39]

Clex: Continuous length extrapolation for large language models

G. Chen, X. Li, Z. Meng, S. Liang, and L. Bing, “Clex: Continuous length extrapolation for large language models,” Proc. Int. Conf. Learn. Representations, October 2023

[39] [40]

Walking down the memory maze: Beyond context limit through interactive reading

H. Chen, R. Pasunuru, J. Weston, and A. Celikyilmaz, “Walking down the memory maze: Beyond context limit through interactive reading,” October 2023. [Online]. Available: https://arxiv.org/abs/2310.05029

[40] [41]

TransAgents: Build your translation company with language agents

M. Wu, J. Xu, and L. Wang, “TransAgents: Build your translation company with language agents,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024, pp. 131–141

[41] [42]

Lstm: A search space odyssey

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” IEEE Trans. Neural Netw. & Learn. Syst., vol. 28, no. 10, pp. 2222–2232, July 2017

[42] [43]

LongBench: A bilingual, multitask benchmark for long context understanding

Y. Bai et al., “LongBench: A bilingual, multitask benchmark for long context understanding,” in Proc. Conf. Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, August 2024, pp. 3119–3137

[43] [44]

The NarrativeQA reading comprehension challenge

T. Kočiský et al., “The NarrativeQA reading comprehension challenge,” Proc. Conf. Association for Computational Linguistics, vol. 6, pp. 317–328, 2018

[44] [45]

A dataset of information-seeking questions and answers anchored in research papers

P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner, “A dataset of information-seeking questions and answers anchored in research papers,” in Proc. Conf. Association for Computational Linguistics, Online, June 2021, pp. 4599–4610

[45] [46]

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization

B. Gliwa, I. Mochol, M. Biesek, and A. Wawer, “SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization,” in Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, November 2019, pp. 70–79

[46] [47]

Qwen2.5 Technical Report

A. Yang et al., “Qwen2.5 technical report,” December 2025. [Online]. Available: https://arxiv.org/abs/2412.15115

[47] [48]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

M. Abdin, J. Aneja, and H. Awadalla, “Phi-3 technical report: A highly capable language model locally on your phone,” April 2024. [Online]. Available: https://arxiv.org/abs/2404.14219

[48] [49]

ERNIE: Enhanced language representation with informative entities

Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “ERNIE: Enhanced language representation with informative entities,” in Proc. Conf. Association for Computational Linguistics, Florence, Italy, July 2019, pp. 1441–1451.

[49] [50]

LightRAG: Simple and fast retrieval-augmented generation

Z. Guo, L. Xia, Y. Yu, A. Tu, and C. Huang, “LightRAG: Simple and fast retrieval-augmented generation,” in Proc. Conf. Empirical Methods in Natural Language Processing, Suzhou, China, November 2025, pp. 10 746–10 761
