Cognitive load limits in large language models: Benchmarking multi-hop reasoning

Sai Teja Reddy Adapala · 2025 · arXiv 2509.19517

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

cs.AI · 2026-03-28 · unverdicted · novelty 7.0

WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

cs.CL · 2026-01-27 · conditional · novelty 7.0

Iterative RAG outperforms Gold Context RAG by up to 25.6 points on ChemKGMultiHopQA across 11 LLMs, mainly by staging retrieval to avoid context overload and correct hypothesis drift.

citing papers explorer

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking cs.AI · 2026-03-28 · unverdicted · none · ref 42
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering cs.CL · 2026-01-27 · conditional · none · ref 1
