Cognitive load limits in large language models: Benchmarking multi-hop reasoning
2 Pith papers cite this work.
Citing papers by year: 2026 (2)
Citing papers
- WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
- When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Iterative RAG outperforms Gold Context RAG by up to 25.6 points on ChemKGMultiHopQA across 11 LLMs, mainly by staging retrieval to avoid context overload and correct hypothesis drift.