CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text
5 Pith papers cite this work.
representative citing papers
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.
ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.
citing papers explorer
- The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling
The Complexity Ceiling Benchmark demonstrates geometric per-step decay in LLM sequential reasoning with domain-specific performance ceilings and introduces a trace metric showing incorrect intermediate steps in some correct final answers.
- DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
- Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.