The Complexity Ceiling Benchmark demonstrates geometric per-step decay in LLM sequential reasoning with domain-specific performance ceilings and introduces a trace metric showing incorrect intermediate steps in some correct final answers.
CLUTRR: A diagnostic benchmark for inductive reasoning from text
5 Pith papers cite this work.
representative citing papers
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after fine-tuning on the task.
ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.