pith. sign in

Not all llm reasoners are created equal.arXiv preprint arXiv:2410.01748

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

citation-role summary

background 2

citation-polarity summary

years

2026 3 2025 1

verdicts

UNVERDICTED 4

roles

background 2

polarities

background 1 unclear 1

representative citing papers

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

citing papers explorer

Showing 4 of 4 citing papers.

  • LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning cs.AI · 2026-03-06 · unverdicted · none · ref 2

    LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.

  • Agentic Frameworks for Reasoning Tasks: An Empirical Study cs.AI · 2026-04-17 · unverdicted · none · ref 62

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  • How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence cs.CL · 2026-03-18 · unverdicted · none · ref 35

    AI's compositional reasoning failures originate in psychological learning paradigms that shaped its architectures, and the ReSynth trimodular framework is proposed to embed systematicity structurally.

  • Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 26

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.