Canonical reference

ARC Prize 2024: Technical Report

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers · 2024 · arXiv 2412.04604

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

Factorization Regret mediates compositional generalization in latent space

cs.LG · 2026-03-28 · unverdicted · novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

cs.AI · 2025-10-09 · unverdicted · novelty 6.0

Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing GPT-4.1 and estimated human performance.

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

cs.AI · 2025-09-27 · unverdicted · novelty 6.0

LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

cs.AI · 2025-05-17 · unverdicted · novelty 6.0

ARC-AGI-2 adds a larger, more complex set of tasks to the original ARC-AGI benchmark to give finer-grained measurement of fluid intelligence in AI.

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

The authors propose creating data probes—synthetic sequences from defined random processes—to reveal how data properties drive LLM behavior across workflow stages.

Beyond Tools and Persons: Who Are They? Classifying Robots and AI Agents for Proportional Governance

cs.ET · 2026-04-07 · unverdicted · novelty 5.0

A CPST-based taxonomy sorts autonomous systems into Confined Actors, Socially-Aware Interactors, and CPST-Integrated Agents to enable proportional governance from enhanced liability to qualified personhood.

Hierarchical Reasoning Model

cs.AI · 2025-06-26 · unverdicted · novelty 5.0

HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

OpenCompass: A Universal Evaluation Platform for Large Language Models

cs.CL · 2026-05-19 · conditional · novelty 4.0

OpenCompass is a modular, high-concurrency platform for unified LLM evaluation across knowledge, reasoning, code, and other domains with support for rule-based, LLM-as-judge, and cascaded evaluators.

Customizing an LLM for Enterprise Software Engineering

cs.SE · 2026-05-15 · unverdicted · novelty 4.0 · 2 refs

Gemini for Google, customized via continued pre-training on proprietary Google engineering data, delivers measurable productivity gains in a large internal developer study.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

citing papers explorer

Showing 16 of 16 citing papers.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · none · ref 11

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · none · ref 8

Factorization Regret mediates compositional generalization in latent space

cs.LG · 2026-03-28 · unverdicted · none · ref 19

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · none · ref 12 · 3 links

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · none · ref 162

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

cs.AI · 2025-10-09 · unverdicted · none · ref 5

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

cs.AI · 2025-09-27 · unverdicted · none · ref 10

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

cs.AI · 2025-05-17 · unverdicted · none · ref 10

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

cs.AI · 2026-05-11 · unverdicted · none · ref 5

Beyond Tools and Persons: Who Are They? Classifying Robots and AI Agents for Proportional Governance

cs.ET · 2026-04-07 · unverdicted · none · ref 11

Hierarchical Reasoning Model

cs.AI · 2025-06-26 · unverdicted · none · ref 29

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 135

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · none · ref 13

OpenCompass: A Universal Evaluation Platform for Large Language Models

cs.CL · 2026-05-19 · conditional · none · ref 2

Customizing an LLM for Enterprise Software Engineering

cs.SE · 2026-05-15 · unverdicted · none · ref 5 · 2 links

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · none · ref 159
