hub

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt · 2021

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

browse 10 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

dataset 3

citation-polarity summary

use dataset 2 background 1

representative citing papers

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.

MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

cs.CL · 2025-05-30 · conditional · novelty 6.0

Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

cs.LG · 2024-07-31 · unverdicted · novelty 6.0

Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

cs.AI · 2026-05-20 · unverdicted · novelty 5.0

MindLoom synthesizes frontier-level reasoning data by decomposing solutions into thought mode chains, training a retrieval model for mode selection, composing new problems with distribution-aligned sampling, and applying rollout-based difficulty labeling for fine-tuning.

Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

cs.CL · 2025-09-20 · unverdicted · novelty 5.0

HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.

STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding

q-bio.BM · 2025-06-04 · unverdicted · novelty 5.0

STELLA aligns ESM3 bimodal sequence-structure encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong

cs.CL · 2025-01-16 · unverdicted · novelty 5.0

Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

cs.AI · 2025-01-16 · unverdicted · novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

citing papers explorer

Showing 10 of 10 citing papers.

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models cs.LG · 2026-05-15 · unverdicted · none · ref 35
CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System cs.AI · 2026-05-07 · unverdicted · none · ref 24 · 2 links
MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models cs.CL · 2025-05-30 · conditional · none · ref 25
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 28
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · accept · none · ref 279
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis cs.AI · 2026-05-20 · unverdicted · none · ref 15
MindLoom synthesizes frontier-level reasoning data by decomposing solutions into thought mode chains, training a retrieval model for mode selection, composing new problems with distribution-aligned sampling, and applying rollout-based difficulty labeling for fine-tuning.
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature cs.CL · 2025-09-20 · unverdicted · none · ref 10
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding q-bio.BM · 2025-06-04 · unverdicted · none · ref 33
STELLA aligns ESM3 bimodal sequence-structure encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong cs.CL · 2025-01-16 · unverdicted · none · ref 6
Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 50
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Measuring mathematical problem solving with the math dataset

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer