Measuring Mathematical Problem Solving With the MATH Dataset
10 Pith papers cite this work.
roles: dataset (3)
citing papers explorer
- CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models
  CTF4Nuclear proposes a common task framework for benchmarking ML methods on nuclear engineering datasets using 12 metrics and a new sparse-measurement system monitoring paradigm.
- MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
  MAS-Algorithm is a multi-agent workflow that improves AI acceptance rates on algorithmic problems by 6.48% on average, outperforming parameter-efficient fine-tuning.
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
  Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- Lessons from the Trenches on Reproducible Evaluation of Language Models
  The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
- MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
  MindLoom synthesizes frontier-level reasoning data by decomposing solutions into thought mode chains, training a retrieval model for mode selection, composing new problems with distribution-aligned sampling, and applying rollout-based difficulty labeling for fine-tuning.
- Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature
  HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
- STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding
  STELLA aligns ESM3 bimodal sequence-structure encodings with Llama-3.1-8B text modeling to claim state-of-the-art results on protein functional description prediction and enzyme-catalyzed reaction prediction.
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong
  Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
  The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.