ProcBench: Benchmark for multi-step reasoning and following procedure. arXiv preprint, abs/2410.03117, 2024

Fujisawa, I · 2024 · arXiv 2410.03117

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

cs.SE · 2026-05-18 · unverdicted · novelty 7.0

ProcCtrlBench introduces an ontology of 11 defect types across 4 categories plus control preservation metrics to evaluate LLM coding agent trajectories on 200 cases from AndroidBench, TerminalBench, and SWE-bench-Verified.

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

A continual training recipe upcycles a dense Qwen2.5-8B LLM into a 4x channel-sparse model via predictor-gated bank-wise sparsity in the SwiGLU FFN, with a single-layer repair for a long-context failure on RULER-CWE.

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

cs.AI · 2026-06-11 · unverdicted · novelty 4.0

Strict generation directly from Task-Method-Knowledge (TMK) models yields 96.5% grounded and 92.6% usable QA pairs across 23 topics, outperforming transcript-first and TMK-aware alternatives on representational grounding.
