RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gain on single-step tasks.
When can LLMs actually correct their own mistakes? A survey of self-correction
5 Pith papers cite this work. Polarity classification is still indexing.
years
2026 5verdicts
UNVERDICTED 5representative citing papers
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.
A two-rate measurement (correction c and corruption γ) for LLM protocol steps predicts accuracy changes from paired correctness bits and flags three failure modes including mixture shift on GSM8K.
citing papers explorer
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gain on single-step tasks.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
-
sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.
-
Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols
A two-rate measurement (correction c and corruption γ) for LLM protocol steps predicts accuracy changes from paired correctness bits and flags three failure modes including mixture shift on GSM8K.