Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.
URL https://www.nature.com/articles/ s41586-024-07930-y
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
GPT produces click distributions significantly different from real humans in 53% of UX first-click tasks, with prompting techniques like personas and chain-of-thought failing to improve alignment.
Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.
The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples, and preserves the result as an inspectable artefact.
citing papers explorer
-
Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.
-
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.
-
Consistency Training while Mitigating Obfuscation via Rate Matching
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
-
What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience
GPT produces click distributions significantly different from real humans in 53% of UX first-click tasks, with prompting techniques like personas and chain-of-thought failing to improve alignment.
-
Predicting Performance of Symbolic and Prompt Programs with Examples
Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.
-
A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples, and preserves the result as an inspectable artefact.