Counterexample Guided Learning in the Large using Reasoning Agents

Adithya Murali; Frederic Sala; Hongyi Liu; Thomas Reps

arxiv: 2606.11521 · v1 · pith:3CU5Z5SWnew · submitted 2026-06-09 · 💻 cs.LG

Counterexample Guided Learning in the Large using Reasoning Agents

Hongyi Liu , Frederic Sala , Thomas Reps , Adithya Murali This is my paper

Pith reviewed 2026-06-27 13:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords regex inductioncounterexample-guided learningLLM agentsprogram synthesisverifier feedbacksample efficiencyformal reasoningagentic refinement

0 comments

The pith

Verifier counterexamples let LLMs learn complex regular expressions with far higher success and fewer examples than standard prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can improve at the classical task of inducing regular expressions when given precise feedback in the form of counterexamples that distinguish wrong candidates from the target language. It introduces refinement strategies such as regularization and symbolic clustering of counterexamples, together with agentic loops for reflection and repair. These mechanisms allow the model to iteratively correct proposals using the verifier's signals. On the most difficult task groups the approach raises success rates from 3.2 percent to 38.1 percent and from 38.9 percent to 74.1 percent across two regex domains while lowering the number of labeled strings required. The results indicate that LLMs can treat rich, structured feedback as more than additional data for symbolic induction tasks.

Core claim

In counterexample-guided learning a learner LLM proposes candidate regular expressions from positive and negative strings while the verifier returns counterexamples that expose mismatches with the target language; novel strategies of regularization, symbolic counterexample clustering, and agentic reflection-repair loops turn this feedback into effective refinements, producing substantially higher success on hard induction problems where baseline prompting fails.

What carries the argument

Counterexample-guided refinement strategies (regularization and symbolic clustering of counterexamples) combined with agentic reflection and repair loops that turn verifier signals into iterative candidate improvements.

If this is right

Structured verifier feedback can raise sample efficiency in inductive learning tasks that standard prompting cannot solve.
Agentic loops for reflection and repair become practical when paired with domain-specific counterexample signals.
The same feedback pattern may support verifier-guided synthesis of other formal artifacts such as programs or logical formulas.
Complex target expressions become reachable once counterexamples are used to focus the search rather than treated as extra training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on induction problems in other formal languages where counterexamples are computable, such as context-free grammars or simple temporal logics.
If the feedback mechanism generalizes, it may lower the data volume needed to reach reliable performance on symbolic reasoning benchmarks.
Combining the loop with external symbolic solvers for candidate validation could further reduce hallucinations in the refinement step.

Load-bearing premise

The LLM can reliably fold the verifier's counterexamples into better proposals without systematically ignoring the feedback or introducing new errors.

What would settle it

Run the same hard regex tasks with the counterexample-guided strategies disabled versus enabled and observe whether success rates and required example counts remain statistically indistinguishable from the standard-prompting baseline.

Figures

Figures reproduced from arXiv: 2606.11521 by Adithya Murali, Frederic Sala, Hongyi Liu, Thomas Reps.

**Figure 2.** Figure 2: Overview of our iterative counterexample-guided agentic refinement framework with a teacher and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Pareto frontiers of the main results on Simple and Extended Regex. Points at the same [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Success rate heatmaps across regex complexity for main Simple Regex experiments. Rows correspond [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Success rate heatmaps across regex complexity for main Extended Regex experiments. Rows [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Evaluation accuracy of hypothesis selected under sample budget on each [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Evaluation accuracy of hypothesis selected under sample budget on each [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Pareto frontiers across full ablation settings on Simple and Extended Regex Datasets. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

read the original abstract

LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows concrete gains from counterexample feedback on LLM regex induction, but the gains rest on an unverified assumption that the models actually use the feedback.

read the letter

The core result is that adding verifier counterexamples, plus regularization and clustering, lifts success rates on hard regex tasks from low single digits to the 30-70% range. That is the thing to take away.

What is new is the specific mix: they treat regex induction as a testbed where feedback is exact, then layer in symbolic clustering of counterexamples and agentic repair loops on top of standard prompting. The setup lets them measure sample efficiency directly against a classical symbolic learning baseline.

The work is clean on the problem choice. Regex induction gives controllable positive/negative examples and a verifier that returns precise witnesses, which is better than vague natural-language feedback. The reported lifts on the hardest task groups are large enough to be worth checking.

The soft spot is exactly the one the stress-test flags. The numbers only move if the LLM conditions its next guess on the returned counterexamples rather than ignoring them or hallucinating fixes. The abstract gives no per-trace adherence stats, no breakdown of how often the model discards the witness, and no controls for extra interaction rounds. Without those, the improvement could be an artifact of more turns instead of genuine use of the signal.

Methods details are also thin in the abstract: no mention of statistical tests, data splits, or whether tasks were post-selected. Full text would need to supply those before the claim is solid.

This is for people working on verifier-guided LLM agents for synthesis or formal methods. A reader already thinking about counterexample loops will get the most out of it.

It deserves a serious referee. The testbed is useful and the empirical direction is worth checking, even if the current write-up leaves the key assumption under-supported.

Referee Report

2 major / 2 minor

Summary. The paper claims that counterexample-guided refinement strategies (regularization, symbolic clustering) and agentic loops (reflection, repair) allow LLMs to perform regex induction more effectively than standard prompting. Verifier feedback is said to improve sample efficiency and enable learning of complex expressions, with reported success-rate gains on the hardest task groups from 3.2% to 38.1% and from 38.9% to 74.1% across two regex domains.

Significance. If the empirical results hold under controlled conditions, the work provides concrete evidence that LLMs can condition on structured verifier feedback (counterexamples) rather than treating it as additional data, with potential implications for verifier-guided program synthesis and formal reasoning tasks.

major comments (2)

[Abstract / Experiments] Abstract and experimental sections: the manuscript reports quantitative gains (3.2% → 38.1%, 38.9% → 74.1%) but supplies no details on experimental controls, statistical significance, data splits, number of runs, or whether the percentages reflect post-hoc task selection; these omissions make it impossible to assess whether the central empirical claim is supported.
[Refinement Strategies / Agentic Strategies] Refinement strategies and agentic loops sections: the headline claim presupposes that the LLM reliably conditions its next candidate on the verifier’s returned counterexamples. No per-trace adherence statistics, failure-mode breakdowns, or ablation isolating feedback use versus extra interaction rounds are supplied, leaving open the possibility that measured lifts are artifacts of additional rounds rather than genuine use of the counterexample signal.

minor comments (2)

[Evaluation Metrics] Clarify whether the reported success rates are per-task or aggregated, and define the exact criteria used to declare a regex “learned.”
[Introduction] The abstract states that feedback is “heterogeneous, domain-specific, and difficult to control”; the regex setting is presented as a controlled testbed, but the paper does not discuss how the chosen domains generalize beyond regex induction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting these important gaps in experimental reporting and analysis. We agree that additional details and ablations are needed to substantiate the claims and will revise the manuscript accordingly. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and experimental sections: the manuscript reports quantitative gains (3.2% → 38.1%, 38.9% → 74.1%) but supplies no details on experimental controls, statistical significance, data splits, number of runs, or whether the percentages reflect post-hoc task selection; these omissions make it impossible to assess whether the central empirical claim is supported.

Authors: We acknowledge the omission of these controls in the submitted manuscript. In the revision we will add a new subsection 'Experimental Protocol' that specifies: (i) 5 independent runs per method with distinct random seeds for temperature sampling, (ii) bootstrap 95% confidence intervals and paired Wilcoxon tests on success rates, (iii) the exact procedure for generating and splitting positive/negative example sets (fixed complexity-based generation with held-out test strings), and (iv) that the two 'hardest task groups' were defined a priori by regex operator count and alphabet size before any model runs, with no post-hoc filtering. These additions will make the reported gains reproducible and statistically grounded. revision: yes
Referee: [Refinement Strategies / Agentic Strategies] Refinement strategies and agentic loops sections: the headline claim presupposes that the LLM reliably conditions its next candidate on the verifier’s returned counterexamples. No per-trace adherence statistics, failure-mode breakdowns, or ablation isolating feedback use versus extra interaction rounds are supplied, leaving open the possibility that measured lifts are artifacts of additional rounds rather than genuine use of the counterexample signal.

Authors: We agree this analysis is missing and constitutes a genuine limitation of the current draft. The revision will include: (1) per-trace adherence rates computed automatically by checking whether each new candidate accepts/rejects the exact counterexamples returned in that round, (2) a qualitative failure-mode taxonomy (ignoring feedback, producing invalid syntax, failing to generalize beyond the counterexample) with counts from 200 sampled traces, and (3) a controlled ablation that keeps the identical number of interaction rounds but replaces counterexample feedback with a generic 'the candidate is incorrect; try again' message. The ablation results will be reported alongside the main tables to isolate the contribution of the structured counterexample signal. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential predictions

full rationale

The paper reports measured success rates (e.g., 3.2% to 38.1%) from LLM-based regex induction experiments using counterexample feedback. These are direct experimental outcomes, not quantities derived from equations, fitted parameters, or self-citations that reduce to the inputs by construction. No mathematical derivation chain, uniqueness theorems, or ansatzes are present; the work consists of described strategies and empirical evaluation on regex domains. The central claims rest on observed performance lifts rather than any self-definitional or load-bearing self-citation structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated premise that the verifier produces counterexamples that are both correct and usable by the LLM.

pith-pipeline@v0.9.1-grok · 5773 in / 1049 out tokens · 16609 ms · 2026-06-27T13:26:45.371785+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 2 canonical work pages

[1]

Pranav Garg, Christof Löding, P

doi: 10.1142/S0218194020400203. Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. Ice: A robust framework for learning invariants. InComputer Aided Verification, volume 8559 ofLecture Notes in Computer Science, pages 69–87. Springer, 2014. doi: 10.1007/978-3-319-08867-9_5. 11 Pranav Garg, Daniel Neider, P. Madhusudan, and Dan Roth. Learning ...

work page doi:10.1142/s0218194020400203 2014
[2]

doi: 10.1016/j.neucom

URLhttps://arxiv.org/abs/2505.13445. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URLh...

work page doi:10.1016/j.neucom 2023
[3]

GPT” OpenAI [2026] and “Qwen

(also “GPT” OpenAI [2026] and “Qwen” Team [2025] models for model ablation). For the main experiments, we perform three trials per regex for robustness and reliability. And from RQ3 to RQ5 in Section 5, we perform only one trial per regex for relevant experiments. For all the experiments, we set the maximum context-window length to be65536, the maximum ou...

2026
[4]

Check start/end constraints
[5]

Look for block structure and repetition
[7]

Prefer compact factorizations
[8]

Handle epsilon only when supported
[10]

Training Data (Each line has one input-output pair separated by comma):

Verify all positives and negatives before finalizing. {regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>. - Then output ONLY the final regex wrapped in <ans>...</ans>. Training Data (Each line has one input-output pair separated by comma): {0} Base Prompt: Extended R...
[11]

Check prefix and suffix constraints
[12]

Look for repeated substrings and block structure
[13]

Choose between star-of-union and union-of-stars
[14]

Factor common prefixes/suffixes and use character classes when appropriate
[15]

Handle epsilon only when required. 20
[16]

Avoid over-generalization
[17]

{regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>

Verify positives, negatives, and boundary cases. {regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>. - Then output ONLY the final regex wrapped in <ans>...</ans>. Training Data (Each line has one input-output pair separated by comma): {0} Regularization Blocks. Simpl...
[18]

Produce a regex that compiles under the required regex syntax
[19]

Fix the specific mistakes exposed by the counterexamples, disagreement witness, and any reported errors
[20]

- If the previous regex is invalid, first correct the syntax

Preserve the parts of the previous solution that still fit the training data. - If the previous regex is invalid, first correct the syntax. - If a witness or repair example is handled incorrectly, revise the regex so that string gets the correct label. - Keep the reasoning concise and focused on what changed. - Return the repaired regex in the required ou...

[1] [1]

Pranav Garg, Christof Löding, P

doi: 10.1142/S0218194020400203. Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. Ice: A robust framework for learning invariants. InComputer Aided Verification, volume 8559 ofLecture Notes in Computer Science, pages 69–87. Springer, 2014. doi: 10.1007/978-3-319-08867-9_5. 11 Pranav Garg, Daniel Neider, P. Madhusudan, and Dan Roth. Learning ...

work page doi:10.1142/s0218194020400203 2014

[2] [2]

doi: 10.1016/j.neucom

URLhttps://arxiv.org/abs/2505.13445. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URLh...

work page doi:10.1016/j.neucom 2023

[3] [3]

GPT” OpenAI [2026] and “Qwen

(also “GPT” OpenAI [2026] and “Qwen” Team [2025] models for model ablation). For the main experiments, we perform three trials per regex for robustness and reliability. And from RQ3 to RQ5 in Section 5, we perform only one trial per regex for relevant experiments. For all the experiments, we set the maximum context-window length to be65536, the maximum ou...

2026

[4] [4]

Check start/end constraints

[5] [5]

Look for block structure and repetition

[6] [7]

Prefer compact factorizations

[7] [8]

Handle epsilon only when supported

[8] [10]

Training Data (Each line has one input-output pair separated by comma):

Verify all positives and negatives before finalizing. {regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>. - Then output ONLY the final regex wrapped in <ans>...</ans>. Training Data (Each line has one input-output pair separated by comma): {0} Base Prompt: Extended R...

[9] [11]

Check prefix and suffix constraints

[10] [12]

Look for repeated substrings and block structure

[11] [13]

Choose between star-of-union and union-of-stars

[12] [14]

Factor common prefixes/suffixes and use character classes when appropriate

[13] [15]

Handle epsilon only when required. 20

[14] [16]

Avoid over-generalization

[15] [17]

{regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>

Verify positives, negatives, and boundary cases. {regularization_instr} {agentic_reflection_instr} OUTPUT FORMAT - First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>. - Then output ONLY the final regex wrapped in <ans>...</ans>. Training Data (Each line has one input-output pair separated by comma): {0} Regularization Blocks. Simpl...

[16] [18]

Produce a regex that compiles under the required regex syntax

[17] [19]

Fix the specific mistakes exposed by the counterexamples, disagreement witness, and any reported errors

[18] [20]

- If the previous regex is invalid, first correct the syntax

Preserve the parts of the previous solution that still fit the training data. - If the previous regex is invalid, first correct the syntax. - If a witness or repair example is handled incorrectly, revise the regex so that string gets the correct label. - Keep the reasoning concise and focused on what changed. - Return the repaired regex in the required ou...