The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

Daniel Drucker; Kyle Mahowald

arXiv: 2605.03936 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-07 16:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords language models · conceptual analysis · counterexamples · philosophical reasoning · definition refinement · iterated processes · validity assessment · artificial intelligence

The pith

Language models can engage in philosophical conceptual analysis via counterexample generation and definition repair, though the process quickly reaches diminishing returns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether language models can replicate the core philosophical practice of defining concepts and improving those definitions in response to counterexamples. It does so by running repeated cycles in which one model proposes counterexamples to a definition and another model revises the definition accordingly, testing this on 20 concepts over thousands of cycles. The key results are that many generated counterexamples are invalid according to both human experts and an LM-based judge, but the LM judge accepts about twice as many as humans do, with moderate agreement on individual items. Longer chains of iteration tend to produce more verbose definitions without a corresponding gain in accuracy, and some concepts never reach a stable definition at all. This matters for understanding the limits of language models in performing sustained, iterative reasoning of the kind philosophers rely on.

Core claim

The central finding is that language models can take part in the counterexample game of conceptual analysis and repair, but the process does not sustain improvement over multiple iterations. Many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, and the LM judge accepts roughly twice as many as the humans do; even so, per-item validity judgments are moderately consistent across humans and between humans and the LM. Extended iteration makes definitions more verbose without improving their accuracy, and some concepts resist stable definitions in general.

What carries the argument

Iterated counterexample-repair chains in which one model generates counterexamples to a proposed definition and a second model repairs the definition, with the cycle repeating.
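The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `run_chain`, `toy_challenge`, `toy_repair`, and the scripted objections are all hypothetical names and data, and the two stub functions stand in for what would be model calls in the actual study.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: each callable stands in for one model call.
Challenger = Callable[[str, str], Optional[str]]  # (concept, definition) -> counterexample or None
Repairer = Callable[[str, str, str], str]         # (concept, definition, counterexample) -> new definition


def run_chain(concept: str, seed: str, challenge: Challenger, repair: Repairer,
              max_rounds: int) -> Tuple[List[str], List[str]]:
    """Run one counterexample-repair chain.

    Returns (definitions, counterexamples): definitions[0] is the seed,
    and counterexamples[i] is the challenge that produced definitions[i + 1].
    """
    definitions = [seed]
    counterexamples: List[str] = []
    for _ in range(max_rounds):
        ce = challenge(concept, definitions[-1])
        if ce is None:  # the challenger has nothing left to object to
            break
        counterexamples.append(ce)
        definitions.append(repair(concept, definitions[-1], ce))
    return definitions, counterexamples


# Toy stand-ins (assumptions for illustration only): a scripted challenger
# that objects until a clause appears, and a repairer that appends the clause.
SCRIPT = {"a five-year-old boy": "adult", "a celibate priest": "eligible to marry"}


def toy_challenge(concept: str, definition: str) -> Optional[str]:
    for case, missing_clause in SCRIPT.items():
        if missing_clause not in definition:
            return case
    return None


def toy_repair(concept: str, definition: str, counterexample: str) -> str:
    return f"{definition}, {SCRIPT[counterexample]}"


definitions, counterexamples = run_chain(
    "bachelor", "an unmarried man", toy_challenge, toy_repair, max_rounds=5
)
for step, definition in enumerate(definitions):
    print(step, definition)
```

Keeping the challenger and repairer as separate callables mirrors the two-role setup in the description, so the same loop could be run with one model in both roles or with different models in each.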

If this is right

LMs can engage in philosophical reasoning through this iterated process.
Many LM-generated counterexamples are invalid by human and LM standards, though LM judges accept more.
Validity judgments are moderately consistent across humans and between humans and LMs.
Extended iteration produces increasingly verbose definitions without accuracy improvements.
Some concepts resist stable definitions in general.
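One way to picture the "more verbose, no more accurate" pattern is to fit a straight line to each quantity across iterations and compare the slopes. The sketch below assumes definition length in words as the verbosity measure and counterexample validity rate as a stand-in for accuracy; both choices, the function name `ols_slope`, and every number in the toy series are invented for illustration and are not taken from the paper.

```python
from typing import Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys regressed on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy series (made up): definition length grows, validity rate stays flat.
iterations = [0, 1, 2, 3, 4]
word_counts = [12, 20, 31, 39, 50]
validity_rates = [0.30, 0.28, 0.31, 0.29, 0.30]

length_slope = ols_slope(iterations, word_counts)
validity_slope = ols_slope(iterations, validity_rates)
print(f"words per iteration: {length_slope:.2f}")
print(f"validity change per iteration: {validity_slope:.3f}")
```

A clearly positive length slope alongside a validity slope near zero is the shape the list above describes; a real analysis would also need uncertainty estimates, which this sketch omits.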

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be adapted as a test for whether future models can maintain coherent reasoning over many steps.
Philosophers might use LM assistance for initial definition drafts but would still need human validation for counterexamples.
The resistance of some concepts to definition suggests inherent vagueness that no amount of iteration resolves.
Similar loops could be tested in other domains like legal or scientific concept refinement.

Load-bearing premise

The twenty selected concepts adequately represent the range of philosophical concepts, and judgments of counterexample validity by humans and language models accurately capture true validity without hidden biases.

What would settle it

Demonstrating that for additional concepts or with different models, the iteration process leads to progressively more accurate definitions rather than merely longer ones, or that human and LM validity judgments show low agreement on most items.

Figures

Figures reproduced from arXiv: 2605.03936 by Daniel Drucker, Kyle Mahowald.

**Figure 1.** (A) The counterexample-repair loop, illustrated with two iterations of a game chain. A seed analysis 𝐴0 is challenged by 𝐶𝐸1; all five humans (H1–H5) and Opus accept it as valid. The repair 𝐴1 is challenged by 𝐶𝐸2; four of the five humans reject this CE while Opus and one human accept it. Quoted snippets demonstrate reasoning. (B) Counterexample validity over iterations for Opus self-play vs. mixed-model c…

**Figure 2.** Extended iteration over 50 rounds for our mixed-model setting.

**Figure 3.** Counterexample validity rate per concept across

**Figure 4.** Sub-concept presence across iterations for all 20 concepts, aggregated over 6 independent chains. Color intensity

Original abstract

Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LMs can run iterated counterexample-repair loops on definitions, but the loops produce longer definitions without accuracy gains, and some concepts never stabilize. The narrow 20-concept set and the lenient LM judgments weaken the broader claims.

The paper sets up language models to generate counterexamples to a definition and then repair it, repeating the cycle. The authors run this on 20 concepts over thousands of counterexample-repair cycles and compare LM validity judgments with those of human experts. The main findings are that the LM judge accepts roughly twice as many counterexamples as humans do, yet the two agree moderately on individual items; that longer iteration makes definitions more verbose without raising accuracy; and that some concepts never reach a stable definition.

This iterated game is a fresh way to probe whether models can sustain the kind of back-and-forth that philosophers practice. The data on diminishing returns and the verbosity pattern are straightforward empirical observations worth having. The moderate consistency between human and LM validity calls also gives a usable benchmark point, even if the LM is more permissive.

The soft spots start with the small and undescribed set of 20 concepts. If those concepts cluster in one area of philosophy, the claim that some concepts resist stable definitions does not generalize. Accuracy is measured by the same judges whose leniency difference is already noted, which creates a circularity risk for the no-improvement result. The abstract gives no details on concept selection, exact protocols, or how accuracy was scored, so the numbers are hard to assess without the full methods.

This is for researchers working on LM evaluation for abstract or philosophical tasks. Readers building reasoning benchmarks or testing model limits on conceptual work would find the setup useful to adapt. It deserves a serious referee because the experimental loop is original and the diminishing-returns observation is concrete enough to check, even if revisions are needed on scope and measurement details.

Referee Report

4 major / 2 minor

Summary. The paper examines whether language models can engage in iterated conceptual analysis by generating counterexamples to proposed definitions and repairing them in repeated cycles. Across 20 concepts and thousands of counterexample-repair iterations, it reports that many LM-generated counterexamples are judged invalid by both expert humans and an LM judge (with the LM judge accepting roughly twice as many), moderate consistency in per-item validity judgments across humans and between humans and the LM, that extended iterations yield increasingly verbose definitions without accuracy gains, and that some concepts resist stable definitions altogether. These results are taken to indicate that LMs can perform philosophical reasoning but that the counterexample-repair loop encounters diminishing returns.

Significance. If the empirical patterns hold under clarified protocols, the work supplies a concrete, scalable testbed for evaluating whether LMs can sustain high-level iterated philosophical reasoning, with direct relevance to AI reasoning benchmarks and conceptual robustness. The dual human/LM judgment design and scale of cycles are strengths that could support falsifiable claims about diminishing returns if methodological gaps are closed.

major comments (4)

[Methodology] Methodology (concept selection): The choice of the 20 concepts is not justified as a representative sample of philosophical concepts; without explicit sampling criteria or coverage across subfields (e.g., epistemology vs. metaphysics), the generalization that 'some concepts resist stable definitions in general' lacks support and risks over-extrapolation from a potentially narrow set.
[Results] Results (accuracy quantification): The claim of no accuracy improvement with iteration depends on how validity/accuracy is operationalized, yet the manuscript provides no details on the exact judgment protocol, inter-rater reliability statistics, or statistical tests for the 'no improvement' result; this is load-bearing for the diminishing-returns conclusion.
[Results] Results (LM judge leniency): The reported 2x higher acceptance rate by the LM judge versus humans is consistent with possible systematic bias or leniency; the manuscript should report exact agreement metrics (e.g., Cohen's kappa or percentage agreement per concept) and test whether this difference is statistically significant, as it directly affects interpretation of the consistency findings.
[Discussion] Discussion (circularity risk): Accuracy is assessed via the same human and LM judges used to generate the validity data; an independent validation measure (e.g., expert philosophers blind to the source or a separate gold-standard set) is needed to rule out circularity in the 'no accuracy improvement' and resistance-to-stability claims.
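The agreement metrics named in the third major comment can be made concrete with a small sketch. It computes percentage agreement and Cohen's kappa for two raters over the same items; the function names and the toy labels (1 = counterexample accepted as valid, 0 = rejected) are assumptions for illustration. The toy LM judge accepts twice as many items as the toy human rater only to echo the reported pattern, and none of the values come from the paper.

```python
from typing import List


def percent_agreement(a: List[int], b: List[int]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item set")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_observed = percent_agreement(a, b)
    n = len(a)
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_chance = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_chance == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels (made up): the human accepts 2 of 10, the LM judge accepts 4 of 10.
human = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
lm_judge = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

agreement = percent_agreement(human, lm_judge)
kappa = cohens_kappa(human, lm_judge)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

The example shows why both numbers are worth reporting: a judge can be markedly more lenient overall while still agreeing with humans on most individual items, and kappa discounts the agreement that the marginal rates alone would produce.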

minor comments (2)

[Experimental Setup] Clarify the exact prompt templates and temperature settings used for counterexample generation and repair in the experimental setup section to enable reproducibility.
[Results] Figure captions and tables reporting iteration counts should explicitly state the total number of cycles per concept and any filtering applied to invalid counterexamples.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify key areas for strengthening the methodological transparency and interpretive claims in our work. We respond to each major comment below and indicate the revisions we will make.

Referee: [Methodology] Methodology (concept selection): The choice of the 20 concepts is not justified as a representative sample of philosophical concepts; without explicit sampling criteria or coverage across subfields (e.g., epistemology vs. metaphysics), the generalization that 'some concepts resist stable definitions in general' lacks support and risks over-extrapolation from a potentially narrow set.

Authors: We selected the 20 concepts to span multiple subfields (epistemology, metaphysics, ethics, and philosophy of mind) using common examples from the philosophical literature, but the manuscript does not articulate explicit sampling criteria or claim statistical representativeness. We will add a Methods subsection detailing the selection rationale and diversity across subfields. We will also revise all generalization statements to refer specifically to the concepts examined rather than implying broader applicability, thereby avoiding over-extrapolation.

revision: yes

Referee: [Results] Results (accuracy quantification): The claim of no accuracy improvement with iteration depends on how validity/accuracy is operationalized, yet the manuscript provides no details on the exact judgment protocol, inter-rater reliability statistics, or statistical tests for the 'no improvement' result; this is load-bearing for the diminishing-returns conclusion.

Authors: The current manuscript describes the judgment process at a summary level. In the revision we will expand the Methods and Results sections to specify the full human judgment protocol (including instructions and validity criteria), report inter-rater reliability statistics (e.g., Fleiss' kappa among expert humans), and present statistical tests (e.g., regression of validity scores on iteration count) supporting the absence of accuracy gains. These additions will directly bolster the diminishing-returns claim.

revision: yes

Referee: [Results] Results (LM judge leniency): The reported 2x higher acceptance rate by the LM judge versus humans is consistent with possible systematic bias or leniency; the manuscript should report exact agreement metrics (e.g., Cohen's kappa or percentage agreement per concept) and test whether this difference is statistically significant, as it directly affects interpretation of the consistency findings.

Authors: We will add the requested metrics. The revised manuscript will report Cohen's kappa and percentage agreement between the LM judge and human judges, both overall and broken down by concept. We will also include a statistical test (e.g., McNemar's test) for the significance of the acceptance-rate difference. These details will allow clearer evaluation of any leniency and its bearing on the consistency results.

revision: yes

Referee: [Discussion] Discussion (circularity risk): Accuracy is assessed via the same human and LM judges used to generate the validity data; an independent validation measure (e.g., expert philosophers blind to the source or a separate gold-standard set) is needed to rule out circularity in the 'no accuracy improvement' and resistance-to-stability claims.

Authors: The human judges were independent experts uninvolved in definition generation or counterexample creation, and the LM judge operated under a distinct prompting regime. We nevertheless recognize the potential circularity concern. In the revised Discussion we will explicitly address this issue, clarify the independence of the human evaluations, acknowledge the lack of a fully blind external gold-standard set as a limitation, and propose it as a direction for future work.

revision: partial
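The rebuttal mentions McNemar's test as one option for checking whether the LM judge's higher acceptance rate is significant. The sketch below uses the exact (binomial) form of that test on paired accept/reject labels, which looks only at the items where the two judges disagree. The function name `mcnemar_exact` and the toy labels are assumptions for illustration; the numbers are invented and do not reproduce any result from the paper.

```python
from math import comb
from typing import List, Tuple


def mcnemar_exact(human: List[int], lm: List[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired binary labels.

    Returns (b, c, p_value), where b counts items the human rejected but the
    LM accepted, and c counts items the human accepted but the LM rejected.
    """
    if len(human) != len(lm):
        raise ValueError("both judges must label the same items")
    b = sum(1 for h, m in zip(human, lm) if h == 0 and m == 1)
    c = sum(1 for h, m in zip(human, lm) if h == 1 and m == 0)
    n = b + c
    if n == 0:  # no discordant pairs: no evidence of a difference
        return b, c, 1.0
    # Under the null, discordant pairs split 50/50 between the two directions.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy paired labels (made up): 1 = counterexample accepted as valid.
human_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lm_labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

b, c, p_value = mcnemar_exact(human_labels, lm_labels)
print(f"LM-only accepts={b}, human-only accepts={c}, p={p_value:.5f}")
```

Because the same counterexamples are judged by both sides, a paired test like this is a better fit than comparing two independent acceptance rates. With several human raters, the labels would first need to be reduced to one value per item (for example by majority vote), which is a further assumption.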

Circularity Check

0 steps flagged

No significant circularity: purely empirical study with no derivations or self-referential fitting.

The paper conducts an empirical investigation of LM performance on iterated counterexample generation and definition repair across 20 concepts. All claims rest on observed experimental outcomes, including validity judgments from human experts and an LM judge, plus measurements of definition verbosity and accuracy over iterations. No equations, fitted parameters, or derivations appear; results are externally benchmarked against human judgments rather than internal model parameters or self-citations. The moderate consistency between judges and the finding of diminishing returns are direct data observations, not reductions to prior inputs by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that human expert judgments provide a reliable benchmark for counterexample validity and that the selected concepts allow generalizable conclusions about LM philosophical reasoning; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Human expert judgments serve as a reliable benchmark for assessing the validity of LM-generated counterexamples.
The paper uses human judgments to evaluate LM outputs and to compare against the LM judge.
