A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma; Rongrong Ji; Tianyu Xie; Xiawu Zheng; Yongkang Xie; Yuexiao Ma

arxiv: 2605.17278 · v1 · pith:CLY6E754new · submitted 2026-05-17 · 💻 cs.AI · cs.LG

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma , Yuexiao Ma , Yongkang Xie , Tianyu Xie , Xiawu Zheng , Rongrong Ji This is my paper

Pith reviewed 2026-05-20 13:19 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords abstract reasoningbenchmark generationlarge language modelscycle consistencyprogrammatic verificationautomatic evaluationhallucination mitigation

0 comments

The pith

Cycle consistency between forward and inverse operations guarantees a unique solution for each automatically generated abstract reasoning task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents A2RBench, an automated pipeline that generates diverse abstract reasoning tasks for LLMs, expands them by reusing rules, and verifies them to avoid hallucinations. The core advance is a theoretical proof that checking whether an inverse operation exactly undoes the forward operation ensures each task admits only one valid solution. This programmatic check replaces manual annotation and allows benchmarks to scale while remaining formally verifiable. Evaluations using the resulting tasks show top LLMs reach just 39.8 percent accuracy against a human baseline of 68.5 percent, with even steeper gaps on three-dimensional problems. The work also reports that higher input complexity can sometimes reduce the reasoning steps required.

Core claim

The authors establish a theoretical framework proving that programmatic verification of cycle consistency—checking if the inverse operation perfectly reverses the forward operation—guarantees a unique solution for LLM-generated abstract reasoning tasks, enabling an automatic and verifiable benchmark generation pipeline called A2RBench.

What carries the argument

The cycle-consistency verification mechanism, which programmatically confirms that applying the inverse operation to the output of the forward operation recovers the original input exactly.

If this is right

Benchmark creation becomes scalable without relying on expensive human annotation.
Top LLMs reach only 39.8 percent accuracy on representative tasks while humans reach 68.5 percent.
Models exhibit greater difficulty with three-dimensional tasks than with two- or one-dimensional ones.
Higher information complexity in inputs can counterintuitively reduce the number of reasoning steps needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification approach could be tested for generating reliable tasks in adjacent areas such as planning or mathematical proof construction.
Benchmarks built this way could be refreshed continuously as models improve, staying ahead of saturation.
The observed link between input complexity and simpler reasoning suggests experiments that deliberately vary density to improve model generalization.

Load-bearing premise

The cycle-consistency check can be implemented programmatically on LLM-generated tasks without the verification step itself introducing undetected errors or allowing tasks that still permit multiple valid solutions in practice.

What would settle it

A collection of tasks that pass the cycle-consistency test yet still admit more than one distinct valid solution or contain undetected hallucinations would show the uniqueness guarantee does not hold.

Figures

Figures reproduced from arXiv: 2605.17278 by Qingchuan Ma, Rongrong Ji, Tianyu Xie, Xiawu Zheng, Yongkang Xie, Yuexiao Ma.

**Figure 1.** Figure 1: The A2RBench automated pipeline illustrated with an example (rule: “index permutation via modular arithmetic”). The rule permutes positions: given input length n, find smallest integer k where 2 ≤ k ≤ n − 1 and gcd(k, n) = 1, then map position i to (i × k) mod n. With n = 14, smallest coprime is k = 3, mapping 0 → 0, 1 → 3, 2 → 6, etc. Stage 1 (Seed Generation): Author model generates rule description vali… view at source ↗

**Figure 2.** Figure 2: A Spectrum of Cognitive Outcomes across LLMs. This stacked bar chart categorizes each model’s performance into a six-part spectrum. A warm color gradient (red to amber) denotes different types of failures, from fundamental breakdowns to execution errors. A cool color gradient (light green to dark green) distinguishes between three qualities of success: Surface Fitting, Inferior Rule, and True Generalizatio… view at source ↗

**Figure 3.** Figure 3: Model Performance Across Logical Dimensions. Accuracy of LLMs on 1D, 2D, and 3D abstract reasoning tasks. A consistent performance dip on 2D tasks (red bars) is observable across most models. can simplify the reasoning process. 5.1 Reasoning Limitations Current LLMs show significant deficiencies in abstract reasoning. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: The Augmentation Paradox. A multi-axis plot correlating Information Complexity (blue) and Model Accuracy (red) across problem variations. using compression ratio—higher ratios indicate more structured, less compressible inputs. The figure reveals a striking correlation: as input complexity peaks, model accuracy rises sharply. The pattern is clearest at variation V4, where input complexity peaks at 4.29 yet… view at source ↗

read the original abstract

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate it, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)--guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is an LLM pipeline for auto-generating and scaling abstract reasoning benchmarks, backed by a claim that cycle consistency proves unique solutions, but the supporting details are thin.

read the letter

The punchline is that A2RBench tries to automate benchmark creation for abstract reasoning by having LLMs generate tasks, expand them with validated rules, and apply a programmatic cycle-consistency check to ensure each task has only one valid solution. This targets the usual problems of manual cost and memorization in existing tests. If the verification step works as described, it could let people produce larger sets of non-memorizable items without constant human oversight. The reported results on current LLMs, including weaker performance on 3D tasks and the note that higher information complexity sometimes simplifies reasoning, give some concrete data points worth looking at. The pipeline itself is a practical attempt to combine generation with a formal-sounding filter, and the focus on high-dimensional cases is a reasonable extension beyond typical 2D puzzles. The soft spots sit mainly in the theoretical part. The abstract states that cycle consistency guarantees uniqueness but gives no derivation steps or checks for cases where the inverse might not be uniquely defined, especially for complex or high-dimensional rules. Without that, it's unclear whether the check actually rules out multiple solutions in practice or just passes on surface-level reversibility. The human comparison uses a representative subset whose selection method is not explained, which makes the 39.8% versus 68.5% gap harder to interpret. This paper is aimed at people building or using LLM evaluation suites, particularly those concerned with generalization rather than pattern matching. Readers who need fresh reasoning items or ideas for automated verification will get the most out of the pipeline description. It has enough of a distinct angle on scalable, verifiable generation to deserve a serious referee, though the proof and experimental choices need expansion first. I would send it for peer review with requests for the full theoretical argument and clearer task examples.

Referee Report

3 major / 2 minor

Summary. The paper introduces A2RBench, an automated pipeline for generating, expanding, evaluating, and analyzing abstract reasoning tasks for LLMs. LLMs generate diverse tasks in the generation stage and reuse validated rules to scale via expansion; hallucinations are addressed by a theoretical framework proving that programmatic verification (testing whether the inverse operation perfectly reverses the forward operation via cycle consistency) guarantees a unique solution. Extensive evaluations on mainstream LLMs report that top models underperform humans on a representative subset (39.8% vs. 68.5%), struggle more with 3D than 1D/2D tasks, and exhibit counterintuitive simplification with higher information complexity.

Significance. If the cycle-consistency guarantee is rigorously established and the verification step is free of false positives for LLM-generated rules, the pipeline could enable scalable, annotation-free creation of verifiable abstract reasoning benchmarks that distinguish genuine reasoning from memorization. The empirical results would then provide a clear signal of current LLM limitations in high-dimensional abstraction.

major comments (3)

[theoretical framework] Abstract and theoretical framework section: the central claim that cycle consistency 'guarantees a unique solution' is load-bearing for the entire generation and scaling process, yet the manuscript provides no derivation steps, formal statement of the inverse operation, or edge-case analysis (e.g., non-canonical inverses or 3D rule representations). Without these, it is impossible to verify whether the programmatic check actually eliminates tasks with multiple valid solutions.
[evaluation] Evaluation section: the claim that LLMs exhibit 'fundamental deficiencies' rests on performance on a 'representative subset' (39.8% vs. 68.5% human), but the selection criteria for this subset are not specified. This omission prevents assessment of whether the reported gap generalizes or is an artifact of subset construction.
[generation and verification] Task generation and verification pipeline: the assumption that the inverse operation can be implemented deterministically and unambiguously for arbitrary LLM-generated rules (including high-dimensional cases) is not justified. If multiple equivalent inverses exist or the check admits false positives, cycle consistency can hold while multiple solutions remain possible, undermining the uniqueness guarantee.

minor comments (2)

[abstract] Abstract: the phrase 'inputs with higher information complexity can simplify the reasoning process' would benefit from a brief illustrative example or definition of 'information complexity' to avoid ambiguity.
[introduction] Notation: ensure consistent use of 'cycle consistency' versus 'programmatic verification' throughout; a short glossary or one-sentence definition in the introduction would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review of our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of the theoretical framework, evaluation details, and pipeline assumptions.

read point-by-point responses

Referee: [theoretical framework] Abstract and theoretical framework section: the central claim that cycle consistency 'guarantees a unique solution' is load-bearing for the entire generation and scaling process, yet the manuscript provides no derivation steps, formal statement of the inverse operation, or edge-case analysis (e.g., non-canonical inverses or 3D rule representations). Without these, it is impossible to verify whether the programmatic check actually eliminates tasks with multiple valid solutions.

Authors: We appreciate the referee's emphasis on rigor for this central claim. The theoretical framework in Section 3 establishes that cycle consistency implies uniqueness by demonstrating that a perfect inverse reversal forces the forward rule to be the only consistent mapping. To improve accessibility and verifiability, the revised manuscript will add explicit derivation steps, a formal mathematical statement defining the inverse operation, and dedicated analysis of edge cases including non-canonical inverses and 3D rule representations. revision: yes
Referee: [evaluation] Evaluation section: the claim that LLMs exhibit 'fundamental deficiencies' rests on performance on a 'representative subset' (39.8% vs. 68.5% human), but the selection criteria for this subset are not specified. This omission prevents assessment of whether the reported gap generalizes or is an artifact of subset construction.

Authors: We agree that explicit selection criteria are necessary for readers to evaluate generalizability. The representative subset was chosen to balance task dimensionality (1D/2D/3D), rule complexity, and coverage of the generated task distribution. In the revision we will insert a clear description of these criteria and the construction procedure in the Evaluation section. revision: yes
Referee: [generation and verification] Task generation and verification pipeline: the assumption that the inverse operation can be implemented deterministically and unambiguously for arbitrary LLM-generated rules (including high-dimensional cases) is not justified. If multiple equivalent inverses exist or the check admits false positives, cycle consistency can hold while multiple solutions remain possible, undermining the uniqueness guarantee.

Authors: This concern is well taken. Within the pipeline the inverse is constructed directly from the validated rule structure produced by the LLM, which by design yields a deterministic mapping for each accepted task. We will augment the revised manuscript with additional justification, concrete examples for high-dimensional rules, and discussion of safeguards against equivalent inverses or false positives to make this assumption more transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central theoretical claim is the establishment of a new framework proving that programmatic cycle-consistency verification (inverse perfectly reversing forward) guarantees a unique solution for generated tasks. This is presented as an independent proof rather than a quantity fitted to data, a renamed empirical pattern, or a result imported via self-citation chain. No equations or steps reduce the uniqueness guarantee to the inputs by construction; the verification is described as an external programmatic check applied after LLM generation. The pipeline's scaling and hallucination-elimination steps therefore rest on this claimed proof as external support, not on re-deriving the same property from the benchmark data itself. This is the most common honest finding for papers that introduce and prove a new verification property.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on an unshown proof that cycle consistency implies uniqueness and on the assumption that LLM-generated tasks admit clean programmatic inverses; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption Cycle consistency (forward operation followed by its inverse returns the original input) guarantees a unique solution for the generated tasks.
Invoked in the theoretical framework section of the abstract to eliminate hallucinations.

pith-pipeline@v0.9.0 · 5785 in / 1245 out tokens · 48818 ms · 2026-05-20T13:19:30.273257+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we prove that programmatic verification—testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)—guarantees a unique solution (Theorem 3.1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

This is guaranteed if f is a well-defined function

Proof of Uniqueness:A task has a unique solution if the functionf maps every inputx∈ X to exactly one output y∈ Y . This is guaranteed if f is a well-defined function. The more rigorous aspect of uniqueness in our context is ensuring that no two distinct inputs map to the same output, which would make the rule ambiguous from an inverse perspective. The cy...

work page
[2]

In our framework, the example setEand the ground-truth answery q are not supplied externally but are generated programmatically *after* the functionf has been defined and verified

Proof of Consistency:A task must be consistent, meaning the underlying rulef applies to all provided example pairs( xi, yi) ∈E and to the final question-answer pair(xq, yq). In our framework, the example setEand the ground-truth answery q are not supplied externally but are generated programmatically *after* the functionf has been defined and verified. Th...

work page
[3]

correct

Proof of Verifiability:A task must be verifiable, meaning its logical soundness can be confirmed through a deterministic procedure. The cycle consistency check is precisely this procedure. It is a deterministic, executable program that takes the generated functionsf and g as input and tests the property g(f(x)) = x. The successful completion of this check...

work page
[4]

NO I N F O R M A T I O N LOSS : The rule must not destroy i n f o r m a t i o n

work page
[5]

Forward Rule : D es cri be s how to t ran sf or m Input A -> Output B

work page
[6]

r e a s o n i n g _ o f _ c r e a t i o n

Inverse Rule : D es cri be s how to verify / r e c o n s t r u c t Input A <- Output B exactly . Output Format ( Strict JSON ) : {{ " r e a s o n i n g _ o f _ c r e a t i o n ": " My r ea so ni ng for this r e v e r s i b l e { d i m e n s i o n a l i t y } rule ,→..." , " r u l e _ d e s c r i p t i o n ": " Clear d e s c r i p t i o n of the Forward t ...

work page
[7]

Forward Rule : ‘{ r u l e _ d e s c r i p t i o n } ‘

work page
[8]

Inverse Rule : ‘{ i n v e r s e _ r u l e _ d e s c r i p t i o n } ‘ Part 1: The Twin F un ct ion s You must i mpl em en t TWO f un ct ion s :

work page
[9]

‘ def t r a n s f o r m _ g r i d ( grid ) : ‘ -> Returns the t r a n s f o r m e d output

work page
[10]

re as on in g

‘ def i n v e r s e _ t r a n s f o r m _ g r i d ( grid ) : ‘ -> Returns the exact original input *. Part 2: R o b u s t n e s s Handle edge cases g r a c e f u l l y . Cycle C o n s i s t e n c y : The logic MUST satisfy ‘ i n v e r s e _ t r a n s f o r m _ g r i d ( t r a n s f o r m _ g r i d ,→( x ) ) == x ‘. Output Format ( Strict JSON with Python ...

work page
[11]

** C o n s i s t e n c y :** Does the Python code actually i mp lem en t the Forward and ,→Inverse rules d es cri be d in natural language ?

work page
[12]

random

** S o l v a b i l i t y :** Are the provided examples s u f f i c i e n t for a human or AI to ,→deduce the rule ? ( i . e . , The rule isn ’ t " random " or hidden inside the code ,→without external logic )

work page
[13]

re as on in g

** Quality :** Is the puzzle non - trivial and i n t e r e s t i n g ? ** Output Format ( Strict JSON ) :** ‘‘‘ json {{ " re as on in g ": " A concise analysis of rule - to - code consistency , example quality , ,→and overall puzzle non - t r i v i a l i t y ." , " is_valid ": < true or false > }} ‘‘‘ Listing 7.Prompt for Integrated Code and Puzzle Valida...

work page
[14]

19 MAC-AutoML

Primary Goal : D et erm in e if the ‘ Model ’ s Answer ‘ is e q u i v a l e n t to the ‘ Ground ,→Truth ‘. 19 MAC-AutoML

work page
[15]

E q u i v a l e n c e : The answer is correct if it is an exact , character - for - ch ar act er ,→match

work page
[16]

j u s t i f i c a t i o n

R eas on in g vs . Answer : Your judgment must be based e x c l u s i v e l y on the final ,→answer provided , not the model ’ s re as on in g process . Output Format ( Strictly adhere to this JSON st ruc tu re ) : {{ " j u s t i f i c a t i o n ": " < A brief , one - sentence e x p l a n a t i o n for your decision . >" , " i s _ c o r r e c t ": < true ...

work page
[17]

Validity : The new input MUST be valid for the provided code

work page
[18]

D ive rs it y : The new input must be strictly d if fe re nt from any in the History

work page
[19]

S K I P P E D _ L O W _ E N T R O P Y

Low Entropy Check : If the rule is too r e s t r i c t i v e to generate new inputs , you ,→MUST return status " S K I P P E D _ L O W _ E N T R O P Y ". Output Format ( Strict JSON ) : {{ " re as on in g ": " Explain your strategy for this test case ( e . g . , testing empty edge ,→case ) ." , " v a r i a t i o n _ t y p e ": " Label for this case ( e . ...

work page
[20]

Lead QA Engineer

Context and Grounding.To ensure logical fidelity, the prompt grounds the LLM in two ways. First, the Role-Playing Personaof a “Lead QA Engineer" conditions the model to shift its operational mode from pure generation to rigorous testing, encouraging it to consider edge cases and adversarial scenarios. Second, and most critically, theCode as a Logical Orac...

work page
[21]

Rather than allowing for unconstrained generation, we dynamically inject specific procedural guidance via the variation_guidance placeholder

Dynamic Objective and Phased Strategy.This is the core mechanism for systematic augmentation. Rather than allowing for unconstrained generation, we dynamically inject specific procedural guidance via the variation_guidance placeholder. This guidance is determined by the currentvariation_index (1 through

work page
[22]

• Mid Stage (V4-V6):EmphasizesEdge Cases(e.g., empty structures, singletons, repetitive elements) to test the solver’s robustness

and follows a three-phase strategy: • Early Stage (V1-V3):Focuses onStandard Variationsto verify baseline comprehension of the rule’s core logic. • Mid Stage (V4-V6):EmphasizesEdge Cases(e.g., empty structures, singletons, repetitive elements) to test the solver’s robustness. • Late Stage (V7-V9):TargetsComplex or Adversarial Casesto stress-test the solve...

work page
[23]

the input must be the exact string ‘ABC’

State Management and Constraints.To maintain high data quality, the prompt incorporates two critical constraints. First, by providing theInput History, it explicitly instructs the model to generate novel inputs, preventing redundancy. Second, theLow-Entropy Escape Hatchserves as a crucial fail- safe mechanism. For rules that are logically restrictive (e.g...

work page
[24]

j u s t i f i c a t i o n

Structured Output.The prompt mandates a strict JSON output format to allow for reliable, automated parsing and validation. The inclusion ofreasoning and variation_type fields provides valuable metadata, enabling detailed downstream analysis of both the generated data and the augmentation model’s strategic approach. In summary, this prompt design channels ...

work page

[1] [1]

This is guaranteed if f is a well-defined function

Proof of Uniqueness:A task has a unique solution if the functionf maps every inputx∈ X to exactly one output y∈ Y . This is guaranteed if f is a well-defined function. The more rigorous aspect of uniqueness in our context is ensuring that no two distinct inputs map to the same output, which would make the rule ambiguous from an inverse perspective. The cy...

work page

[2] [2]

In our framework, the example setEand the ground-truth answery q are not supplied externally but are generated programmatically *after* the functionf has been defined and verified

Proof of Consistency:A task must be consistent, meaning the underlying rulef applies to all provided example pairs( xi, yi) ∈E and to the final question-answer pair(xq, yq). In our framework, the example setEand the ground-truth answery q are not supplied externally but are generated programmatically *after* the functionf has been defined and verified. Th...

work page

[3] [3]

correct

Proof of Verifiability:A task must be verifiable, meaning its logical soundness can be confirmed through a deterministic procedure. The cycle consistency check is precisely this procedure. It is a deterministic, executable program that takes the generated functionsf and g as input and tests the property g(f(x)) = x. The successful completion of this check...

work page

[4] [4]

NO I N F O R M A T I O N LOSS : The rule must not destroy i n f o r m a t i o n

work page

[5] [5]

Forward Rule : D es cri be s how to t ran sf or m Input A -> Output B

work page

[6] [6]

r e a s o n i n g _ o f _ c r e a t i o n

Inverse Rule : D es cri be s how to verify / r e c o n s t r u c t Input A <- Output B exactly . Output Format ( Strict JSON ) : {{ " r e a s o n i n g _ o f _ c r e a t i o n ": " My r ea so ni ng for this r e v e r s i b l e { d i m e n s i o n a l i t y } rule ,→..." , " r u l e _ d e s c r i p t i o n ": " Clear d e s c r i p t i o n of the Forward t ...

work page

[7] [7]

Forward Rule : ‘{ r u l e _ d e s c r i p t i o n } ‘

work page

[8] [8]

Inverse Rule : ‘{ i n v e r s e _ r u l e _ d e s c r i p t i o n } ‘ Part 1: The Twin F un ct ion s You must i mpl em en t TWO f un ct ion s :

work page

[9] [9]

‘ def t r a n s f o r m _ g r i d ( grid ) : ‘ -> Returns the t r a n s f o r m e d output

work page

[10] [10]

re as on in g

‘ def i n v e r s e _ t r a n s f o r m _ g r i d ( grid ) : ‘ -> Returns the exact original input *. Part 2: R o b u s t n e s s Handle edge cases g r a c e f u l l y . Cycle C o n s i s t e n c y : The logic MUST satisfy ‘ i n v e r s e _ t r a n s f o r m _ g r i d ( t r a n s f o r m _ g r i d ,→( x ) ) == x ‘. Output Format ( Strict JSON with Python ...

work page

[11] [11]

** C o n s i s t e n c y :** Does the Python code actually i mp lem en t the Forward and ,→Inverse rules d es cri be d in natural language ?

work page

[12] [12]

random

** S o l v a b i l i t y :** Are the provided examples s u f f i c i e n t for a human or AI to ,→deduce the rule ? ( i . e . , The rule isn ’ t " random " or hidden inside the code ,→without external logic )

work page

[13] [13]

re as on in g

** Quality :** Is the puzzle non - trivial and i n t e r e s t i n g ? ** Output Format ( Strict JSON ) :** ‘‘‘ json {{ " re as on in g ": " A concise analysis of rule - to - code consistency , example quality , ,→and overall puzzle non - t r i v i a l i t y ." , " is_valid ": < true or false > }} ‘‘‘ Listing 7.Prompt for Integrated Code and Puzzle Valida...

work page

[14] [14]

19 MAC-AutoML

Primary Goal : D et erm in e if the ‘ Model ’ s Answer ‘ is e q u i v a l e n t to the ‘ Ground ,→Truth ‘. 19 MAC-AutoML

work page

[15] [15]

E q u i v a l e n c e : The answer is correct if it is an exact , character - for - ch ar act er ,→match

work page

[16] [16]

j u s t i f i c a t i o n

R eas on in g vs . Answer : Your judgment must be based e x c l u s i v e l y on the final ,→answer provided , not the model ’ s re as on in g process . Output Format ( Strictly adhere to this JSON st ruc tu re ) : {{ " j u s t i f i c a t i o n ": " < A brief , one - sentence e x p l a n a t i o n for your decision . >" , " i s _ c o r r e c t ": < true ...

work page

[17] [17]

Validity : The new input MUST be valid for the provided code

work page

[18] [18]

D ive rs it y : The new input must be strictly d if fe re nt from any in the History

work page

[19] [19]

S K I P P E D _ L O W _ E N T R O P Y

Low Entropy Check : If the rule is too r e s t r i c t i v e to generate new inputs , you ,→MUST return status " S K I P P E D _ L O W _ E N T R O P Y ". Output Format ( Strict JSON ) : {{ " re as on in g ": " Explain your strategy for this test case ( e . g . , testing empty edge ,→case ) ." , " v a r i a t i o n _ t y p e ": " Label for this case ( e . ...

work page

[20] [20]

Lead QA Engineer

Context and Grounding.To ensure logical fidelity, the prompt grounds the LLM in two ways. First, the Role-Playing Personaof a “Lead QA Engineer" conditions the model to shift its operational mode from pure generation to rigorous testing, encouraging it to consider edge cases and adversarial scenarios. Second, and most critically, theCode as a Logical Orac...

work page

[21] [21]

Rather than allowing for unconstrained generation, we dynamically inject specific procedural guidance via the variation_guidance placeholder

Dynamic Objective and Phased Strategy.This is the core mechanism for systematic augmentation. Rather than allowing for unconstrained generation, we dynamically inject specific procedural guidance via the variation_guidance placeholder. This guidance is determined by the currentvariation_index (1 through

work page

[22] [22]

• Mid Stage (V4-V6):EmphasizesEdge Cases(e.g., empty structures, singletons, repetitive elements) to test the solver’s robustness

and follows a three-phase strategy: • Early Stage (V1-V3):Focuses onStandard Variationsto verify baseline comprehension of the rule’s core logic. • Mid Stage (V4-V6):EmphasizesEdge Cases(e.g., empty structures, singletons, repetitive elements) to test the solver’s robustness. • Late Stage (V7-V9):TargetsComplex or Adversarial Casesto stress-test the solve...

work page

[23] [23]

the input must be the exact string ‘ABC’

State Management and Constraints.To maintain high data quality, the prompt incorporates two critical constraints. First, by providing theInput History, it explicitly instructs the model to generate novel inputs, preventing redundancy. Second, theLow-Entropy Escape Hatchserves as a crucial fail- safe mechanism. For rules that are logically restrictive (e.g...

work page

[24] [24]

j u s t i f i c a t i o n

Structured Output.The prompt mandates a strict JSON output format to allow for reliable, automated parsing and validation. The inclusion ofreasoning and variation_type fields provides valuable metadata, enabling detailed downstream analysis of both the generated data and the augmentation model’s strategic approach. In summary, this prompt design channels ...

work page