StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models

Guilin Qi; Haofen Wang; Huajun Chen; Shenyu Zhang; Xiaoying Huang; Yangyang Ma; Yongrui Chen

arxiv: 2605.01939 · v1 · submitted 2026-05-03 · 💻 cs.CL

Yongrui Chen, Yangyang Ma, Xiaoying Huang, Shenyu Zhang, Huajun Chen, Haofen Wang, Guilin Qi

Pith reviewed 2026-05-09 17:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords dynamic benchmarking · LLM evaluation · failure-driven synthesis · knowledge-intensive reasoning · benchmark contamination · data synthesis · reasoning evaluation · model weaknesses

The pith

A failure-driven synthesis framework turns observed LLM errors into dynamic benchmarks that produce larger performance drops while preserving explicit difficulty factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static benchmarks for large language models suffer from contamination and overfitting, especially on knowledge-intensive reasoning tasks. This paper proposes StressEval, which generates new test instances directly from model failures through a three-stage process. The method first builds a semi-structured difficulty card that identifies the failed reasoning step and its root cause, then applies dual-perspective synthesis that targets knowledge gaps and reasoning breakdowns while keeping the original difficulty factors intact, and finally applies a gating step for quality. If the approach works, it enables focused dynamic suites like Dynamic OneEval that challenge models more than their static sources without sacrificing answerability or controllability. This matters for creating evaluations that better diagnose specific weaknesses and support targeted iteration on LLMs.

Core claim

StressEval is a failure-driven data synthesis framework that constructs a semi-structured difficulty card identifying the failed reasoning step and its root cause, applies dual-perspective instance synthesis targeting both knowledge gaps and reasoning breakdowns while preserving underlying difficulty factors, and uses a gating mechanism to retain only grounded, unambiguous instances. Seeded from multiple knowledge-intensive reasoning datasets, it produces Dynamic OneEval, on which several state-of-the-art LLMs show substantially larger performance drops than on the original benchmarks while retaining explicit difficulty factors for more actionable iteration.

What carries the argument

The three-stage failure-driven data synthesis framework consisting of difficulty card construction, dual-perspective synthesis, and gating mechanism that converts observed model failures into new controllable test instances.
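
The data flow of the three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, field names, and the `stress_pipeline`, `toy_synthesize`, and `toy_gate` functions are all hypothetical, and the toy stages stand in for the model-backed synthesis and gating the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DifficultyCard:
    """One observed failure; every field name here is an illustrative assumption."""
    seed_question: str
    failed_step: str   # the reasoning step where the model went wrong
    root_cause: str    # why that step failed
    perspective: str   # "knowledge" or "reasoning"


@dataclass(frozen=True)
class Instance:
    question: str
    gold_answer: str
    card: DifficultyCard  # kept so the difficulty factor stays explicit


def stress_pipeline(
    cards: Iterable[DifficultyCard],
    synthesize: Callable[[DifficultyCard], List[Instance]],
    gate: Callable[[Instance], bool],
) -> List[Instance]:
    """Run synthesis (stage 2) then gating (stage 3) over stage-1 cards."""
    kept: List[Instance] = []
    for card in cards:
        for candidate in synthesize(card):
            if gate(candidate):
                kept.append(candidate)
    return kept


# Toy stand-ins for the model-backed stages, only to make the sketch runnable.
def toy_synthesize(card: DifficultyCard) -> List[Instance]:
    return [
        Instance(f"Variant of: {card.seed_question}", "City_Y", card),
        Instance("", "City_Y", card),  # malformed candidate the gate should drop
    ]


def toy_gate(inst: Instance) -> bool:
    return bool(inst.question.strip()) and bool(inst.gold_answer.strip())


cards = [
    DifficultyCard(
        seed_question="Which city hosts Person_A's institution?",
        failed_step="second hop",
        root_cause="distractor path",
        perspective="reasoning",
    )
]
benchmark = stress_pipeline(cards, toy_synthesize, toy_gate)
print(len(benchmark), benchmark[0].card.root_cause)
```

Carrying the card on every retained instance is one way to keep each test item traceable to the failure that seeded it, which is the property the review highlights.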

If this is right

Dynamic benchmarks generated this way expose substantially larger performance drops across multiple state-of-the-art LLMs compared to static originals.
Explicit difficulty factors remain available for targeted model iteration and improvement.
The synthesis applies across multiple seeded knowledge-intensive reasoning datasets while retaining controllability and answerability.
Contamination and overfitting issues in static benchmarks can be mitigated through failure-based regeneration.
Evaluation becomes more diagnostic because new instances directly link back to specific observed failure modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated application could create iterative loops where models are tested, failures synthesized into new tests, and retrained, accelerating robustness gains.
The retained difficulty factors might allow systematic comparison of how different model scales or architectures handle the same root causes.
If extended to multimodal or agentic tasks, the same failure-to-instance pipeline could expose evaluation gaps in those domains as well.
Training objectives that explicitly target the identified root causes from the difficulty cards could yield more efficient fixes than generic scaling.

Load-bearing premise

The dual-perspective synthesis and gating mechanism can reliably produce grounded, unambiguous instances that preserve original failure root causes without introducing new artifacts or altering difficulty.

What would settle it

A check of whether the synthesized instances in Dynamic OneEval exhibit the same root causes as the original failures or instead introduce new difficulties that explain the larger drops rather than the preserved factors.
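
One way such a check could be scored is a simple match rate between the root cause recorded on the seed's difficulty card and the root cause independently labeled on the synthesized instance. The sketch below assumes those paired labels already exist; the function name `root_cause_preservation` and the example labels are hypothetical, and the paper is not known to report this statistic.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def root_cause_preservation(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[float, Dict[str, float]]:
    """Each pair is (root cause on the seed card, root cause labeled on the
    synthesized instance). Returns the overall and per-cause match rates."""
    total: Counter = Counter()
    matched: Counter = Counter()
    for seed_cause, synth_cause in pairs:
        total[seed_cause] += 1
        if seed_cause == synth_cause:
            matched[seed_cause] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no labeled pairs supplied")
    per_cause = {cause: matched[cause] / total[cause] for cause in total}
    return sum(matched.values()) / n, per_cause


# Made-up labels purely for illustration.
labels = [
    ("path confusion", "path confusion"),
    ("path confusion", "temporal constraint"),
    ("relation direction", "relation direction"),
    ("relation direction", "relation direction"),
]
overall, per_cause = root_cause_preservation(labels)
print(overall, per_cause)
```

A low per-cause rate would point to root causes where synthesis drifts toward a different difficulty, which is the alternative explanation for the larger drops.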

Figures

Figures reproduced from arXiv: 2605.01939 by Guilin Qi, Haofen Wang, Huajun Chen, Shenyu Zhang, Xiaoying Huang, Yangyang Ma, Yongrui Chen.

**Figure 1.** Upper left: static benchmarks degrade via overfitting and

**Figure 2.** STRESSEVAL framework. Stage 1 performs structured error analysis and produces per-case difficulty cards; Stage 2 synthesizes new instances via knowledge black-box for Γk and reasoning-skeleton for Γr, showing original and synthesized question-answer pairs; Stage 3 applies an LLM gating mechanism to keep answerable, unambiguous, yet unsolved hard instances.

**Figure 3.** Left: root-cause distribution among the seed failure cases

**Figure 4.** Performance of different LLMs on each root cause. Due to space limitations, only representative root causes are included here.

**Figure 5.** Performance of STRESSEVAL using different backbones. We use different backbone LLMs (x-axis) to run STRESSEVAL and generate evaluation instances, and then benchmark a set of target LLMs (y-axis) on the resulting datasets.

**Figure 6.** Root-cause distribution among the seed failure cases

**Figure 7.** Root-cause distribution of our DYNAMIC-ONEEVAL, which is comparatively more balanced

Original abstract

Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In this paper, we propose StressEval, a failure-driven data synthesis framework that turns observed model failures into dynamic, challenging, and controllable test instances. StressEval consists of three stages: first, it constructs a semi-structured difficulty card that identifies the failed reasoning step and its root cause; second, it applies a dual-perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors; and third, it applies a gating mechanism to retain only grounded, unambiguous instances. Seeding from multiple knowledge-intensive reasoning datasets, we employ StressEval to build Dynamic OneEval, a focused suite of challenging dynamic benchmarks. Across several state-of-the-art LLMs, Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors, enabling more actionable iteration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

StressEval gives a clear three-stage pipeline for turning failures into dynamic tests via difficulty cards and dual synthesis, but the abstract leaves the key validation gap unaddressed.

Colleague, the main thing to know about this paper is that it describes StressEval, a failure-driven synthesis method that builds difficulty cards from model errors, applies dual-perspective generation for knowledge and reasoning issues, then gates for grounded items, and uses it to create Dynamic OneEval, which reportedly produces bigger drops than the seed benchmarks on current LLMs. The approach tries to keep difficulty factors explicit so iteration stays actionable.

What stands out as new is the specific combination of semi-structured cards with the dual view and gating step; prior dynamic benchmark work usually ramps up difficulty without this failure-seeded, controllable structure. It does a solid job framing the contamination problem and offering a practical seeding process from existing datasets that could help keep evaluations ahead of training data.

The soft spots are mostly around evidence. The abstract claims substantially larger drops while retaining the original factors, yet supplies no numbers, no ablation on the gating, no human agreement rates on whether root causes are preserved, and no checks for new artifacts or shifted difficulties. That leaves the central result open to the exact concern in the stress-test note: the drops might come from synthesis changes rather than the intended preserved failures. If the full paper has those quantitative validations or side-by-side comparisons, it would close the gap; based on the description alone, the claim rests on the method outline.

This is aimed at researchers building or using LLM reasoning benchmarks who need tools that stay fresh and targeted. A reader focused on evaluation methodology would get concrete ideas from the pipeline even if they want more proof on fidelity. I would send it to peer review because the idea is grounded enough in a real problem and the stages are described clearly enough for referees to assess and suggest improvements on the validation side.

Referee Report

3 major / 2 minor

Summary. The paper proposes StressEval, a three-stage failure-driven framework for dynamic benchmarking: (1) constructing a semi-structured difficulty card from observed LLM failures on knowledge-intensive reasoning tasks, (2) dual-perspective instance synthesis targeting knowledge gaps and reasoning breakdowns while aiming to preserve underlying difficulty factors, and (3) a gating mechanism to retain only grounded, unambiguous instances. Seeding from existing datasets, it produces Dynamic OneEval, which is claimed to induce substantially larger performance drops across SOTA LLMs than the original benchmarks while retaining explicit difficulty factors for more actionable iteration.

Significance. If the synthesized instances can be shown to preserve original failure root causes without introducing new artifacts or shifting difficulty, the framework would offer a useful advance over static or uncontrollably harder dynamic benchmarks by enabling targeted, controllable stress-testing of LLM reasoning. The explicit difficulty-card approach is a conceptual strength for interpretability.

major comments (3)

§3.2 (dual-perspective instance synthesis): the claim that synthesized instances preserve the exact failure root causes identified in the difficulty card is load-bearing for attributing larger drops to the original factors rather than new ambiguities or altered knowledge requirements, yet no quantitative validation (e.g., root-cause matching scores, side-by-side human comparisons, or artifact-injection checks) is provided.
§3.3 (gating mechanism): the gating step is described only at a high level as retaining 'grounded, unambiguous' items; without reported inter-annotator agreement, precision/recall on grounding, or ablation showing its effect on performance drops, it is impossible to rule out that drops arise from the gating itself rather than preserved difficulty.
§4 (experiments): the headline result of 'substantially larger performance drops' on Dynamic OneEval is stated without tables reporting exact deltas, model-by-model breakdowns, controls for instance length or lexical overlap, or error analysis linking drops back to the original difficulty cards.
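
Two of the quantities requested for the experiments, exact per-model deltas and a lexical-overlap control, are cheap to compute. The sketch below is only an illustration: the function names `accuracy_drops` and `jaccard_overlap`, the whitespace tokenization, and the accuracy figures are assumptions made up for the example, not values or methods from the paper.

```python
from typing import Dict, Set


def accuracy_drops(
    static_acc: Dict[str, float], dynamic_acc: Dict[str, float]
) -> Dict[str, float]:
    """Per-model drop = accuracy on the static seed benchmark minus accuracy
    on the dynamic benchmark, for models present in both tables."""
    return {
        model: static_acc[model] - dynamic_acc[model]
        for model in static_acc
        if model in dynamic_acc
    }


def _tokens(text: str) -> Set[str]:
    # Lowercased whitespace tokens; a deliberately crude tokenizer.
    return set(text.lower().split())


def jaccard_overlap(seed_question: str, new_question: str) -> float:
    """Token-set Jaccard similarity between a seed question and its variant."""
    a, b = _tokens(seed_question), _tokens(new_question)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder accuracies, not results from the paper.
static_acc = {"model_a": 0.80, "model_b": 0.70}
dynamic_acc = {"model_a": 0.50, "model_b": 0.55}
drops = accuracy_drops(static_acc, dynamic_acc)
print({model: round(drop, 2) for model, drop in drops.items()})

overlap = jaccard_overlap(
    "which river flows through RegionY",
    "which river crosses RegionY",
)
print(overlap)
```

Reporting the overlap distribution alongside the deltas would show whether the larger drops track surface rewording or something deeper.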

minor comments (2)

Abstract: run-on sentences and missing punctuation (e.g., after 'contamination and overfitting especially on knowledge intensive reasoning tasks') reduce readability; consider breaking into shorter sentences.
§4: The term 'Dynamic OneEval' is introduced without clarifying whether it is a single benchmark suite or multiple variants; a table summarizing the seeded source datasets and resulting instance counts would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments in detail below, providing clarifications and committing to specific revisions that will strengthen the empirical support for our claims.

Referee: §3.2 (dual-perspective instance synthesis): the claim that synthesized instances preserve the exact failure root causes identified in the difficulty card is load-bearing for attributing larger drops to the original factors rather than new ambiguities or altered knowledge requirements, yet no quantitative validation (e.g., root-cause matching scores, side-by-side human comparisons, or artifact-injection checks) is provided.

Authors: We appreciate the referee's emphasis on validating the preservation of failure root causes. Our dual-perspective synthesis method explicitly targets the knowledge gaps and reasoning breakdowns specified in the difficulty card to maintain the core difficulty factors. While the manuscript includes illustrative examples demonstrating this targeting, we concur that quantitative validation is valuable. In the revised version, we will incorporate a human study involving side-by-side comparisons of original failures and synthesized instances, along with metrics for alignment to the difficulty cards and checks for introduced artifacts such as lexical overlap analysis.

revision: yes

Referee: §3.3 (gating mechanism): the gating step is described only at a high level as retaining 'grounded, unambiguous' items; without reported inter-annotator agreement, precision/recall on grounding, or ablation showing its effect on performance drops, it is impossible to rule out that drops arise from the gating itself rather than preserved difficulty.

Authors: The gating mechanism serves to ensure that only instances meeting criteria for grounding in verifiable knowledge and lack of ambiguity are retained, thereby focusing the benchmark on the intended difficulty factors. The current description outlines the annotation guidelines, but we acknowledge the need for more transparency on its reliability and impact. For the revision, we will report inter-annotator agreement statistics for the gating process and include an ablation experiment comparing model performance on the full synthesized set versus the gated subset to isolate the effect of gating.

revision: yes

Referee: §4 (experiments): the headline result of 'substantially larger performance drops' on Dynamic OneEval is stated without tables reporting exact deltas, model-by-model breakdowns, controls for instance length or lexical overlap, or error analysis linking drops back to the original difficulty cards.

Authors: We agree that the experimental section would benefit from greater detail to support the headline claims. The manuscript reports aggregate performance improvements in difficulty, but we will revise Section 4 to include comprehensive tables with per-model accuracy scores, exact deltas, and breakdowns. We will also add statistical controls for instance length and lexical overlap by reporting these properties across datasets and ensuring balanced distributions. Furthermore, we will expand the error analysis to explicitly link observed failures on Dynamic OneEval to the difficulty factors identified in the cards, providing concrete examples.

revision: yes
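
The inter-annotator agreement promised for the gating step could be reported with Cohen's kappa, a standard chance-corrected agreement statistic for two raters. The rebuttal does not say which statistic would be used, so this is an assumption; the function name `cohens_kappa` and the keep/drop labels below are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical keep/drop gating decisions from two annotators on eight items.
annotator_1 = ["keep", "keep", "drop", "drop", "keep", "drop", "keep", "keep"]
annotator_2 = ["keep", "drop", "drop", "drop", "keep", "drop", "keep", "keep"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on 7 of 8 items and chance agreement is 0.5, so kappa is 0.75; raw agreement alone (0.875) would overstate reliability when one label dominates.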

Circularity Check

0 steps flagged

No circularity: empirical evaluation independent of synthesis inputs

The paper presents a descriptive three-stage synthesis framework (difficulty card construction, dual-perspective synthesis, gating) seeded from existing datasets to produce Dynamic OneEval, followed by direct empirical measurement of LLM performance drops on the new instances versus originals. No equations, fitted parameters, or first-principles derivations are present. The headline result (larger drops) is an observed outcome on external models, not a quantity forced by construction from the synthesis procedure or by self-citation chains. The framework does not rename known results, smuggle ansatzes, or import uniqueness theorems; it is self-contained as an independent data-generation method whose validity can be checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on the assumption that model failures can be decomposed into identifiable reasoning steps whose root causes can be preserved in new instances; no free parameters or external constants are mentioned.

axioms (2)

domain assumption: Observed LLM failures on knowledge-intensive tasks can be mapped to specific reasoning steps and root causes
Invoked in the first stage of difficulty-card construction

domain assumption: Synthesized instances can retain original difficulty factors while remaining grounded and unambiguous
Required for the gating mechanism to produce usable benchmarks

invented entities (3)

difficulty card · no independent evidence
purpose: Semi-structured record of failed reasoning step and root cause
Core new data structure introduced to guide synthesis

dual-perspective instance synthesis · no independent evidence
purpose: Method that separately targets knowledge gaps and reasoning breakdowns
Central synthesis technique of the framework

gating mechanism · no independent evidence
purpose: Filter that retains only grounded, unambiguous instances
Quality-control step to ensure benchmark validity

pith-pipeline@v0.9.0 · 5473 in / 1360 out tokens · 58678 ms · 2026-05-09T17:13:39.405878+00:00 · methodology

[1] [1]

(DrLee, affiliated_with, InstA)

work page

[2] [2]

Which city is DrLee's affiliated institution located in?

(InstA, headquartered_in, MetroZ) Distractors: (DrLee, collaborated_with, ProfX), (ProfX, headquartered_in, OtherCity) Question: "Which city is DrLee's affiliated institution located in?", Gold answer: MetroZ, may fail due to distraction or incomplete multi-hop reasoning. Pattern 3: Name (Type): Path confusion Description: Multiple paths connect source to...

work page

[3] [3]

(MayorK, parent_of, CouncilorM)

work page

[4] [4]

Who has a family member employed by CityHall?

(CouncilorM, employed_by, CityHall) Short spurious path: (MayorK, met_with, CityHallRep) Question: "Who has a family member employed by CityHall?" Gold answer: MayorK (via path 1+2), Model may incorrectly choose CityHallRep due to shortest-path bias. Pattern 4: Name (Type): Relation directionality confusion Description: The model treats a directed relatio...

work page

[5] [5]

(RiverX, flows_through, RegionY)

work page

[6] [6]

(RegionY, contains, CityZ)

work page

[7] [7]

Which river flows through RegionY?

(CityZ, near_river, RiverX) (weak semantics) Question: "Which river flows through RegionY?", model may confuse direction and answer incorrectly. Pattern 5: Name (Type): Temporal constraint Error 737 Description: The model misinterprets temporal expressions (before/after/during, inclusive boundaries), or parses/normalizes date values incorrectly (2010s -> ...

work page 2010

[8] [8]

Shoreline

(EntityA, has_title, "Shoreline")

work page

[9] [9]

Who created Shoreline?

(EntityB, born_in, CityC) Context ambiguously leads the model to treat EntityA as a person instead of a work. Question: "Who created Shoreline?", Gold answer: Artist, model may treat EntityA incorrectly as a person to output wrong answer Requirements: - "error_type": The exact English name of the selected error pattern from Pattern 1-7 (e.g., "Multi-hop C...

work page

[10] [10]

Summarize the error in an abstract and generalizable form (use generalized, placeholder entities (e.g., Person_A, University_X, City_Y))

work page

[11] [11]

Explain how this error pattern can be transferred and what's the difficulty

work page

[12] [12]

transfer_conditions

Specify what characteristics or elements must be present in the newly generated context and question to ensure the error can be reproduced. The analysis of transfer should be included in "transfer_conditions". 738 Prompt for Table R stress (for generating error reports) You are an analyst that converts concrete WTQ-style TABLE QA failures into reusable, s...

work page

[13] [13]

Reasoning requirement:

work page

[14] [14]

false" if unique answer -

Anti-ambiguity constraints: - Surface pattern must describe concrete, checkable table/question cues (e.g., a year appears in a Date column; counting is required; entity only appears in the question and not as a column). - Reasoning requirement must describe the minimal operation sequence. - Failure signature must describe the typical wrong behavior (e.g.,...

work page

[15] [15]

Output MUST be exactly ONE JSON object and nothing else (no markdown, no code fences)

work page

[16] [16]

items":[{

Output schema must be exactly: {"items":[{"new_question":"...","recipe":"...","new_gold_answer":"...","clue":"..."}, ...]}

work page

[17] [17]

Output exactly the number of items requested in the user message for this call (no more, no less)

work page

[18] [18]

===================== ANTI-LEAK CONSTRAINTS =====================

Do NOT output any extra keys beyond: new_question, recipe, new_gold_answer, clue. ===================== ANTI-LEAK CONSTRAINTS =====================

work page

[19] [19]

NEVER include the ORIGINAL GOLD ANSWER as a substring in any new_question (case-insensitive)

[20]

NEVER include the ORIGINAL GOLD ANSWER inside the clue for SAME items (SAME clue must be empty anyway)

[21]

For any item where recipe starts with "HOP:", new_gold_answer MUST be DIFFERENT from the original gold answer.

==================== ALLOWED RECIPE TYPES ====================
There are exactly two recipe families:
A) SAME (answer unchanged; expanded/indirect phrasing)
- recipe: "SAME"
- new_gold_answer: MUST equal the original gold answer exactly
- clue: M...
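The output-schema, anti-leak, and recipe rules listed above are all mechanically checkable, so a gating step could enforce them before an item is accepted. The following is a minimal sketch of such a check, not the authors' implementation. The function names `check_item` and `check_output` are hypothetical, and the second recipe family is assumed to be the one whose recipe string starts with "HOP:", since its full definition is truncated in the source.

```python
import json

# The only keys the prompt allows in each item.
ALLOWED_KEYS = {"new_question", "recipe", "new_gold_answer", "clue"}


def check_item(item: dict, original_gold: str) -> list[str]:
    """Return the rule violations for one synthesized item (empty list if none)."""
    if set(item) != ALLOWED_KEYS:
        return ["bad keys"]
    problems = []
    # Anti-leak: case-insensitive substring check against the original gold answer.
    if original_gold.lower() in item["new_question"].lower():
        problems.append("gold answer leaked into new_question")
    recipe = item["recipe"]
    if recipe == "SAME":
        if item["new_gold_answer"] != original_gold:
            problems.append("SAME must keep the gold answer")
        if item["clue"] != "":
            problems.append("SAME clue must be empty")
    elif recipe.startswith("HOP:"):  # assumed second family (truncated in source)
        if item["new_gold_answer"] == original_gold:
            problems.append("HOP must change the gold answer")
    else:
        problems.append("unknown recipe")
    return problems


def check_output(raw: str, original_gold: str, expected_count: int) -> list[str]:
    """Validate a raw model response against the required output schema."""
    try:
        obj = json.loads(raw)  # code fences or extra prose fail here
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if (not isinstance(obj, dict) or set(obj) != {"items"}
            or not isinstance(obj["items"], list)):
        return ["top level must be an object with a single 'items' list"]
    problems = []
    if len(obj["items"]) != expected_count:
        problems.append("wrong item count")
    for i, item in enumerate(obj["items"]):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        problems += [f"item {i}: {p}" for p in check_item(item, original_gold)]
    return problems
```

An empty list from `check_output` means the response satisfies every rule encoded here; rules that need semantic judgment (for example, whether the new question is answerable) are outside the scope of this sketch.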

[22]

PATTERN FIDELITY (MOST IMPORTANT): Faithfully instantiate the SAME error pattern described by the input "abstract_error_template" and "transfer_guidance"

[23]

EVIDENCE ALIGNMENT + CLOSED WORLD: The gold answer must be uniquely derivable ONLY from the provided contexts, and the required evidence must be exactly captured by supporting_facts (title + 0-based sent_id)

[24]

MULTI-HOP DIFFICULTY: The question must require >=2 reasoning steps across at least TWO different context titles, and must strongly tempt the target error pattern

[25]

FICTIONAL PROPER NAMES (NO PLACEHOLDERS REQUIRED): Use invented fictional proper names consistently. Absolutely NO real-world entities of any kind

[26]

United States

STRICT JSON OUTPUT: Output one valid JSON object matching the required schema, with no extra text.

INPUT: You will receive ONE error_report JSON that includes (among other fields):
- abstract_error_template
- transfer_guidance
- error_type_canon (may exist)
You MUST treat abstract_error_template and transfer_guidance as authoritative.

YOUR TASK: Generate ...
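The evidence-alignment and multi-hop rules in this prompt (supporting_facts given as title plus 0-based sent_id, spanning at least two context titles) lend themselves to a structural check. The sketch below is illustrative only. It assumes contexts are stored as a list of `[title, sentences]` pairs, a layout the source does not specify, and the function name `check_supporting_facts` is hypothetical.

```python
def check_supporting_facts(contexts, supporting_facts):
    """Return structural problems with the supporting facts (empty list if none).

    contexts: list of [title, list_of_sentences] pairs (assumed layout).
    supporting_facts: list of [title, sent_id] pairs with 0-based sent_id.
    """
    by_title = {title: sentences for title, sentences in contexts}
    problems = []
    for title, sent_id in supporting_facts:
        sentences = by_title.get(title)
        if sentences is None:
            problems.append(f"unknown context title: {title}")
        elif not 0 <= sent_id < len(sentences):
            problems.append(f"sent_id {sent_id} out of range for {title}")
    # Multi-hop rule: evidence must come from at least two different titles.
    if len({title for title, _ in supporting_facts}) < 2:
        problems.append("supporting facts must span at least two context titles")
    return problems
```

This only verifies that the cited evidence exists and spans two titles; whether the gold answer is uniquely derivable from it still requires a reasoning check.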

[27]

error_details:
- Purpose: Explain exactly why the model failed, comparing Gold Answer reasoning vs model output.
- What to learn for new example: Identify which entity, relation, or combination caused the error.
- How to apply: The new KG must include these critical elements or relations so that the error can be reproduced. This guides the placement of di...

[28]

You will be given a list of knowledge graph triples. Answer the following question using the information in the triples

transfer_conditions:
- Purpose: Specify how the error pattern can be transferred to a new context.
- Structure:
  a) Abstract Error Form:
     - Role: Summarize the logical structure of the error with placeholder entities and relations (e.g., Person_A, Country_X, Organization_Y).
     - Guidance: New examples must reproduce this logical structure exactly, including t...

[29]

Attribute Relations (entity_attribute)
- VALUE is a literal (not another entity).
- Form: entity_attribute or entity.attribute.
- Examples: city_population, state_governor, economy_gdp

[30]

Entity-to-Entity Relations (entity_relation_entity)
- VALUE is another entity.
- Form: entity_relation_entity or entity.verb.entity.
- Examples: located_in, has_state, scientist_affiliated_with

[31]

Hierarchical Attribute Relations (entity.entity.attribute)
- VALUE is typically a literal.
- Form: entity.entity.attribute.
- Examples: location.country.capital, location.country.language.

- Use a single space between ENTITY, RELATION, and VALUE
- Do NOT use "|" inside a triple
- Separate triples using "|" as a delimiter
- Include a section "question:" wi...
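The triple formatting rules above (single spaces between ENTITY, RELATION, and VALUE, with "|" as the delimiter between triples) can be illustrated with a small parser. This is a sketch under stated assumptions, not the authors' code: the function name `parse_triples` is hypothetical, entity and relation names are assumed to contain no spaces, and VALUE is allowed to contain spaces because it may be a literal.

```python
def parse_triples(text: str) -> list[tuple[str, str, str]]:
    """Parse 'ENTITY RELATION VALUE | ENTITY RELATION VALUE | ...' into tuples."""
    triples = []
    for chunk in text.split("|"):  # "|" only ever separates triples
        chunk = chunk.strip()
        if not chunk:
            continue
        # Split on the first two single spaces; the remainder is the VALUE.
        parts = chunk.split(" ", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"malformed triple: {chunk!r}")
        entity, relation, value = parts
        triples.append((entity, relation, value.strip()))
    return triples
```

For example, `parse_triples("Aldoria location.country.capital Brenholt | Brenholt city_population 41000")` yields two triples, one hierarchical attribute relation and one attribute relation; the entity names here are invented for illustration.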

[32]

You must generate the correct answer (gold_answer) for the "New_example" and store it in "New_example_gold_answer"

[33]

To ensure the correctness and executability of the gold_answer, you must perform internal step-by-step reasoning when generating the answer; however, do NOT output the reasoning process

[34]

You must refer to the specified error_type, details and the transfer_conditions, and explicitly avoid the described failure cases/situation when reasoning and constructing the gold_answer, so that the output represents the correct logical form for subsequent evaluation

[35]

If multiple correct answers exist, separate them with commas. If doesn't exist, return "None". Do NOT output anything other than the correct answer(s) -- this includes explanations, evidence, reasoning steps, or any additional text.

Output: Generate 1 separate KG-QA examples (with question) and its gold_answer following the above rules. Each example MUST ...
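The answer format described here (comma-separated answers, or the literal "None" when no answer exists) suggests a simple way to compare a model output with the gold answer. The sketch below is one possible reading, not the paper's scoring procedure. The function names are hypothetical, and the case-insensitive, order-insensitive comparison is an assumption. Answers that themselves contain commas would be split incorrectly.

```python
def parse_answers(raw: str) -> list[str]:
    """Split a comma-separated answer string; "None" means no answer exists."""
    text = raw.strip()
    if text == "None":
        return []
    return [part.strip() for part in text.split(",") if part.strip()]


def same_answers(predicted: str, gold: str) -> bool:
    """Compare two answer strings, ignoring order and letter case (assumed)."""
    def normalize(s: str) -> list[str]:
        return sorted(answer.lower() for answer in parse_answers(s))
    return normalize(predicted) == normalize(gold)
```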

[36]

PATTERN FIDELITY (MOST IMPORTANT): Faithfully instantiate the SAME reasoning error pattern described by the input fields: "trigger" and "transfer_guidance" (authoritative)

[37]

CLOSED WORLD + UNIQUE ANSWER: The gold answer must be uniquely derivable ONLY from the provided table. Do NOT require any external knowledge

[38]

TRAP STRENGTH: The table + question must strongly tempt the target failure signature implied by the pattern (e.g., overstrict unknown, filter drop, wrong argmax target, unit normalization skip)

[39]

FICTIONAL PROPER NAMES (ABSOLUTELY REQUIRED): Use invented fictional proper names consistently. Absolutely NO real-world entities of any kind

[40]

STRICT JSON OUTPUT: Output one valid JSON object matching the required schema, with no extra text.

INPUT: You will receive ONE "pattern_seed" JSON object that includes:
- case_id
- reasoning_family
- required_ops
- bottleneck_step
- trigger
- transfer_guidance
(Other fields may exist. Treat trigger + transfer_guidance as authoritative.)

HARD CONSTRAINT: N...
