arxiv: 2604.22294 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords long document question answering · structured reasoning · relational database · SQL-based reasoning · data reconciliation · LLM context limits · multi-document synthesis

The pith

SLIDERS extracts salient information from long documents into a relational database and reasons over it with SQL to answer questions at scales far beyond any LLM context window.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that decomposing documents into chunks and aggregating chunk-level outputs creates an aggregation bottleneck as collections grow, even when each chunk fits inside a strong LLM's context window. SLIDERS instead pulls key facts into a persistent relational schema, reconciles duplicated, inconsistent, and incomplete records using provenance, extraction rationales, and metadata, and then executes SQL queries over the structured state. The paper reports that this approach outperforms strong baselines, including GPT-4.1, on three existing long-context benchmarks that already fit in model windows, and improves over the next best baseline by roughly 19 and 32 points on two new benchmarks at 3.9 million and 36 million tokens.
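A back-of-the-envelope sketch of the aggregation bottleneck described above. Only the two corpus sizes (3.9M and 36M tokens) come from the paper; the chunk size, the size of each chunk-level output, and the context window are illustrative assumptions, and `aggregation_load` is a hypothetical helper.

```python
import math


def aggregation_load(total_tokens: int, chunk_tokens: int, notes_per_chunk: int) -> tuple[int, int]:
    """Return (number of chunks, total tokens of chunk-level notes to aggregate)."""
    n_chunks = math.ceil(total_tokens / chunk_tokens)
    return n_chunks, n_chunks * notes_per_chunk


# Assumed values, not taken from the paper.
CHUNK_TOKENS = 8_000       # assumed chunk size
NOTES_PER_CHUNK = 200      # assumed size of each chunk-level output
WINDOW_TOKENS = 128_000    # assumed context window of the aggregating model

# Corpus sizes reported for the two new benchmarks.
for corpus_tokens in (3_900_000, 36_000_000):
    n_chunks, note_tokens = aggregation_load(corpus_tokens, CHUNK_TOKENS, NOTES_PER_CHUNK)
    overflows = note_tokens > WINDOW_TOKENS
    print(f"{corpus_tokens:>10} tokens: {n_chunks} chunks, {note_tokens} note tokens, overflow={overflows}")
```

Under these assumed numbers the 36M-token corpus produces 4,500 chunks and 900,000 tokens of chunk-level notes, which no longer fit in the assumed window, so the aggregation step inherits the long-context problem it was meant to avoid.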

Core claim

SLIDERS extracts salient information from documents into a relational database, reconciles the resulting records for global coherence using provenance, extraction rationales, and metadata, and performs all subsequent reasoning through SQL queries over this persistent structured representation rather than over concatenated text.
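A minimal sketch of the idea in the core claim, assuming a toy schema: extracted facts are stored as rows alongside provenance and rationale columns, and the question is answered by SQL rather than by re-reading text. The table name, column names, and all rows are made up for illustration and are not the paper's schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per extracted fact, plus provenance columns.
conn.execute(
    """
    CREATE TABLE borrowings (
        company TEXT,
        fiscal_year INTEGER,
        long_term_borrowings REAL,  -- illustrative unit: millions
        source_doc TEXT,            -- provenance: which document the value came from
        quote TEXT,                 -- provenance: supporting text span
        rationale TEXT              -- why the extractor emitted this value
    )
    """
)

# Made-up rows standing in for the output of an LLM extraction step.
rows = [
    ("Company A", 2024, 120.0, "a_10k.pdf", "Long-term borrowings: 120.0", "balance sheet line item"),
    ("Company B", 2024, 30.5, "b_10k.pdf", "Long-term debt of 30.5", "notes to financial statements"),
    ("Company C", 2024, 0.0, "c_10k.pdf", "no long-term borrowings", "explicit statement of zero debt"),
    ("Company A", 2023, 100.0, "a_10k.pdf", "Prior year: 100.0", "comparative column"),
]
conn.executemany("INSERT INTO borrowings VALUES (?, ?, ?, ?, ?, ?)", rows)


def total_borrowings(db: sqlite3.Connection, year: int) -> float:
    """Answer an aggregate question with SQL over the structured state."""
    cur = db.execute(
        "SELECT SUM(long_term_borrowings) FROM borrowings WHERE fiscal_year = ?",
        (year,),
    )
    return cur.fetchone()[0]


def supporting_sources(db: sqlite3.Connection, year: int) -> list[str]:
    """Return the documents that back the answer, using the provenance column."""
    cur = db.execute(
        "SELECT DISTINCT source_doc FROM borrowings WHERE fiscal_year = ? ORDER BY source_doc",
        (year,),
    )
    return [r[0] for r in cur.fetchall()]


print(total_borrowings(conn, 2024))
print(supporting_sources(conn, 2024))
```

The point of the design is that the query cost depends on the number of rows, not on the number of source tokens, and the provenance columns let an answer be traced back to its documents.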

What carries the argument

The SLIDERS extraction-to-relational-schema pipeline followed by a provenance-driven reconciliation stage that produces a queryable database.
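A rough sketch of what a reconciliation pass over extracted records might look like. The paper's reconciliation procedure is not specified on this page, so the rule used here (group by a canonicalized key, keep the highest-confidence row, flag groups whose values disagree) is a simple stand-in and not the authors' method. `Record`, `reconcile`, and every field name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str           # assumed primary key, e.g. an entity name
    value: str         # the extracted value
    source: str        # provenance: where the value was extracted
    confidence: float  # assumed extractor confidence in [0, 1]


def reconcile(records: list[Record]) -> tuple[dict[str, Record], dict[str, list[str]]]:
    """Collapse each key group to one record and report groups with conflicting values."""
    groups: dict[str, list[Record]] = defaultdict(list)
    for rec in records:
        # Naive canonicalization so "Acme" and "acme " land in the same group.
        groups[rec.key.strip().lower()].append(rec)

    merged: dict[str, Record] = {}
    conflicts: dict[str, list[str]] = {}
    for key, rows in groups.items():
        # Stand-in rule: the highest-confidence row represents the group.
        merged[key] = max(rows, key=lambda r: r.confidence)
        distinct_values = {r.value for r in rows}
        if len(distinct_values) > 1:
            conflicts[key] = sorted(distinct_values)
    return merged, conflicts


# Made-up records: one duplicated fact and one conflicting fact.
records = [
    Record("Acme", "2010", "doc1", 0.9),
    Record("acme ", "2010", "doc2", 0.6),
    Record("Beta", "1999", "doc1", 0.5),
    Record("Beta", "2001", "doc3", 0.8),
]
merged, conflicts = reconcile(records)
print(merged)
print(conflicts)
```

Here the two "Acme" rows are treated as duplicates, while the "Beta" group is resolved in favor of the more confident row and also surfaced as a conflict so the disagreement is not silently lost.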

If this is right

  • Question-answering systems no longer need ever-larger context windows to handle growing document collections.
  • Reasoning can remain precise even when the total input exceeds millions of tokens because SQL operates on a compact, deduplicated state.
  • The same structured store can support multiple queries and iterative refinement without reprocessing the original documents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction-plus-reconciliation pattern could be applied to other long-input tasks such as multi-document summarization or timeline construction.
  • Hybrid systems might combine SLIDERS-style structured stores with direct LLM access to the original text for verification of edge cases.

Load-bearing premise

Salient information can be extracted from documents into a relational schema and reconciled without losing critical reasoning details or introducing errors that affect downstream answers.

What would settle it

A long-document benchmark in which the required answer depends on a subtle cross-document inference that the relational extraction step systematically drops or distorts, causing SLIDERS accuracy to fall below text-based baselines.

Figures

Figures reproduced from arXiv: 2604.22294 by Harshit Joshi, Jadelynn Dao, Monica S. Lam, Priyank Shethia.

Figure 1: Accuracy across long-context and ultra-long benchmarks. SLIDERS consistently outper…
Figure 2: Chunking based methods regenerate the long-context problem. SLIDERS mitigates it by…
Figure 3: SLIDERS overview showing the process of (1) Contextualization of chunks, (2) Inducing the schema, (3) Structured Extraction, (4) Data Reconciliation, and (5) Question Answering over the final database. An example from the Loong benchmark. […] and resolves inconsistencies into a unified representation. Our key insight is that this process can exploit how each record was produced: every entry is stored with its …
Figure 4: Model accuracy across difficulty levels and question types.
Figure 5: Accuracy vs. token length across benchmarks.
Figure 6: Analysis of the reconciliation agent across seven datasets: (a) density of per-group row…
Original abstract

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SLIDERS, a framework for scalable question answering over long document collections. It extracts salient information from documents into a relational database, applies a reconciliation stage that uses provenance, extraction rationales, and metadata to resolve duplicated, inconsistent, and incomplete records, and then reasons over the structured state with SQL queries rather than text concatenation or direct long-context inference. The central empirical claims are that SLIDERS outperforms all baselines on three existing long-context QA benchmarks (exceeding GPT-4.1 by 6.6 points on average) and delivers gains of ~19 and ~32 points on two newly introduced benchmarks at 3.9M and 36M tokens.

Significance. If the results hold under scrutiny, the work offers a promising direction for handling document collections that exceed any fixed context window while mitigating aggregation bottlenecks. The structured extraction-plus-SQL approach could provide more reliable multi-document reasoning than chunk-based or direct long-context methods, and the new large-scale benchmarks constitute a useful community resource.

major comments (3)
  1. [Methods] Methods section: The relational schema for extraction, the LLM prompts or procedures used to populate it, and the precise reconciliation algorithm are not described in sufficient detail. This is load-bearing for the central claim because the reported gains (including on short-context benchmarks where scale is not an issue) rest entirely on the fidelity of extraction and reconciliation; any systematic loss of nuanced cross-document implications would undermine the performance margins.
  2. [Experiments] Experiments section: No information is supplied on baseline implementations, exact evaluation metrics, number of runs, or statistical significance testing for the reported deltas (e.g., the 6.6-point average improvement over GPT-4.1 or the 19- and 32-point gains on the new benchmarks). Without these, the empirical claims cannot be verified or compared.
  3. [Abstract and §4] Abstract and §4 (results on existing benchmarks): The outperformance on three existing benchmarks that already fit within strong LLM context windows implies that the structured extraction/reconciliation step itself improves answer quality. The manuscript should include ablations or error analysis showing whether this step preserves or discards critical reasoning details, as the skeptic's concern about loss of implicit relations is directly relevant to interpreting the margins.
minor comments (1)
  1. [Abstract] Abstract: Explicitly listing the baselines and the sizes of the three existing benchmarks would improve clarity for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important gaps in reproducibility and analysis that we will address through targeted revisions. Below we respond point-by-point to the major comments.

  1. Referee: [Methods] Methods section: The relational schema for extraction, the LLM prompts or procedures used to populate it, and the precise reconciliation algorithm are not described in sufficient detail. This is load-bearing for the central claim because the reported gains (including on short-context benchmarks where scale is not an issue) rest entirely on the fidelity of extraction and reconciliation; any systematic loss of nuanced cross-document implications would undermine the performance margins.

    Authors: We agree that the current Methods section lacks sufficient detail for full reproducibility and to substantiate the performance claims. In the revised manuscript we will expand this section to include: (1) the complete relational schema with table definitions and field descriptions; (2) the full extraction prompts (with few-shot examples) used to populate each table; and (3) a precise, step-by-step description of the reconciliation algorithm, including pseudocode that shows how provenance, extraction rationales, and metadata are combined to detect and resolve duplicated, inconsistent, and incomplete records. These additions will directly address concerns about potential loss of nuanced cross-document relations. revision: yes

  2. Referee: [Experiments] Experiments section: No information is supplied on baseline implementations, exact evaluation metrics, number of runs, or statistical significance testing for the reported deltas (e.g., the 6.6-point average improvement over GPT-4.1 or the 19- and 32-point gains on the new benchmarks). Without these, the empirical claims cannot be verified or compared.

    Authors: We acknowledge this omission. The revised Experiments section will add: (1) detailed descriptions of all baseline implementations, including how GPT-4.1 and other models were prompted and any chunking strategies used; (2) the precise evaluation metrics (e.g., exact match, token-level F1); (3) the number of runs performed (we will report results averaged over five independent runs with different random seeds); and (4) statistical significance testing, including paired t-tests and 95% confidence intervals for all reported deltas. These changes will make the empirical claims verifiable and comparable. revision: yes

  3. Referee: [Abstract and §4] Abstract and §4 (results on existing benchmarks): The outperformance on three existing benchmarks that already fit within strong LLM context windows implies that the structured extraction/reconciliation step itself improves answer quality. The manuscript should include ablations or error analysis showing whether this step preserves or discards critical reasoning details, as the skeptic's concern about loss of implicit relations is directly relevant to interpreting the margins.

    Authors: We agree that the gains on short-context benchmarks warrant explicit analysis of the structured step's contribution. In the revised §4 we will add: (1) an ablation study comparing direct long-context LLM inference, extraction without reconciliation, and full SLIDERS; and (2) a qualitative error analysis that categorizes successes and failures with respect to preservation of implicit cross-document relations, supported by representative examples. This will clarify whether the structured approach preserves or discards critical reasoning details. revision: yes
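The second response above promises paired t-tests and 95% confidence intervals over five runs. A minimal sketch of that computation follows, using only the standard library. The per-run scores are made up for illustration and are not results from the paper; `paired_summary` is a hypothetical helper, and the hard-coded critical value 2.776 is the two-sided 95% Student's t value for 4 degrees of freedom, so it only applies to exactly five paired runs.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (five paired runs).
T_CRIT_95_DF4 = 2.776


def paired_summary(system: list[float], baseline: list[float], t_crit: float = T_CRIT_95_DF4):
    """Return (mean delta, t statistic, 95% CI) for paired per-run scores."""
    if len(system) != len(baseline):
        raise ValueError("paired test needs the same number of runs for both systems")
    deltas = [s - b for s, b in zip(system, baseline)]
    mean_delta = statistics.mean(deltas)
    # Standard error of the mean difference; zero if every delta is identical.
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    if sem == 0:
        raise ValueError("all deltas are identical; the t statistic is undefined")
    t_stat = mean_delta / sem
    ci = (mean_delta - t_crit * sem, mean_delta + t_crit * sem)
    return mean_delta, t_stat, ci


# Made-up accuracy scores for five seeds (illustrative only).
system_scores = [52.0, 54.0, 51.0, 53.0, 55.0]
baseline_scores = [50.0, 51.0, 50.0, 50.0, 52.0]

mean_delta, t_stat, ci = paired_summary(system_scores, baseline_scores)
print(f"mean delta = {mean_delta:.2f}, t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A delta is significant at the 5% level under this test when the absolute t statistic exceeds the critical value, which is equivalent to the confidence interval excluding zero.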

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks


The paper presents SLIDERS as a system that extracts document content into a relational schema, reconciles records using provenance and metadata, and then answers via SQL. All load-bearing claims are empirical: measured accuracy on three existing long-context QA benchmarks (where inputs fit in LLM windows) plus two new large-scale sets (3.9M and 36M tokens). These are compared against external baselines including GPT-4.1. No equations, fitted parameters, or first-principles derivations appear. No self-citations are invoked to justify uniqueness or to close a logical loop. The extraction/reconciliation step is an engineering choice whose correctness is tested by end-to-end accuracy rather than assumed by definition. Therefore the reported gains do not reduce to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that document content can be projected into a relational schema without significant loss of reasoning-relevant detail, and that provenance metadata suffices to resolve inconsistencies.

axioms (1)
  • domain assumption Salient information from documents can be extracted into a relational schema without significant loss of reasoning capability.
    The framework replaces text-based reasoning with SQL queries over this extracted state.

pith-pipeline@v0.9.0 · 5527 in / 1206 out tokens · 59919 ms · 2026-05-08T11:46:24.698448+00:00 · methodology

