Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning

Andrea W Wen-Yi; David Mimno; Derek Hommel; Unso Eun Seo Jo; Yusuf Denizay Dönder

arxiv: 2505.14174 · v2 · submitted 2025-05-20 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-22 13:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords text-to-SQL · large language models · schema representations · consistency methods · cost-efficient inference · BIRD benchmark · natural language to code

The pith

Using multiple representations of the same schema lets smaller LLMs produce robust text-to-SQL outputs without chain-of-thought or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents N-rep consistency, a method that feeds an LLM several different views of the database schema and aggregates the resulting queries. This yields BIRD benchmark performance comparable to far more expensive techniques that rely on reasoning chains or model training. Readers would care because the reported per-query cost is $0.039, roughly one-twelfth of the up to $0.46 per query that the abstract cites for the costliest methods, while using only off-the-shelf smaller models. The central idea is that diversity in schema presentation compensates for the weaknesses of any single view.

Core claim

N-rep consistency generates multiple distinct representations of the input schema, runs the same base LLM on each representation, and combines the outputs to produce a final SQL query. The method achieves similar accuracy to approaches that use chain-of-thought, self-consistency, or fine-tuning, yet requires no reasoning steps, no task-specific training, and only a small number of LLM calls per query.

What carries the argument

N-rep consistency, which creates varied schema representations of the same database and aggregates LLM outputs across those representations to improve robustness.
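
The mechanism can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Table` type, the two renderer functions, the prompt wording, and the `fake_llm` stub are all hypothetical. The two text layouts mimic ones that appear in the figure text on this page (a bracketed `# Table:` listing and a `table '…' with columns:` listing); a real system would call an actual model where the stub is.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Table:
    name: str
    columns: List[Tuple[str, str, str]]  # (column name, SQL type, description)


def render_bracketed(tables: List[Table]) -> str:
    """Bracketed listing: a '# Table:' header, then (column, description) pairs."""
    blocks = []
    for t in tables:
        cols = "\n".join(f"  ({name}, {desc})," for name, _type, desc in t.columns)
        blocks.append(f"# Table: {t.name}\n[\n{cols}\n]")
    return "\n".join(blocks)


def render_typed(tables: List[Table]) -> str:
    """Sentence-like listing: table name followed by typed columns."""
    lines = []
    for t in tables:
        cols = ", ".join(f"{name} ({sql_type})" for name, sql_type, _desc in t.columns)
        lines.append(f"table '{t.name}' with columns: {cols}")
    return "\n".join(lines)


# Hypothetical registry: one entry per schema representation.
RENDERERS: Dict[str, Callable[[List[Table]], str]] = {
    "bracketed": render_bracketed,
    "typed": render_typed,
}


def generate_candidates(
    question: str, tables: List[Table], llm: Callable[[str], str]
) -> Dict[str, str]:
    """Ask the same model once per schema representation; no reasoning prompt."""
    candidates = {}
    for rep_name, render in RENDERERS.items():
        prompt = f"{render(tables)}\n\nQuestion: {question}\nSQL:"
        candidates[rep_name] = llm(prompt).strip()
    return candidates


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): always returns one fixed query.
    return "SELECT COUNT(*) FROM loan WHERE status = 'A'"


schema = [Table("loan", [("amount", "INTEGER", "amount."), ("status", "TEXT", "status.")])]
candidates = generate_candidates("How many loans have status A?", schema, fake_llm)
```

Each representation yields one candidate query. How the candidates are combined is a separate design choice that this sketch leaves open.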

If this is right

Smaller and cheaper models can reach competitive text-to-SQL accuracy without fine-tuning.
Per-query costs drop to approximately $0.039 while matching more expensive methods.
Robustness improves on queries that expose weaknesses in any single schema view.
No chain-of-thought prompting or task-specific training is required to obtain the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-representation idea may help other structured generation tasks where input formatting strongly affects output quality.
N-rep could be stacked with existing consistency techniques to explore further cost-accuracy trade-offs.
Practical deployments on modest hardware become more feasible for natural-language database interfaces.

Load-bearing premise

That combining outputs across multiple schema representations will reliably improve accuracy and robustness on unseen queries without introducing new failure modes.

What would settle it

Running N-rep and single-representation baselines on a new set of complex or out-of-distribution queries and finding that the multi-representation version shows no accuracy gain or higher error rates.

Figures

Figures reproduced from arXiv: 2505.14174 by Andrea W Wen-Yi, David Mimno, Derek Hommel, Unso Eun Seo Jo, Yusuf Denizay Dönder.

**Figure 2.** Overview of the N-rep approach for Text-to-SQL generation.

**Figure 3.** Comparison of N-rep, Self Consistency with …

**Figure 4.** Upper and lower bounds of N-rep, Self Con…

**Figure 5.** EX by vote count for the selected candidate

**Figure 6.** System prompt for candidate generation. Extracted text: INSTRUCTIONS: You write SQL queries for a sqlite database. Users are querying their company database, and your task is to assist by generating valid SQL queries strictly adhering to the database schema provided. The user will provide you with a query intent, an SQL template, and optionally an hint to help create the correct SQL. They may also provide a set …

**Figure 7.** XiYan-SQL M-SCHEMA format. Extracted text:

    # Table: account
    [ (account_id, account id.), (district_id, location of branch.), ]
    # Table: district
    [ (district_id, location of branch.), (A11, average salary.), ]
    # Table: loan
    [ (amount, amount.), (status, status. Value examples: ['C', 'A', 'D', 'B'].) ]

**Figure 8.** MAC-SCHEMA format

**Figure 9.** DDL ("SQL CREATE") format. Extracted text:

    table 'account' with columns: account_id (INTEGER), district_id (INTEGER)
    table 'district' with columns: district_id (INTEGER), A11 (INTEGER)
    table 'loan' with columns: amount (INTEGER), status (TEXT)
    Relations: account.district_id -> district.district_id

**Figure 10.** DIN-SQL style format

**Figure 11.** Raw JSON format

**Figure 12.** Python SQLAlchemy format

**Figure 13.** Schema linker output. Extracted text:

    table 'users' with columns: user_id (INTEGER), name (TEXT), email (TEXT), created_at (DATE)
    table 'products' with columns: product_id (INTEGER), name (TEXT), price (DECIMAL), stock (INTEGER)
    table 'orders' with columns: order_id (INTEGER), user_id (INTEGER), product_id (INTEGER), quantity (INTEGER), order_date (DATE)
    Relations: orders.user_id -> users.user_id
    orders.product_id -> prod…

**Figure 14.** No filtering example (full schema). Extracted text:

    table 'users' with columns: user_id (INTEGER), name (TEXT), email (TEXT), created_at (DATE)
    table 'orders' with columns: order_id (INTEGER), user_id (INTEGER), product_id (INTEGER), quantity (INTEGER), order_date (DATE)
    Relations: orders.user_id -> users.user_id

**Figure 15.** Table-only filtering example. Extracted text:

    table 'users' with columns: user_id (INTEGER), name (TEXT)
    table 'orders' with columns: user_id (INTEGER), order_date (DATE)
    Relations: orders.user_id -> users.user_id

**Figure 16.** Full filtering example
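
The last four captions name a schema linker and three levels of schema filtering. Below is a minimal sketch of how such levels could be produced from a linker's output, assuming the natural reading of the names: table-only filtering keeps every column of the linked tables, and full filtering keeps only the linked columns. The function name `filter_schema`, the dictionary layout, and the `linked` input are illustrative assumptions, not the paper's code.

```python
from typing import Dict, List

Schema = Dict[str, List[str]]  # table name -> column names


def filter_schema(schema: Schema, linked: Schema, level: str) -> Schema:
    """Return the part of `schema` kept at the given filtering level.

    `linked` maps each table the schema linker selected to the columns it
    selected in that table (hypothetical input format).
    """
    if level == "none":
        # No filtering: pass the full schema through unchanged.
        return {t: list(cols) for t, cols in schema.items()}
    if level == "table":
        # Table-only filtering: drop unlinked tables, keep all columns of the rest.
        return {t: list(cols) for t, cols in schema.items() if t in linked}
    if level == "full":
        # Full filtering: keep only linked columns of linked tables.
        return {
            t: [c for c in cols if c in linked[t]]
            for t, cols in schema.items()
            if t in linked
        }
    raise ValueError(f"unknown filtering level: {level}")


schema = {
    "users": ["user_id", "name", "email", "created_at"],
    "products": ["product_id", "name", "price", "stock"],
    "orders": ["order_id", "user_id", "product_id", "quantity", "order_date"],
}
linked = {"users": ["user_id", "name"], "orders": ["user_id", "order_date"]}

for level in ("none", "table", "full"):
    print(level, filter_schema(schema, linked, level))
```

Looser filtering keeps more context at a higher token cost, while tighter filtering risks dropping a column the query needs. That trade-off is one reason to vary the level across prompts.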

Original abstract

LLMs are effective at code generation tasks like text-to-SQL, but is it worth the cost? Many state-of-the-art approaches use non-task-specific LLM techniques including Chain-of-Thought (CoT), self-consistency, and fine-tuning. These methods can be costly at inference time, sometimes requiring over a hundred LLM calls with reasoning, incurring average costs of up to $0.46 per query, while fine-tuning models can cost thousands of dollars. We introduce "N-rep" consistency, a more cost-efficient text-to-SQL approach that achieves similar BIRD benchmark scores as other more expensive methods, at only $0.039 per query. N-rep leverages multiple representations of the same schema input to mitigate weaknesses in any single representation, making the solution more robust and allowing the use of smaller and cheaper models without any reasoning or fine-tuning. To our knowledge, N-rep is the best-performing text-to-SQL approach in its cost range.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

N-rep consistency gives a practical low-cost prompting method for text-to-SQL by varying schema inputs to match expensive baselines on BIRD.

The main thing to know is that this paper introduces N-rep consistency, which runs multiple schema representations through a small model and combines the outputs to reach BIRD scores close to chain-of-thought or fine-tuned approaches while dropping the per-query cost to $0.039 with no reasoning steps or tuning required. The core idea treats schema variation itself as the source of robustness rather than adding more calls or training. This is a legitimate new prompting variant that sidesteps the usual heavy techniques.

The paper does well at framing the real deployment constraint around inference cost and showing that a simpler input change can close much of the gap. The efficiency numbers and direct comparison to costlier baselines are the useful part for anyone who has to run these systems at volume.

The soft spots are minor but worth noting. The aggregation rule for the multiple outputs is described at a high level, so it is not yet clear how sensitive the gains are to that choice or whether it creates new failure modes on certain query types. A bit more on statistical significance and query-level breakdowns would make the robustness claim tighter.

This paper is for practitioners building text-to-SQL tools or similar generation interfaces who care about keeping API bills down. A reader working on applied prompting or cost-aware NLP systems will get concrete value from the method and the reported tradeoff. It deserves a serious referee because the claim is testable, the setup is reproducible in principle, and the efficiency angle is worth checking in detail. I would send it out for review.

Referee Report

3 major / 2 minor

Summary. The paper introduces N-rep consistency, a prompting technique for text-to-SQL that generates multiple distinct representations of the same database schema, feeds each to a small LLM without Chain-of-Thought or self-consistency, and combines the resulting SQL outputs. It claims this yields BIRD benchmark execution accuracy comparable to far more expensive baselines (CoT, self-consistency, fine-tuning) at an average cost of $0.039 per query, with no task-specific tuning, and positions N-rep as the strongest performer in its cost range.

Significance. If the empirical results are robust, the work demonstrates that input diversity across schema representations can substitute for expensive reasoning chains or model adaptation in structured generation tasks. This has clear practical value for lowering inference costs in text-to-SQL deployments while preserving accuracy, and offers a generalizable insight for efficient prompting strategies in other code-generation domains.

major comments (3)

[Abstract and §4 (Experiments)] The central claim of 'comparable BIRD scores' is not supported by any reported numerical values, standard deviations, baseline scores, or statistical significance tests in the provided description; without these, equivalence to $0.46/query methods cannot be verified.
[§3 (Method)] The combination rule for aggregating outputs across the N schema representations is described only at a high level; it is unclear whether this rule (e.g., majority vote, selection by execution success) is fixed or requires any validation on BIRD, which directly affects the claim of 'no task-specific tuning'.
[§4 and §5] No error analysis or breakdown of failure modes is presented, leaving open whether N-rep introduces new errors on unseen queries that single-representation prompting avoids, which is load-bearing for the robustness claim.

minor comments (2)

[Abstract] The abstract would be strengthened by stating the exact BIRD execution accuracy achieved by N-rep and the number of representations used.
[§3] Notation for the multiple schema representations and the combination function should be formalized with equations or pseudocode for reproducibility.
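
One way to address the request for significance testing is a paired bootstrap over per-question correctness. The sketch below is an illustration under stated assumptions: the 0/1 vectors are made-up toy data, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Observed accuracy difference (A - B) and a percentile bootstrap interval.

    Both lists hold 1 (execution match) or 0 per question, in the same order,
    so each resample keeps the pairing between the two systems.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


# Toy data only: 1 = correct, 0 = incorrect for ten hypothetical questions.
system_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]
observed, lo, hi = paired_bootstrap_ci(system_a, system_b)
print(f"difference = {observed:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero suggests the difference is not just resampling noise. A claim of comparability would additionally want the interval to be narrow.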

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each of the major comments point by point below.

Referee: [Abstract and §4 (Experiments)] The central claim of 'comparable BIRD scores' is not supported by any reported numerical values, standard deviations, baseline scores, or statistical significance tests in the provided description; without these, equivalence to $0.46/query methods cannot be verified.

Authors: We agree that the abstract and §4 would benefit from more explicit numerical support. The full experimental results in §4 report BIRD execution accuracies for N-rep alongside the higher-cost baselines. We will revise the abstract to include the key accuracy figures and costs, add standard deviations from repeated runs where available, and note the observed differences to allow direct verification of comparability.

revision: yes

Referee: [§3 (Method)] The combination rule for aggregating outputs across the N schema representations is described only at a high level; it is unclear whether this rule (e.g., majority vote, selection by execution success) is fixed or requires any validation on BIRD, which directly affects the claim of 'no task-specific tuning'.

Authors: The aggregation uses a fixed majority vote over the SQL outputs from the N representations, with execution success on the database used to resolve ties or filter invalid queries. This rule is predetermined and applied uniformly without any hyperparameter search or validation on the BIRD development set. We will expand the description in the revised §3 to make the exact procedure explicit and reaffirm the absence of task-specific tuning.

revision: yes

Referee: [§4 and §5] No error analysis or breakdown of failure modes is presented, leaving open whether N-rep introduces new errors on unseen queries that single-representation prompting avoids, which is load-bearing for the robustness claim.

Authors: We recognize that a dedicated error analysis would strengthen the robustness argument. We will add a subsection to §5 that categorizes failure modes on a sample of BIRD test queries, directly comparing N-rep outputs to single-representation baselines to show that N-rep mitigates representation-specific errors without introducing new failure types on unseen queries.

revision: yes
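
The rule described in the simulated response to the §3 comment (majority vote, with execution used to filter invalid queries) can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions: `select_by_execution_vote` is a hypothetical name, and grouping candidates by their sorted result rows is only one possible choice. The paper passage quoted elsewhere on this page describes a two-stage strategy that also involves LLM-based pairwise voting, which is not modeled here.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Optional


def select_by_execution_vote(conn: sqlite3.Connection, candidates: List[str]) -> Optional[str]:
    """Pick the candidate whose execution result is shared by the most candidates.

    Candidates that raise a database error are dropped. Ties go to the
    group whose first member appears earliest in `candidates`.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL gets no vote
        # Order-insensitive fingerprint of the result set.
        key = repr(sorted(rows, key=repr))
        groups[key].append(sql)
    if not groups:
        return None
    # max() returns the first maximal group; dicts preserve insertion order.
    best = max(groups.values(), key=len)
    return best[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan (amount INTEGER, status TEXT)")
conn.executemany("INSERT INTO loan VALUES (?, ?)", [(100, "A"), (250, "B"), (75, "A")])

candidates = [
    "SELECT COUNT(*) FROM loan WHERE status = 'A'",
    "SELECT COUNT(*) FROM loan",  # valid, but a different answer
    "SELECT COUNT(amount) FROM loan WHERE status = 'A'",
    "SELECT COUNT(*) FROM loans",  # invalid: no such table
]
chosen = select_by_execution_vote(conn, candidates)
print(chosen)
```

Voting on execution results rather than on SQL text lets syntactically different but equivalent queries support each other. In a real deployment the candidates should run against a read-only connection.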

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an empirical prompting method (N-rep consistency) that feeds multiple schema representations into a small LLM and combines outputs, evaluated directly on the external BIRD benchmark. No mathematical derivation, equations, or parameter-fitting steps are present that reduce to self-definition or fitted inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core claim. The result is presented as an experimental outcome at a stated cost point, with the central performance comparison relying on external benchmarks rather than internal redefinition or circular reduction. This is a standard applied prompting paper with self-contained empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that multiple schema representations improve robustness for LLMs on text-to-SQL; this is a domain-specific prompting heuristic rather than a derived result.

axioms (1)

domain assumption: LLMs produce more reliable structured outputs when given varied but equivalent input representations of the same underlying schema.
Invoked to justify why N-rep mitigates single-representation weaknesses; stated in the abstract description of the method.

pith-pipeline@v0.9.0 · 5719 in / 1177 out tokens · 49672 ms · 2026-05-22T13:45:13.293346+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: N-rep leverages multiple representations of the same schema input to mitigate weaknesses in any single representation... without any reasoning or fine-tuning.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: N-rep adopts a confidence-aware two-stage candidate selection strategy that combines regular self-consistency voting with CHASE-SQL’s LLM-based pair-wise voting.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
cs.DB · 2026-03 · unverdicted · novelty 7.0

SpotIt+ uses verification to find realistic counterexample databases that expose discrepancies between generated and gold SQL queries missed by standard test-based evaluation on the BIRD dataset.

Data-aware candidate selection in NL2SQL translation via small separating instances
cs.DB · 2026-05 · unverdicted · novelty 6.0

A selection technique based on separating instances and provenance outperforms baselines for choosing among 2-3 NL2SQL candidates on a BIRD-DEV subset without consistency scores.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1] E-SQL: Direct schema linking via question enrichment in text-to-SQL. arXiv preprint arXiv:2409.16751.

[2] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-SQL empowered by large language models: A benchmark evaluation. Preprint, arXiv:2308.15363.

[3] Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yuntao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li. A preview of XiYan-SQL: A multi-generator ensemble framework for text-to-SQL. Preprint, arXiv:2411.08599.

[4] GPT-4o system card. arXiv preprint arXiv:2410.21276.

[5] The death of schema linking? Text-to-SQL in the age of well-reasoned language models. arXiv preprint arXiv:2408.07702.

[6] DTS-SQL: Decomposed text-to-SQL with small large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8212–8220, Miami, Florida, USA. Association for Computational Linguistics.

[7] Before generation, align it! A novel and effective strategy for mitigating hallucinations in text-to-SQL generation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5456–5471, Bangkok, Thailand. Association for Computational Linguistics.

[8] Nils Reimers, Elliott Choi, Amr Kayid, Alekhya Nandula, Manoj Govindassamy, and Abdullah Elkady. Introducing Embed v3. Accessed: 2025-05-

[9] CHESS: Contextual harnessing for efficient SQL synthesis. Preprint, arXiv:2405.16755.

[10] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, and 1 others. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

[11] Qwen3 technical report. Preprint, arXiv:2505.09388.

[12] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

A Methodology for performance and cost analysis. A.1 Models Chosen For ...
[13] For the o3-mini experiments, we used Azure OpenAI Service pricing of $1.10 per 1M input tokens and $4.40 per 1M output tokens. For E-SQL, which used GPT-4o (Hurst et al., 2024), $2.50 per 1M input tokens and $10.00 per 1M output tokens. For Gemini 2.5 Pro and CHESS, which uses Gemini 1.5 Pro, we applied Gemini Developer API pricing: $1.25 per 1M input ...
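
To make the per-token prices concrete, the sketch below turns token counts into a dollar cost. The prices are the ones quoted in the pricing text above. The `PRICES` table, the `query_cost` helper, and the token counts in the usage lines are made up for illustration and are not figures from the paper.

```python
from typing import Dict, Tuple

# USD per 1M tokens as (input, output), taken from the pricing text above.
PRICES: Dict[str, Tuple[float, float]] = {
    "o3-mini": (1.10, 4.40),
    "gpt-4o": (2.50, 10.00),
}


def query_cost(model: str, input_tokens: int, output_tokens: int, calls: int = 1) -> float:
    """Dollar cost of `calls` identical LLM calls made for one query."""
    price_in, price_out = PRICES[model]
    per_call = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return calls * per_call


# Hypothetical workload: 5 calls, each with 3,000 input and 200 output tokens.
cost = query_cost("o3-mini", input_tokens=3_000, output_tokens=200, calls=5)
print(f"${cost:.4f} per query")
```

Input tokens dominate in this toy workload because the schema is resent on every call, so the number of calls and the compactness of each representation are the main cost levers.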
