CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Pith reviewed 2026-05-16 19:22 UTC · model grok-4.3
The pith
LLMs produce syntactically valid SQL for cricket analytics questions yet rarely capture the intended meaning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CricBench supplies 2,654 curated natural-language questions paired with gold-standard SQL answers, covering four cricket formats and four languages. When seven LLMs are evaluated with schema-only prompts, execution accuracy (syntactic validity) stays above 98 percent while DMA (semantic correctness) stays below 29 percent. The result is a persistent 37-55 percentage point domain gap versus the general BIRD benchmark, and every model scores zero on hard ODI queries.
What carries the argument
The CricBench dataset of 2,654 gold-standard question-SQL pairs, evaluated through schema-only prompting and dual metrics of execution accuracy plus DMA semantic match.
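The difference between the two metrics can be illustrated with a minimal sketch. It assumes SQLite as the engine, a hypothetical toy `innings` table, and an order-insensitive comparison of result rows as a stand-in for DMA; the paper's exact DMA definition is not given here, so the matching rule below is an assumption, not the authors' metric.

```python
import sqlite3

def run_query(conn, sql):
    """Return (ok, rows); ok is False if SQLite raises an error."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False, None

def score(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql).

    Returns (execution_rate, match_rate). The match rule is an assumed
    stand-in for DMA: result rows must be equal, ignoring row order.
    """
    executed = matched = 0
    for predicted_sql, gold_sql in pairs:
        ok, predicted_rows = run_query(conn, predicted_sql)
        if not ok:
            continue
        executed += 1
        gold_ok, gold_rows = run_query(conn, gold_sql)
        if gold_ok and sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr):
            matched += 1
    total = len(pairs)
    return executed / total, matched / total

# Hypothetical toy data, not taken from CricBench.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE innings (batter TEXT, runs INTEGER, match_format TEXT)")
conn.executemany(
    "INSERT INTO innings VALUES (?, ?, ?)",
    [("A", 50, "ODI"), ("A", 30, "T20I"), ("B", 70, "ODI")],
)

gold = "SELECT SUM(runs) FROM innings WHERE batter = 'A' AND match_format = 'ODI'"
pairs = [
    # Executes and returns the same rows as the gold query.
    (gold, gold),
    # Executes, but drops the format filter and so answers a different question.
    ("SELECT SUM(runs) FROM innings WHERE batter = 'A'", gold),
]
execution_rate, match_rate = score(conn, pairs)
print(execution_rate, match_rate)
```

In this toy case both predicted queries run, but only one returns the intended answer, which is the same pattern the benchmark reports at scale.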
If this is right
- No single model leads on all formats, so format-specific fine-tuning or prompting will be required.
- The large gap between execution accuracy and DMA shows that syntactic correctness alone is insufficient for usable cricket analytics tools.
- Multilingual performance remains low across Hindi, Punjabi, and Telugu, indicating that language-specific domain knowledge is still missing.
- Zero scores on hard ODI queries identify a concrete class of questions that current models cannot handle.
Where Pith is reading between the lines
- Similar benchmarks for other high-interest sports would likely expose comparable semantic gaps.
- Reliable cricket analytics tools will require either larger domain-specific training sets or new reasoning mechanisms beyond current prompting.
- The benchmark could serve as a testbed for measuring progress in domain adaptation techniques for Text-to-SQL.
Load-bearing premise
The 2,654 curated questions and their gold SQL answers accurately represent the distribution of questions that cricket fans and analysts actually ask.
What would settle it
A model that reaches DMA above 60 percent on the CricBench test set while keeping execution accuracy above 95 percent would falsify the claimed semantic gap.
Original abstract
Cricket is the second most popular sport worldwide, with billions of fans seeking advanced statistical insights unavailable through standard web searches. Although LLMs have advanced significantly in Text-to-SQL tasks, their capability to handle domain-specific nuances and multilingual requirements in sports analytics remains under-explored. We present CricBench, a benchmark suite evaluating the intrinsic SQL generation abilities of LLMs on cricket data across four formats: Test, ODI, T20I, and IPL. We curate a Gold-Standard dataset of 2,654 evaluation instances across four languages (English, Hindi, Punjabi, and Telugu). We evaluate seven models, GPT-5 Mini, Claude Sonnet 4, DeepSeek R1 and V3, Qwen 235B, Llama 3.1, and Gemma 2, using schema-only prompting. No single model dominates across all formats: GPT-5 Mini leads on Test cricket (12.4% DMA), Qwen 235B leads on IPL (28.7%) and T20I (17.5%), and all models score 0% on hard ODI queries. All models show a stark disconnect between syntactic validity (>98% execution accuracy) and semantic correctness (<29% DMA), with a domain gap of 37-55 percentage points versus BIRD. To our knowledge, CricBench is the first Text-to-SQL benchmark for cricket analytics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CricBench as the first Text-to-SQL benchmark for cricket analytics, curating 2,654 gold-standard instances across Test, ODI, T20I, and IPL formats in English, Hindi, Punjabi, and Telugu. It evaluates seven LLMs (including GPT-5 Mini, Claude Sonnet 4, and others) with schema-only prompting, reporting >98% execution accuracy but <29% DMA semantic correctness, a 37-55pp gap versus BIRD, zero scores on hard ODI queries, and format-specific model leaders.
Significance. If the dataset curation proves robust and representative, CricBench would usefully expose the syntactic-semantic disconnect in domain-specific, multilingual Text-to-SQL for sports analytics and could guide targeted improvements in LLM handling of cricket statistics.
major comments (1)
- [Dataset construction] Dataset construction section: the description of how the 2,654 queries were sourced, how gold SQL was authored and verified, and what validation steps (expert review, translation fidelity checks for Hindi/Punjabi/Telugu, inter-annotator agreement) were applied is absent or insufficient. Because the central claims (first cricket benchmark, reported DMA scores, domain gap, and per-format rankings) rest directly on query realism and gold correctness, this omission makes the performance numbers difficult to interpret and risks conflating model limitations with dataset artifacts.
minor comments (2)
- [Abstract] Abstract: expand 'DMA' on first use and briefly define the metric.
- [Evaluation] Evaluation setup: include the exact schema-only prompt template and any few-shot examples to support reproducibility.
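To make the reproducibility request concrete, here is a minimal sketch of what a schema-only, zero-shot prompt builder could look like. The wording of the template, the function name `build_schema_only_prompt`, and the example schema are all hypothetical; the paper's actual prompt is not reproduced here.

```python
def build_schema_only_prompt(schema_ddl, question):
    """Assemble a zero-shot prompt containing only the schema and the question.

    No few-shot examples, column descriptions, or domain hints are added,
    which is what "schema-only" is assumed to mean in this sketch.
    """
    return (
        "You are given the following database schema:\n"
        f"{schema_ddl.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Return a single SQL query that answers the question.\n"
        "SQL:"
    )

# Hypothetical schema and question for illustration only.
example_schema = "CREATE TABLE innings (batter TEXT, runs INTEGER, match_format TEXT);"
example_prompt = build_schema_only_prompt(
    example_schema, "How many ODI runs has batter A scored?"
)
print(example_prompt)
```

Publishing the literal template in this form would let others confirm that no cricket-specific hints reach the model through the prompt.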
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The primary concern regarding insufficient detail in the dataset construction section is valid and directly impacts interpretability of our results. We will revise the manuscript to address this fully.
Point-by-point responses
- Referee: [Dataset construction] Dataset construction section: the description of how the 2,654 queries were sourced, how gold SQL was authored and verified, and what validation steps (expert review, translation fidelity checks for Hindi/Punjabi/Telugu, inter-annotator agreement) were applied is absent or insufficient. Because the central claims (first cricket benchmark, reported DMA scores, domain gap, and per-format rankings) rest directly on query realism and gold correctness, this omission makes the performance numbers difficult to interpret and risks conflating model limitations with dataset artifacts.
Authors: We agree that the current manuscript lacks sufficient detail on dataset construction, which is essential for validating the benchmark's realism and the reported performance gaps. In the revised version, we will expand the relevant section to describe:
(1) sourcing of the 2,654 queries from official cricket statistics repositories (e.g., ESPNcricinfo, official IPL/T20I databases) combined with expert-designed templates covering common analytics scenarios across Test, ODI, T20I, and IPL formats;
(2) authoring of gold SQL by a team of three cricket domain experts with SQL proficiency, following a standardized schema derived from cricket data models;
(3) verification through independent review by two additional SQL experts and one cricket statistician, with iterative corrections until consensus;
(4) for the multilingual component, native-speaker translation of English queries followed by fidelity checks via back-translation and semantic equivalence validation by bilingual linguists; and
(5) inter-annotator agreement computed on a 10% sample using Cohen's kappa (target >0.8) for both query intent and SQL correctness.
These additions will allow readers to assess potential artifacts and strengthen confidence in the domain gap findings versus BIRD.
Revision: yes
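The inter-annotator agreement check named in the rebuttal can be sketched as follows. This is a plain implementation of the standard two-rater Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e); the label names and the ten example judgments are hypothetical and are not drawn from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on whether ten gold SQL queries are correct.
annotator_1 = ["correct"] * 8 + ["incorrect"] * 2
annotator_2 = ["correct"] * 7 + ["incorrect"] * 3
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 9 of 10 items, yet chance agreement is 0.62, so kappa comes out near 0.74 and would fall short of the stated 0.8 target. This is why raw percent agreement alone would overstate annotation quality.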
Circularity Check
Empirical benchmark creation with no derivation chain or self-referential reductions
Full rationale
The paper constructs and releases CricBench as an original multilingual Text-to-SQL dataset for cricket analytics, then reports model performance metrics (execution accuracy, DMA) on it. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on dataset curation and direct evaluation rather than any chain that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This is a standard empirical benchmark paper whose validity depends on external verification of the 2,654 instances, not on internal circular logic.