
arxiv: 2512.21877 · v3 · submitted 2025-12-26 · 💻 cs.CL · cs.AI

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Pith reviewed 2026-05-16 19:22 UTC · model grok-4.3

keywords cricket analytics · Text-to-SQL · LLM evaluation · multilingual benchmark · semantic correctness · sports data · SQL generation · domain gap

The pith

LLMs produce syntactically valid SQL for cricket analytics questions yet rarely capture the intended meaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CricBench as the first Text-to-SQL benchmark built specifically for cricket data across Test, ODI, T20I, and IPL formats. It supplies 2,654 gold-standard question-SQL pairs in English, Hindi, Punjabi, and Telugu, then tests seven current models under schema-only prompting. The central finding is that execution accuracy exceeds 98 percent while semantic correctness measured by DMA stays below 29 percent, producing a 37-to-55-point gap relative to the general BIRD benchmark. No model leads across every format, and all models score zero on the hardest ODI questions. The work therefore establishes a concrete domain gap between general SQL generation and the semantic demands of real cricket analytics.

Core claim

CricBench supplies 2,654 curated natural-language questions paired with gold-standard SQL answers covering four cricket formats and four languages. When seven LLMs are evaluated with schema-only prompts, execution accuracy (the paper's measure of syntactic validity) stays above 98 percent while semantic correctness, measured by DMA, stays below 29 percent. This reveals a persistent 37-55 percentage point domain gap versus the general BIRD benchmark, and every model scores zero on hard ODI queries.
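To make the disconnect concrete, here is a minimal sketch of how a query can run cleanly and still answer the wrong question. It is not the paper's evaluation harness. The `innings` table, the toy rows, and the `score` helper are hypothetical, and the result-set comparison is only an assumed stand-in for DMA, which the text above does not define.

```python
import sqlite3

def run(conn, sql):
    """Return (executed_ok, rows); rows are sorted so row order does not matter."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False, None
    return True, sorted(rows, key=repr)

def score(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql). Returns (execution_rate, match_rate)."""
    executed = matched = 0
    for predicted, gold in pairs:
        ok, pred_rows = run(conn, predicted)
        _, gold_rows = run(conn, gold)
        executed += ok
        # Assumed stand-in for a semantic metric: exact result-set equality.
        matched += ok and pred_rows == gold_rows
    n = len(pairs)
    return executed / n, matched / n

# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE innings (player TEXT, format TEXT, runs INTEGER, dismissed INTEGER);
    INSERT INTO innings VALUES
        ('A', 'ODI', 50, 1), ('A', 'ODI', 70, 0), ('A', 'Test', 10, 1),
        ('B', 'ODI', 40, 1), ('B', 'ODI', 20, 1);
""")

# Gold: batting average is runs divided by dismissals, restricted to ODIs.
gold = """SELECT player, SUM(runs) * 1.0 / SUM(dismissed) FROM innings
          WHERE format = 'ODI' GROUP BY player"""
# Prediction: valid SQL, but it averages per innings and drops the format filter.
predicted = "SELECT player, AVG(runs) FROM innings GROUP BY player"

execution_rate, match_rate = score(conn, [(predicted, gold), (gold, gold)])
print(execution_rate, match_rate)
```

In this toy case both queries execute, but only one of the two result sets matches its gold answer. That is the same pattern the benchmark reports at scale: the cricket-specific definition (average per dismissal, not per innings) is what the prediction misses.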

What carries the argument

The CricBench dataset of 2,654 gold-standard question-SQL pairs, evaluated through schema-only prompting and dual metrics of execution accuracy plus DMA semantic match.
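The paper's exact prompt template is not reproduced here (the referee report asks for it), so the following is only a sketch of what schema-only prompting could look like. The model sees the table definitions and the question, with no sample rows and no few-shot examples. The function name `schema_only_prompt`, the two tables, and the prompt wording are all assumptions.

```python
import sqlite3

def schema_only_prompt(conn, question):
    """Build a prompt holding only the CREATE TABLE statements and the question."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    schema = "\n\n".join(sql for (sql,) in rows)
    return (
        "Database schema:\n"
        f"{schema}\n\n"
        f"Question: {question}\n"
        "Answer with a single SQLite query."
    )

# Hypothetical two-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (match_id INTEGER PRIMARY KEY, format TEXT, venue TEXT)")
conn.execute("CREATE TABLE deliveries (match_id INTEGER, batter TEXT, runs INTEGER)")

prompt = schema_only_prompt(conn, "Who scored the most runs in ODI matches?")
print(prompt)
```

Because nothing beyond the schema is supplied, any cricket-specific convention has to come from the model itself, which is what makes this setup a test of intrinsic domain knowledge.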

If this is right

  • No single model leads on all formats, so format-specific fine-tuning or prompting will be required.
  • The large gap between execution accuracy and DMA shows that syntactic correctness alone is insufficient for usable cricket analytics tools.
  • Multilingual performance remains low across Hindi, Punjabi, and Telugu, indicating that language-specific domain knowledge is still missing.
  • Zero scores on hard ODI queries identify a concrete class of questions that current models cannot handle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks for other high-interest sports would likely expose comparable semantic gaps.
  • Reliable cricket analytics tools will require either larger domain-specific training sets or new reasoning mechanisms beyond current prompting.
  • The benchmark could serve as a testbed for measuring progress in domain adaptation techniques for Text-to-SQL.

Load-bearing premise

The 2,654 curated questions and their gold SQL answers accurately represent the distribution of questions that cricket fans and analysts actually ask.

What would settle it

A model that reaches DMA above 60 percent on the CricBench test set while keeping execution accuracy above 95 percent would falsify the claimed semantic gap.

Original abstract

Cricket is the second most popular sport worldwide, with billions of fans seeking advanced statistical insights unavailable through standard web searches. Although LLMs have advanced significantly in Text-to-SQL tasks, their capability to handle domain-specific nuances and multilingual requirements in sports analytics remains under-explored. We present CricBench, a benchmark suite evaluating the intrinsic SQL generation abilities of LLMs on cricket data across four formats: Test, ODI, T20I, and IPL. We curate a Gold-Standard dataset of 2,654 evaluation instances across four languages (English, Hindi, Punjabi, and Telugu). We evaluate seven models, GPT-5 Mini, Claude Sonnet 4, DeepSeek R1 and V3, Qwen 235B, Llama 3.1, and Gemma 2, using schema-only prompting. No single model dominates across all formats: GPT-5 Mini leads on Test cricket (12.4% DMA), Qwen 235B leads on IPL (28.7%) and T20I (17.5%), and all models score 0% on hard ODI queries. All models show a stark disconnect between syntactic validity (>98% execution accuracy) and semantic correctness (<29% DMA), with a domain gap of 37-55 percentage points versus BIRD. To our knowledge, CricBench is the first Text-to-SQL benchmark for cricket analytics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CricBench as the first Text-to-SQL benchmark for cricket analytics, curating 2,654 gold-standard instances across Test, ODI, T20I, and IPL formats in English, Hindi, Punjabi, and Telugu. It evaluates seven LLMs (including GPT-5 Mini, Claude Sonnet 4, and others) with schema-only prompting, reporting >98% execution accuracy but <29% DMA semantic correctness, a 37-55pp gap versus BIRD, zero scores on hard ODI queries, and format-specific model leaders.

Significance. If the dataset curation proves robust and representative, CricBench would usefully expose the syntactic-semantic disconnect in domain-specific, multilingual Text-to-SQL for sports analytics and could guide targeted improvements in LLM handling of cricket statistics.

major comments (1)
  1. [Dataset construction] Dataset construction section: the description of how the 2,654 queries were sourced, how gold SQL was authored and verified, and what validation steps (expert review, translation fidelity checks for Hindi/Punjabi/Telugu, inter-annotator agreement) were applied is absent or insufficient. Because the central claims (first cricket benchmark, reported DMA scores, domain gap, and per-format rankings) rest directly on query realism and gold correctness, this omission makes the performance numbers difficult to interpret and risks conflating model limitations with dataset artifacts.
minor comments (2)
  1. [Abstract] Abstract: expand 'DMA' on first use and briefly define the metric.
  2. [Evaluation] Evaluation setup: include the exact schema-only prompt template and any few-shot examples to support reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The primary concern regarding insufficient detail in the dataset construction section is valid and directly impacts interpretability of our results. We will revise the manuscript to address this fully.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the description of how the 2,654 queries were sourced, how gold SQL was authored and verified, and what validation steps (expert review, translation fidelity checks for Hindi/Punjabi/Telugu, inter-annotator agreement) were applied is absent or insufficient. Because the central claims (first cricket benchmark, reported DMA scores, domain gap, and per-format rankings) rest directly on query realism and gold correctness, this omission makes the performance numbers difficult to interpret and risks conflating model limitations with dataset artifacts.

    Authors: We agree that the current manuscript lacks sufficient detail on dataset construction, which is essential for validating the benchmark's realism and the reported performance gaps. In the revised version, we will expand the relevant section to describe:

      1. sourcing of the 2,654 queries from official cricket statistics repositories (e.g., ESPNcricinfo, official IPL/T20I databases) combined with expert-designed templates covering common analytics scenarios across Test, ODI, T20I, and IPL formats;
      2. authoring of gold SQL by a team of three cricket domain experts with SQL proficiency, following a standardized schema derived from cricket data models;
      3. verification through independent review by two additional SQL experts and one cricket statistician, with iterative corrections until consensus;
      4. for the multilingual component, native-speaker translation of English queries followed by fidelity checks via back-translation and semantic equivalence validation by bilingual linguists; and
      5. inter-annotator agreement computed on a 10% sample using Cohen's kappa (target >0.8) for both query intent and SQL correctness.

    These additions will allow readers to assess potential artifacts and strengthen confidence in the domain gap findings versus BIRD.

    revision: yes
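For readers unfamiliar with the agreement statistic named in the rebuttal, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from scratch. The labels are invented toy data, not results from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented toy judgements on whether ten gold SQL queries are correct.
annotator_1 = ["correct"] * 8 + ["wrong"] * 2
annotator_2 = ["correct"] * 7 + ["wrong"] * 3

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 9 of 10 items, yet kappa comes out near 0.74 because most of that agreement is expected by chance when "correct" dominates. A target of kappa above 0.8 is therefore stricter than it may first appear.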

Circularity Check

0 steps flagged

Empirical benchmark creation with no derivation chain or self-referential reductions

Full rationale

The paper constructs and releases CricBench as an original multilingual Text-to-SQL dataset for cricket analytics, then reports model performance metrics (execution accuracy, DMA) on it. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on dataset curation and direct evaluation rather than any chain that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This is a standard empirical benchmark paper whose validity depends on external verification of the 2,654 instances, not on internal circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the curated queries form a representative gold standard for cricket analytics needs. No free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5588 in / 1118 out tokens · 30049 ms · 2026-05-16T19:22:34.035721+00:00 · methodology

