arxiv: 2604.26986 · v1 · submitted 2026-04-28 · 💻 cs.CL

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

Pith reviewed 2026-05-07 16:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords digital battery passport · conformance classification · benchmark dataset · language model evaluation · EU battery regulation · synthetic data · zero-shot inference · prompt injection

The pith

The paper introduces the first public dataset for classifying whether digital battery passports conform to regulatory standards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors establish a new task of digital battery passport conformance classification because the EU regulation requiring these passports is about to take effect with no existing public data for automation. They generate BatteryPass-12K synthetically from real pilot samples to create the initial benchmark and evaluate 22 language models across zero-shot and few-shot inference. Results indicate that models with explicit reasoning steps achieve the highest scores while simply increasing parameter count does not reliably improve outcomes. Few-shot prompting raises performance but prompt-injection attacks reduce it, and the dataset may extend to related battery-domain tasks such as lifecycle analysis.

Core claim

The central claim is that BatteryPass-12K, generated synthetically from real pilot samples, supplies the first public benchmark for digital battery passport (DBP) conformance classification. Evaluations of 22 language models show thinking models performing best, with GPT-5.4 averaging 0.98 F1 (95% confidence interval 0.03) on validation and 0.71 (0.22) on test. Few-shot examples help, scaling alone does not, and prompt-injection attacks degrade results.

What carries the argument

The synthetic generation process that turns real pilot battery passport samples into labeled conformance classification examples.

If this is right

  • Thinking models outperform both smaller LMs and many larger dense models on the task.
  • Adding a few labeled examples raises average performance across model sizes.
  • Generally capable frontier models still struggle with the classification.
  • Parameter scaling alone does not guarantee better results since some small LMs beat certain LLMs.
  • Prompt-injection attacks reduce classification accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated tools built on this benchmark could help manufacturers prepare compliance systems ahead of the EU deadline.
  • The gap between validation and test performance points to the value of collecting more diverse real-world passport data over time.
  • The synthetic-from-pilot method may transfer to other emerging regulatory documentation tasks in energy storage or supply chains.

Load-bearing premise

The synthetic examples drawn from pilot samples accurately reflect the distribution and edge cases that appear in actual deployed digital battery passports.

What would settle it

Running the same models on a new collection of real, non-pilot digital battery passports and finding substantially lower F1 scores or different error patterns than on BatteryPass-12K.

Figures

Figures reproduced from arXiv: 2604.26986 by Lama Alkhaled, Marcus Liwicki, Martin Karlsson, Tosin Adewumi.

Figure 1: Flowchart of the BatteryPass-12K data and metadata generation pipeline.
Figure 2: Scaling parameters vs performance on the validation set. Using 2-sigma error bars based…
Figure 3: Confusion matrices for the test set. Correctly predicting conformant samples is more…
Figure 4: Confusion matrices of adversarial attacks for the test set with GPT-5.4
Original abstract

We introduce a novel task of digital battery passport (DBP) conformance classification and introduce the first public benchmark for the task: BatteryPass-12K, created synthetically from real pilot samples. This is as the EU's battery regulation on DBPs comes into effect soon and there exists no public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture of experts (MoEs), and dense LLMs. We also conducted analysis, additional evaluations of few-shot inference and prompt-injection attacks to find that (1) Thinking models have the best performance (with GPT-5.4 scoring 0.98 (0.03) and 0.71 (0.22) on average as F1 (and confidence interval at 95%) on the validation and test sets, respectively), (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily lead to improved performance, as SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g. lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).
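
The abstract reports each score as an average F1 followed by a 95% confidence interval, for example 0.98 (0.03). A minimal sketch of how such a pair might be computed is given below. It assumes binary labels, several repeated runs per model, and a normal-approximation interval. The label strings, the function names, and the choice of interval are illustrative assumptions, not the authors' evaluation code.

```python
import math
from statistics import mean, stdev


def f1_score(y_true, y_pred, positive="conformant"):
    """Binary F1 for the class named by `positive` (label string is an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_with_ci95(scores):
    """Mean and half-width of a normal-approximation 95% interval.

    Needs at least two scores, because the sample standard deviation
    is undefined for a single value.
    """
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width


# Hypothetical per-run F1 values for one model; not taken from the paper.
run_scores = [0.70, 0.74, 0.69, 0.72, 0.71]
avg, ci = mean_with_ci95(run_scores)
print(f"F1 = {avg:.2f} ({ci:.2f})")
```

Under this reading, a wide interval reflects high run-to-run variance rather than a low average score.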

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a novel task of digital battery passport (DBP) conformance classification and presents BatteryPass-12K as the first public benchmark for it, synthetically generated from real pilot samples under the upcoming EU battery regulation. It evaluates 22 language models (SLMs, MoEs, dense LLMs) in zero-shot inference, reports F1 scores with 95% CIs (e.g., GPT-5.4 at 0.98/0.71 on validation/test), analyzes few-shot improvements and prompt-injection degradation, and publicly releases the dataset under CC-BY-4.0, suggesting utility for related battery tasks like lifecycle reasoning.

Significance. If the synthetic data faithfully represents real-world DBP conformance distributions and edge cases, the work fills a timely gap by providing the first public benchmark for an emerging regulatory task with no prior public datasets. The broad model evaluation, few-shot and adversarial analyses, and permissive release are strengths that could support further research in the battery domain. The empirical findings (thinking models best, scaling not always helpful, prompt injection harmful) offer practical insights conditional on dataset quality.

major comments (3)
  1. [Dataset creation section] The synthesis process from real pilot samples is described only at a high level with no enumeration of rules, templates, generation parameters, or coverage of rare conformance failures. This is load-bearing for the central claim that BatteryPass-12K is a valid first public benchmark, as reproducibility and edge-case assessment are impossible without these details.
  2. [Evaluation section] No quantitative validation metrics (e.g., statistical distance such as KL divergence or Wasserstein distance on label distributions, feature statistics, or held-out real samples) are reported to support the representativeness assumption. All model rankings and F1 results (including 0.98/0.71 for GPT-5.4 and SLM vs. LLM comparisons) are therefore conditional on an untested assumption that directly affects benchmark validity.
  3. [§4 (Results and Analysis)] The reported splits, exact label balance, and how the 12K samples were divided into validation/test sets are not detailed, nor is any audit of edge-case coverage. This undermines the cross-model and few-shot claims, as performance could be an artifact of synthetic construction rather than task difficulty.
minor comments (2)
  1. [Abstract] The phrasing 'This is as the EU's battery regulation on DBPs comes into effect soon' is grammatically unclear and should be revised for readability.
  2. [Results section] The manuscript would benefit from a table summarizing the 22 models by type (SLM/MoE/LLM), parameter count, and key F1 results for quick reference.
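
The representativeness check asked for in major comment 2 could be sketched roughly as follows. The snippet estimates label distributions for a synthetic set and a real reference set and computes the KL divergence between them. The label names, the smoothing constant, and the sample counts are assumptions made for illustration; the paper does not report such a comparison.

```python
import math
from collections import Counter


def label_distribution(labels, classes, smoothing=1e-9):
    """Relative frequency of each class, lightly smoothed to avoid zeros."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(classes)
    return [(counts[c] + smoothing) / total for c in classes]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical label lists; the class names and counts are not from the paper.
classes = ["conformant", "non-conformant"]
synthetic = ["conformant"] * 50 + ["non-conformant"] * 50
real = ["conformant"] * 25 + ["non-conformant"] * 75

p = label_distribution(synthetic, classes)
q = label_distribution(real, classes)
print(f"KL(synthetic || real) = {kl_divergence(p, q):.4f}")
```

A value near zero would indicate matching label balance only. It says nothing about whether field-level content or rare failure types are covered.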

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the timeliness of introducing the first public benchmark for digital battery passport conformance classification. We address each major comment point by point below, proposing specific revisions to improve clarity, reproducibility, and transparency while preserving the manuscript's core contributions.

point-by-point responses
  1. Referee: [Dataset creation section] The synthesis process from real pilot samples is described only at a high level with no enumeration of rules, templates, generation parameters, or coverage of rare conformance failures. This is load-bearing for the central claim that BatteryPass-12K is a valid first public benchmark, as reproducibility and edge-case assessment are impossible without these details.

    Authors: We agree that greater detail on the synthesis process is necessary for full reproducibility. In the revised manuscript, we will expand the Dataset Creation section to enumerate the specific rules, templates, and generation parameters derived from the real pilot samples. We will also add a subsection on edge-case coverage, describing how rare conformance failures observed in the pilots were incorporated into the synthetic generation. The complete BatteryPass-12K dataset remains publicly available under CC-BY-4.0, enabling independent inspection and verification of the resulting samples.
    Revision: yes

  2. Referee: [Evaluation section] No quantitative validation metrics (e.g., statistical distance such as KL divergence or Wasserstein distance on label distributions, feature statistics, or held-out real samples) are reported to support the representativeness assumption. All model rankings and F1 results (including 0.98/0.71 for GPT-5.4 and SLM vs. LLM comparisons) are therefore conditional on an untested assumption that directly affects benchmark validity.

    Authors: We acknowledge that explicit quantitative validation metrics would strengthen claims of representativeness. However, the synthesis was performed from a limited set of proprietary real pilot samples, precluding computation of distances such as KL divergence or Wasserstein against a large held-out real distribution. In the revision, we will add a Limitations subsection that explicitly states this assumption, reports the observed label distributions and feature statistics within BatteryPass-12K, and compares them where possible to regulatory expectations. We will also qualify all performance claims accordingly while noting that the grounded synthetic process provides a practical first benchmark for this novel regulatory task.
    Revision: partial

  3. Referee: [§4 (Results and Analysis)] The reported splits, exact label balance, and how the 12K samples were divided into validation/test sets are not detailed, nor is any audit of edge-case coverage. This undermines the cross-model and few-shot claims, as performance could be an artifact of synthetic construction rather than task difficulty.

    Authors: We will revise §4 to provide full details on the data splits, including the exact label balance across the 12K samples and the division methodology (stratified random split with explicit proportions for validation and test sets). We will also include an audit of edge-case coverage, mapping how conformance failure types from the pilot samples are represented in each split. These additions will clarify that observed performance differences, including few-shot gains and model comparisons, reflect task characteristics rather than construction artifacts.
    Revision: yes
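
The stratified random split mentioned in the third response can be illustrated with a short sketch. It groups samples by label, shuffles each group with a fixed seed, and sends the same fraction of every group to the test set so that label balance is preserved. The sample structure, the `test_fraction` value, and the function name are hypothetical; the actual proportions used for BatteryPass-12K are not stated in the text above.

```python
import random
from collections import defaultdict


def stratified_split(samples, label_of, test_fraction=0.2, seed=0):
    """Split samples into (validation, test) while keeping per-label proportions."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    validation, test = [], []
    for label in sorted(by_label):  # sorted for a deterministic group order
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        validation.extend(group[cut:])
    return validation, test


# Hypothetical (passport_id, label) records; not drawn from the dataset.
records = [(i, "conformant") for i in range(80)]
records += [(i, "non-conformant") for i in range(80, 100)]

val_set, test_set = stratified_split(records, label_of=lambda r: r[1], test_fraction=0.25)
print(len(val_set), len(test_set))
```

Reporting the seed and the per-label counts of each split alongside the results would let readers reproduce the partition exactly.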

Circularity Check

0 steps flagged

No circularity: dataset creation and empirical benchmarking stand independently

full rationale

The paper introduces a novel DBP conformance classification task and releases BatteryPass-12K, a synthetic dataset derived from real pilot samples, followed by zero-shot and few-shot evaluations of 22 language models plus prompt-injection tests. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim (first public benchmark for the task) rests on the act of synthesis and public release under CC-BY-4.0, which is self-contained and externally verifiable without reference to prior author results. The representativeness of the synthetic data is an empirical assumption open to external falsification, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that synthetic data from pilot samples is representative and that language models can meaningfully classify regulatory conformance text.

axioms (2)
  • [ad hoc to paper] Synthetic data generated from real pilot samples is representative of real-world DBP conformance cases.
    Invoked in the creation of BatteryPass-12K as described in the abstract.
  • [domain assumption] Language models can perform zero-shot and few-shot text classification on regulatory documents.
    Basis for evaluating 22 LMs including SLMs, MoEs, and LLMs.
