BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

Lama Alkhaled; Marcus Liwicki; Martin Karlsson; Tosin Adewumi

arxiv: 2604.26986 · v1 · submitted 2026-04-28 · 💻 cs.CL

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

Tosin Adewumi , Martin Karlsson , Lama Alkhaled , Marcus Liwicki This is my paper

Pith reviewed 2026-05-07 16:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords digital battery passportconformance classificationbenchmark datasetlanguage model evaluationEU battery regulationsynthetic datazero-shot inferenceprompt injection

0 comments

The pith

The paper introduces the first public dataset for classifying whether digital battery passports conform to regulatory standards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors establish a new task of digital battery passport conformance classification because the EU regulation requiring these passports is about to take effect with no existing public data for automation. They generate BatteryPass-12K synthetically from real pilot samples to create the initial benchmark and evaluate 22 language models across zero-shot and few-shot inference. Results indicate that models with explicit reasoning steps achieve the highest scores while simply increasing parameter count does not reliably improve outcomes. Few-shot prompting raises performance but prompt-injection attacks reduce it, and the dataset may extend to related battery-domain tasks such as lifecycle analysis.

Core claim

The central claim is that BatteryPass-12K supplies the first public benchmark for DBP conformance classification, generated synthetically from real pilot samples, and that evaluations of 22 language models reveal thinking models reaching 0.98 F1 on validation and 0.71 on test while few-shot examples help, scaling alone does not, and prompt attacks degrade results.

What carries the argument

The synthetic generation process that turns real pilot battery passport samples into labeled conformance classification examples.

If this is right

Thinking models outperform both smaller LMs and many larger dense models on the task.
Adding a few labeled examples raises average performance across model sizes.
Generally capable frontier models still struggle with the classification.
Parameter scaling alone does not guarantee better results since some small LMs beat certain LLMs.
Prompt-injection attacks reduce classification accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automated tools built on this benchmark could help manufacturers prepare compliance systems ahead of the EU deadline.
The gap between validation and test performance points to the value of collecting more diverse real-world passport data over time.
The synthetic-from-pilot method may transfer to other emerging regulatory documentation tasks in energy storage or supply chains.

Load-bearing premise

The synthetic examples drawn from pilot samples accurately reflect the distribution and edge cases that appear in actual deployed digital battery passports.

What would settle it

Running the same models on a new collection of real, non-pilot digital battery passports and finding substantially lower F1 scores or different error patterns than on BatteryPass-12K.

Figures

Figures reproduced from arXiv: 2604.26986 by Lama Alkhaled, Marcus Liwicki, Martin Karlsson, Tosin Adewumi.

**Figure 1.** Figure 1: Flowchart of the BatteryPass-12K data and metadata generation pipeline. view at source ↗

**Figure 2.** Figure 2: Scaling parameters vs performance on the validation set. Using 2-sigma error bars based view at source ↗

**Figure 3.** Figure 3: Confusion matrices for the test set. Correctly predicting conformant samples is more view at source ↗

**Figure 4.** Figure 4: Confusion matrices of adversarial attacks for the test set with GPT-5.4 view at source ↗

read the original abstract

We introduce a novel task of digital battery passport (DBP) conformance classification and introduce the first public benchmark for the task: BatteryPass-12K, created synthetically from real pilot samples. This is as the EU's battery regulation on DBPs comes into effect soon and there exists no public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture of experts (MoEs), and dense LLMs. We also conducted analysis, additional evaluations of few-shot inference and prompt-injection attacks to find that (1) Thinking models have the best performance (with GPT-5.4 scoring 0.98 (0.03) and 0.71 (0.22) on average as F1 (and confidence interval at 95%) on the validation and test sets, respectively), (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily lead to improved performance, as SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g. lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper releases the first public dataset for digital battery passport conformance classification along with model baselines, but the synthetic generation from pilot samples lacks validation against real distributions.

read the letter

The key point is that this work creates and releases BatteryPass-12K, the first public benchmark for classifying whether digital battery passports meet EU regulatory requirements. They built it synthetically from real pilot samples and tested 22 language models in zero-shot settings, with additional runs on few-shot prompting and prompt-injection attacks. The dataset comes out under CC-BY-4.0, which is a practical move for a new task with no prior public resources.

Referee Report

3 major / 2 minor

Summary. The paper introduces a novel task of digital battery passport (DBP) conformance classification and presents BatteryPass-12K as the first public benchmark for it, synthetically generated from real pilot samples under the upcoming EU battery regulation. It evaluates 22 language models (SLMs, MoEs, dense LLMs) in zero-shot inference, reports F1 scores with 95% CIs (e.g., GPT-5.4 at 0.98/0.71 on validation/test), analyzes few-shot improvements and prompt-injection degradation, and publicly releases the dataset under CC-BY-4.0, suggesting utility for related battery tasks like lifecycle reasoning.

Significance. If the synthetic data faithfully represents real-world DBP conformance distributions and edge cases, the work fills a timely gap by providing the first public benchmark for an emerging regulatory task with no prior public datasets. The broad model evaluation, few-shot and adversarial analyses, and permissive release are strengths that could support further research in the battery domain. The empirical findings (thinking models best, scaling not always helpful, prompt injection harmful) offer practical insights conditional on dataset quality.

major comments (3)

[Dataset creation section] Dataset creation section: The synthesis process from real pilot samples is described only at a high level with no enumeration of rules, templates, generation parameters, or coverage of rare conformance failures. This is load-bearing for the central claim that BatteryPass-12K is a valid first public benchmark, as reproducibility and edge-case assessment are impossible without these details.
[Evaluation section] Evaluation section: No quantitative validation metrics (e.g., statistical distance such as KL divergence or Wasserstein distance on label distributions, feature statistics, or held-out real samples) are reported to support the representativeness assumption. All model rankings and F1 results (including 0.98/0.71 for GPT-5.4 and SLM vs. LLM comparisons) are therefore conditional on an untested assumption that directly affects benchmark validity.
[§4 (Results and Analysis)] §4 (Results and Analysis): The reported splits, exact label balance, and how the 12K samples were divided into validation/test sets are not detailed, nor is any audit of edge-case coverage. This undermines the cross-model and few-shot claims, as performance could be an artifact of synthetic construction rather than task difficulty.

minor comments (2)

[Abstract] Abstract: The phrasing 'This is as the EU's battery regulation on DBPs comes into effect soon' is grammatically unclear and should be revised for readability.
[Results section] The manuscript would benefit from a table summarizing the 22 models by type (SLM/MoE/LLM), parameter count, and key F1 results for quick reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the timeliness of introducing the first public benchmark for digital battery passport conformance classification. We address each major comment point by point below, proposing specific revisions to improve clarity, reproducibility, and transparency while preserving the manuscript's core contributions.

read point-by-point responses

Referee: [Dataset creation section] Dataset creation section: The synthesis process from real pilot samples is described only at a high level with no enumeration of rules, templates, generation parameters, or coverage of rare conformance failures. This is load-bearing for the central claim that BatteryPass-12K is a valid first public benchmark, as reproducibility and edge-case assessment are impossible without these details.

Authors: We agree that greater detail on the synthesis process is necessary for full reproducibility. In the revised manuscript, we will expand the Dataset Creation section to enumerate the specific rules, templates, and generation parameters derived from the real pilot samples. We will also add a subsection on edge-case coverage, describing how rare conformance failures observed in the pilots were incorporated into the synthetic generation. The complete BatteryPass-12K dataset remains publicly available under CC-BY-4.0, enabling independent inspection and verification of the resulting samples. revision: yes
Referee: [Evaluation section] Evaluation section: No quantitative validation metrics (e.g., statistical distance such as KL divergence or Wasserstein distance on label distributions, feature statistics, or held-out real samples) are reported to support the representativeness assumption. All model rankings and F1 results (including 0.98/0.71 for GPT-5.4 and SLM vs. LLM comparisons) are therefore conditional on an untested assumption that directly affects benchmark validity.

Authors: We acknowledge that explicit quantitative validation metrics would strengthen claims of representativeness. However, the synthesis was performed from a limited set of proprietary real pilot samples, precluding computation of distances such as KL divergence or Wasserstein against a large held-out real distribution. In the revision, we will add a Limitations subsection that explicitly states this assumption, reports the observed label distributions and feature statistics within BatteryPass-12K, and compares them where possible to regulatory expectations. We will also qualify all performance claims accordingly while noting that the grounded synthetic process provides a practical first benchmark for this novel regulatory task. revision: partial
Referee: [§4 (Results and Analysis)] §4 (Results and Analysis): The reported splits, exact label balance, and how the 12K samples were divided into validation/test sets are not detailed, nor is any audit of edge-case coverage. This undermines the cross-model and few-shot claims, as performance could be an artifact of synthetic construction rather than task difficulty.

Authors: We will revise §4 to provide full details on the data splits, including the exact label balance across the 12K samples and the division methodology (stratified random split with explicit proportions for validation and test sets). We will also include an audit of edge-case coverage, mapping how conformance failure types from the pilot samples are represented in each split. These additions will clarify that observed performance differences, including few-shot gains and model comparisons, reflect task characteristics rather than construction artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset creation and empirical benchmarking stand independently

full rationale

The paper introduces a novel DBP conformance classification task and releases BatteryPass-12K, a synthetic dataset derived from real pilot samples, followed by zero-shot and few-shot evaluations of 22 language models plus prompt-injection tests. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim (first public benchmark for the task) rests on the act of synthesis and public release under CC-BY-4.0, which is self-contained and externally verifiable without reference to prior author results. The representativeness of the synthetic data is an empirical assumption open to external falsification, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that synthetic data from pilot samples is representative and that language models can meaningfully classify regulatory conformance text.

axioms (2)

ad hoc to paper Synthetic data generated from real pilot samples is representative of real-world DBP conformance cases
Invoked in the creation of BatteryPass-12K as described in the abstract
domain assumption Language models can perform zero-shot and few-shot text classification on regulatory documents
Basis for evaluating 22 LMs including SLMs, MoEs, and LLMs

pith-pipeline@v0.9.0 · 5564 in / 1318 out tokens · 88848 ms · 2026-05-07T16:10:30.267244+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1]

Feasibility of meeting future battery demand via domestic cell production in europe,

S. Link, L. Schneider, A. Stephan, L. Weymann, and P. Plötz, “Feasibility of meeting future battery demand via domestic cell production in europe,”Nature Energy, vol. 10, no. 4, pp. 526–534, 2025

work page 2025
[2]

Regulation (eu) 2023/1542 of the european parliament and of the council,

Act, “Regulation (eu) 2023/1542 of the european parliament and of the council,”Regulation (eu), 2023

work page 2023
[3]

Towards to battery digital passport: reviewing regulations and standards for second-life batteries,

C. A. Rufino Júnior, E. Riva Sanseverino, P. Gallo, D. Koch, S. Diel, G. Walter, L. Trilla, V . J. Ferreira, G. B. Pérez, Y . Kotaket al., “Towards to battery digital passport: reviewing regulations and standards for second-life batteries,”Batteries, vol. 10, no. 4, p. 115, 2024

work page 2024
[4]

The quest for more circular battery value chains: Implementing the eu digital battery passport and remaining challenges,

R. Losa and S. Torjesen, “The quest for more circular battery value chains: Implementing the eu digital battery passport and remaining challenges,”Cleaner Production Letters, p. 100118, 2025

work page 2025
[5]

Contradictions and inconsistencies in regulatory documents–a qualitative assessment from practice,

G. Schumann, “Contradictions and inconsistencies in regulatory documents–a qualitative assessment from practice,” in2025 15th International Conference on Advanced Computer Information Technologies (ACIT). IEEE, 2025, pp. 273–282

work page 2025
[6]

Automating dataset updates towards reliable and timely evaluation of large language models,

J. Ying, Y . Cao, Y . Bai, Q. Sun, B. Wang, W. Tang, Z. Ding, Y . Yang, X. Huang, and S. Yan, “Automating dataset updates towards reliable and timely evaluation of large language models,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates,...

work page 2024
[7]

Generate, annotate, and learn: Nlp with synthetic text,

X. He, I. Nassar, J. Kiros, G. Haffari, and M. Norouzi, “Generate, annotate, and learn: Nlp with synthetic text,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 826–842, 2022

work page 2022
[8]

Synthetic data generation using large language models: Advances in text and code,

M. Nad˘as,, L. Dios, an, and A. Tomescu, “Synthetic data generation using large language models: Advances in text and code,”IEEE Access, 2025. 16

work page 2025
[9]

Genai synthetic data create ethical challenges for scientists. here’s how to address them

D. B. Resnik, M. Hosseini, J. J. Kim, G. Epiphaniou, and C. Maple, “Genai synthetic data create ethical challenges for scientists. here’s how to address them.”Proceedings of the National Academy of Sciences, vol. 122, no. 9, p. e2409182122, 2025

work page 2025
[10]

Machine actionable metadata models,

D. Batista, A. Gonzalez-Beltran, S.-A. Sansone, and P. Rocca-Serra, “Machine actionable metadata models,” Scientific Data, vol. 9, no. 1, p. 592, 2022

work page 2022
[11]

Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks,

A. Szymanski, N. Ziems, H. A. Eicher-Miller, T. J.-J. Li, M. Jiang, and R. A. Metoyer, “Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks,” inProceedings of the 30th International Conference on Intelligent User Interfaces, 2025, pp. 952–966

work page 2025
[12]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

work page 2023
[13]

Data statements for natural language processing: Toward mitigating system bias and enabling better science,

E. M. Bender and B. Friedman, “Data statements for natural language processing: Toward mitigating system bias and enabling better science,”Transactions of the Association for Computational Linguistics, vol. 6, pp. 587–604, 2018. [Online]. Available: https://aclanthology.org/Q18-1041/

work page 2018
[14]

Complementarity, f-score, and nlp evaluation,

L. Derczynski, “Complementarity, f-score, and nlp evaluation,” inProceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 261–266

work page 2016
[15]

A review of evaluation metrics in machine learning algorithms,

G. Naidu, T. Zuva, and E. M. Sibanda, “A review of evaluation metrics in machine learning algorithms,” in Computer science on-line conference. Springer, 2023, pp. 15–25

work page 2023
[16]

Scikit-learn: Machine learning in python,

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourget al., “Scikit-learn: Machine learning in python,”the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

work page 2011
[17]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[18]

RAFT: A real-world few-shot text classification benchmark,

N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham, C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier, M. Noetel, and A. Stuhlmüller, “RAFT: A real-world few-shot text classification benchmark,” in NeurIPS Datasets and Benchmarks, 2021. [Online]. Available: https://datasets-benchmarks-proceedings. neurips.cc/paper/2021/file/ca46c1b9512a7a8315fa...

work page 2021
[19]

Clues: Few-shot learning evaluation in natural language understanding,

S. Mukherjee, X. Liu, G. Zheng, S. Hosseini, H. Cheng, G. Yang, C. Meek, A. Awadallah, and J. Gao, “Clues: Few-shot learning evaluation in natural language understanding,” inThirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021. [Online]. Available: https://datasets-benchmarks-proceedings.ne...

work page 2021
[20]

True few-shot learning with language models,

E. Perez, D. Kiela, and K. Cho, “True few-shot learning with language models,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 11 054–11 070. [Online]. Available: https: //proceedings.neurips.cc/paper_files/paper/2021/file/5c0492567...

work page 2021
[21]

A systematic review of prompt injection attacks on large language models: Trends, taxonomy, evaluation, defenses and opportunities,

J. D. Duarte, G. D. Cândido, J. R. A. De Britto Filho, J. S. Neto, E. J. Costa, J. P. J. Da Costa, and L. P. De Melo, “A systematic review of prompt injection attacks on large language models: Trends, taxonomy, evaluation, defenses and opportunities,”IEEE Access, 2026

work page 2026
[22]

The synthetic data vault,

N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in2016 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 2016, pp. 399–410

work page 2016
[23]

Learning to generate synthetic data via compositing,

S. Tripathi, S. Chandra, A. Agrawal, A. Tyagi, J. M. Rehg, and V . Chari, “Learning to generate synthetic data via compositing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 461–470

work page 2019
[24]

Synthetic data generation: A comparative study,

M. Endres, A. Mannarapotta Venugopal, and T. S. Tran, “Synthetic data generation: A comparative study,” inProceedings of the 26th international database engineered applications symposium, 2022, pp. 94–102

work page 2022
[25]

Synthetic data generation with large language models for text classification: Potential and limitations,

Z. Li, H. Zhu, Z. Lu, and M. Yin, “Synthetic data generation with large language models for text classification: Potential and limitations,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 10 443–10 461. 17

work page 2023
[26]

On llms-driven synthetic data gen- eration, curation, and evaluation: A survey,

L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, and H. Wang, “On llms-driven synthetic data gen- eration, curation, and evaluation: A survey,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 11 065–11 082

work page 2024
[27]

Sequential subset matching for dataset distillation,

J. Du, Q. Shi, and J. T. Zhou, “Sequential subset matching for dataset distillation,”Advances in Neural Information Processing Systems, vol. 36, pp. 67 487–67 504, 2023

work page 2023
[28]

Helpsteer 2: Open-source dataset for training top-performing reward models,

Z. Wang, Y . Dong, O. Delalleau, J. Zeng, G. Shen, D. Egert, J. J. Zhang, M. N. Sreedhar, and O. Kuchaiev, “Helpsteer 2: Open-source dataset for training top-performing reward models,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, In...

work page 2024
[29]

Improving LLM-as-a-judge inference with the judgment distribution,

V . Wang, M. J. Zhang, and E. Choi, “Improving LLM-as-a-judge inference with the judgment distribution,” inFindings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 23 173–23 199. [Online]. Available: htt...

work page 2025
[30]

Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering,

R. Wang, J. Guo, C. Gao, G. Fan, C. Y . Chong, and X. Xia, “Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering,”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1955–1977, 2025

work page 1955
[31]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge,

D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y . Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, and H. Liu, “From generation to judgment: Opportunities and challenges of LLM-as-a-judge,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Pen...

work page 2025
[32]

Ask me like I’m human: LLM-based evaluation with for-human instructions correlates better with human evaluations than human judges,

R. Huidrom and A. Belz, “Ask me like I’m human: LLM-based evaluation with for-human instructions correlates better with human evaluations than human judges,” inProceedings of the 4th Table Representation Learning Workshop, S. Chang, M. Hulsebos, Q. Liu, W. Chen, and H. Sun, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 98...

work page 2025
[33]

Training compute-optimal large language models,

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” inAdvances in Neural Inform...

work page 2022
[34]

Observational scaling laws and the predictability of langauge model performance,

Y . Ruan, C. J. Maddison, and T. Hashimoto, “Observational scaling laws and the predictability of langauge model performance,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 15 841–15 892. [Online]. Available: https://...

work page 2024
[35]

On the codesign of scientific experiments and industrial systems,

T. Dorigo, P. Vischia, S. Abbas, T. Adewumi, L. Alkhaled, L. Arsini, M. Awais, M. Borisyak, A. Bóta, F. Buryet al., “On the codesign of scientific experiments and industrial systems,”arXiv preprint arXiv:2603.26613, 2026

work page arXiv 2026
[36]

Compute optimal scaling of skills: Knowledge vs reasoning,

N. Roberts, N. S. Chatterji, S. Narang, M. Lewis, and D. Hupkes, “Compute optimal scaling of skills: Knowledge vs reasoning,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13 295–13 316. [Online]. Ava...

work page 2025
[37]

On the limitations of large language models (LLMs): False attribution,

T. Adewumi, N. Habib, L. Alkhaled, and E. Barney, “On the limitations of large language models (LLMs): False attribution,” inProceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, G. Angelova, M. Kunilovskaya, M. Escribe, and R. Mitkov, Eds. Varna, Bulgaria...

work page 2025
[38]

Ai must not be fully autonomous,

T. Adewumi, L. Alkhaled, F. Imbert, H. Han, N. Habib, and K. Löwenmark, “Ai must not be fully autonomous,”arXiv preprint arXiv:2507.23330, 2025. 18

work page arXiv 2025
[39]

Judging the judges: A systematic study of position bias in llm-as-a-judge,

L. Shi, C. Ma, W. Liang, X. Diao, W. Ma, and S. V osoughi, “Judging the judges: A systematic study of position bias in llm-as-a-judge,” inProceedings of the 14th International Joint Conference on Natural Lan- guage Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 292–314

work page 2025
[40]

The javascript object notation (json) data interchange format,

T. Bray, “The javascript object notation (json) data interchange format,” https://www.rfc-editor.org/rfc/ rfc8259, 2017, rFC 8259

work page 2017
[41]

Interoperability for provenance-aware databases using {PROV} and {JSON},

X. Niu, B. Glavic, D. Gawlick, Z. H. Liu, V . Krishnaswamy, and V . Radhakrishnan, “Interoperability for provenance-aware databases using {PROV} and {JSON},” in7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15), 2015

work page 2015
[42]

Ai-assisted json schema creation and mapping,

F. Neubauer, B. Uekermann, and J. Pleiss, “Ai-assisted json schema creation and mapping,” in2025 ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems Compan- ion (MODELS-C). IEEE, 2025, pp. 79–83

work page 2025
[43]

Survey on json data modelling,

T. Lv, P. Yan, and W. He, “Survey on json data modelling,” inJournal of physics: conference series, vol. 1069, no. 1. IOP Publishing, 2018, p. 012101

work page 2018
[44]

Jindex: Json and index search system for plant germplasm database,

T. Whairit, B. Phadermrod, and V . Attasena, “Jindex: Json and index search system for plant germplasm database,”Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 8, p. 101701, 2023. 19 20

work page 2023

[1] [1]

Feasibility of meeting future battery demand via domestic cell production in europe,

S. Link, L. Schneider, A. Stephan, L. Weymann, and P. Plötz, “Feasibility of meeting future battery demand via domestic cell production in europe,”Nature Energy, vol. 10, no. 4, pp. 526–534, 2025

work page 2025

[2] [2]

Regulation (eu) 2023/1542 of the european parliament and of the council,

Act, “Regulation (eu) 2023/1542 of the european parliament and of the council,”Regulation (eu), 2023

work page 2023

[3] [3]

Towards to battery digital passport: reviewing regulations and standards for second-life batteries,

C. A. Rufino Júnior, E. Riva Sanseverino, P. Gallo, D. Koch, S. Diel, G. Walter, L. Trilla, V . J. Ferreira, G. B. Pérez, Y . Kotaket al., “Towards to battery digital passport: reviewing regulations and standards for second-life batteries,”Batteries, vol. 10, no. 4, p. 115, 2024

work page 2024

[4] [4]

The quest for more circular battery value chains: Implementing the eu digital battery passport and remaining challenges,

R. Losa and S. Torjesen, “The quest for more circular battery value chains: Implementing the eu digital battery passport and remaining challenges,”Cleaner Production Letters, p. 100118, 2025

work page 2025

[5] [5]

Contradictions and inconsistencies in regulatory documents–a qualitative assessment from practice,

G. Schumann, “Contradictions and inconsistencies in regulatory documents–a qualitative assessment from practice,” in2025 15th International Conference on Advanced Computer Information Technologies (ACIT). IEEE, 2025, pp. 273–282

work page 2025

[6] [6]

Automating dataset updates towards reliable and timely evaluation of large language models,

J. Ying, Y . Cao, Y . Bai, Q. Sun, B. Wang, W. Tang, Z. Ding, Y . Yang, X. Huang, and S. Yan, “Automating dataset updates towards reliable and timely evaluation of large language models,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates,...

work page 2024

[7] [7]

Generate, annotate, and learn: Nlp with synthetic text,

X. He, I. Nassar, J. Kiros, G. Haffari, and M. Norouzi, “Generate, annotate, and learn: Nlp with synthetic text,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 826–842, 2022

work page 2022

[8] [8]

Synthetic data generation using large language models: Advances in text and code,

M. Nad˘as,, L. Dios, an, and A. Tomescu, “Synthetic data generation using large language models: Advances in text and code,”IEEE Access, 2025. 16

work page 2025

[9] [9]

Genai synthetic data create ethical challenges for scientists. here’s how to address them

D. B. Resnik, M. Hosseini, J. J. Kim, G. Epiphaniou, and C. Maple, “Genai synthetic data create ethical challenges for scientists. here’s how to address them.”Proceedings of the National Academy of Sciences, vol. 122, no. 9, p. e2409182122, 2025

work page 2025

[10] [10]

Machine actionable metadata models,

D. Batista, A. Gonzalez-Beltran, S.-A. Sansone, and P. Rocca-Serra, “Machine actionable metadata models,” Scientific Data, vol. 9, no. 1, p. 592, 2022

work page 2022

[11] [11]

Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks,

A. Szymanski, N. Ziems, H. A. Eicher-Miller, T. J.-J. Li, M. Jiang, and R. A. Metoyer, “Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks,” inProceedings of the 30th International Conference on Intelligent User Interfaces, 2025, pp. 952–966

work page 2025

[12] [12]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

work page 2023

[13] [13]

Data statements for natural language processing: Toward mitigating system bias and enabling better science,

E. M. Bender and B. Friedman, “Data statements for natural language processing: Toward mitigating system bias and enabling better science,”Transactions of the Association for Computational Linguistics, vol. 6, pp. 587–604, 2018. [Online]. Available: https://aclanthology.org/Q18-1041/

work page 2018

[14] [14]

Complementarity, f-score, and nlp evaluation,

L. Derczynski, “Complementarity, f-score, and nlp evaluation,” inProceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 261–266

work page 2016

[15] [15]

A review of evaluation metrics in machine learning algorithms,

G. Naidu, T. Zuva, and E. M. Sibanda, “A review of evaluation metrics in machine learning algorithms,” in Computer science on-line conference. Springer, 2023, pp. 15–25

work page 2023

[16] [16]

Scikit-learn: Machine learning in python,

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourget al., “Scikit-learn: Machine learning in python,”the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

work page 2011

[17] [17]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901

[18] [18]

RAFT: A real-world few-shot text classification benchmark,

N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham, C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier, M. Noetel, and A. Stuhlmüller, “RAFT: A real-world few-shot text classification benchmark,” in NeurIPS Datasets and Benchmarks, 2021. [Online]. Available: https://datasets-benchmarks-proceedings. neurips.cc/paper/2021/file/ca46c1b9512a7a8315fa...

work page 2021

[19] [19]

Clues: Few-shot learning evaluation in natural language understanding,

S. Mukherjee, X. Liu, G. Zheng, S. Hosseini, H. Cheng, G. Yang, C. Meek, A. Awadallah, and J. Gao, “Clues: Few-shot learning evaluation in natural language understanding,” inThirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021. [Online]. Available: https://datasets-benchmarks-proceedings.ne...

work page 2021

[20] [20]

True few-shot learning with language models,

E. Perez, D. Kiela, and K. Cho, “True few-shot learning with language models,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 11 054–11 070. [Online]. Available: https: //proceedings.neurips.cc/paper_files/paper/2021/file/5c0492567...

work page 2021

[21] [21]

A systematic review of prompt injection attacks on large language models: Trends, taxonomy, evaluation, defenses and opportunities,

J. D. Duarte, G. D. Cândido, J. R. A. De Britto Filho, J. S. Neto, E. J. Costa, J. P. J. Da Costa, and L. P. De Melo, “A systematic review of prompt injection attacks on large language models: Trends, taxonomy, evaluation, defenses and opportunities,”IEEE Access, 2026

work page 2026

[22] [22]

The synthetic data vault,

N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in2016 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 2016, pp. 399–410

work page 2016

[23] [23]

Learning to generate synthetic data via compositing,

S. Tripathi, S. Chandra, A. Agrawal, A. Tyagi, J. M. Rehg, and V . Chari, “Learning to generate synthetic data via compositing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 461–470

work page 2019

[24] [24]

Synthetic data generation: A comparative study,

M. Endres, A. Mannarapotta Venugopal, and T. S. Tran, “Synthetic data generation: A comparative study,” inProceedings of the 26th international database engineered applications symposium, 2022, pp. 94–102

work page 2022

[25] [25]

Synthetic data generation with large language models for text classification: Potential and limitations,

Z. Li, H. Zhu, Z. Lu, and M. Yin, “Synthetic data generation with large language models for text classification: Potential and limitations,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 10 443–10 461. 17

work page 2023

[26] [26]

On llms-driven synthetic data gen- eration, curation, and evaluation: A survey,

L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, and H. Wang, “On llms-driven synthetic data gen- eration, curation, and evaluation: A survey,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 11 065–11 082

work page 2024

[27] [27]

Sequential subset matching for dataset distillation,

J. Du, Q. Shi, and J. T. Zhou, “Sequential subset matching for dataset distillation,”Advances in Neural Information Processing Systems, vol. 36, pp. 67 487–67 504, 2023

work page 2023

[28] [28]

Helpsteer 2: Open-source dataset for training top-performing reward models,

Z. Wang, Y . Dong, O. Delalleau, J. Zeng, G. Shen, D. Egert, J. J. Zhang, M. N. Sreedhar, and O. Kuchaiev, “Helpsteer 2: Open-source dataset for training top-performing reward models,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, In...

work page 2024

[29] [29]

Improving LLM-as-a-judge inference with the judgment distribution,

V . Wang, M. J. Zhang, and E. Choi, “Improving LLM-as-a-judge inference with the judgment distribution,” inFindings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 23 173–23 199. [Online]. Available: htt...

work page 2025

[30] [30]

Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering,

R. Wang, J. Guo, C. Gao, G. Fan, C. Y . Chong, and X. Xia, “Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering,”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1955–1977, 2025

work page 1955

[31] [31]

From generation to judgment: Opportunities and challenges of LLM-as-a-judge,

D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y . Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, and H. Liu, “From generation to judgment: Opportunities and challenges of LLM-as-a-judge,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Pen...

work page 2025

[32] [32]

Ask me like I’m human: LLM-based evaluation with for-human instructions correlates better with human evaluations than human judges,

R. Huidrom and A. Belz, “Ask me like I’m human: LLM-based evaluation with for-human instructions correlates better with human evaluations than human judges,” inProceedings of the 4th Table Representation Learning Workshop, S. Chang, M. Hulsebos, Q. Liu, W. Chen, and H. Sun, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 98...

work page 2025

[33] [33]

Training compute-optimal large language models,

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” inAdvances in Neural Inform...

work page 2022

[34] [34]

Observational scaling laws and the predictability of langauge model performance,

Y . Ruan, C. J. Maddison, and T. Hashimoto, “Observational scaling laws and the predictability of langauge model performance,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 15 841–15 892. [Online]. Available: https://...

work page 2024

[35] [35]

On the codesign of scientific experiments and industrial systems,

T. Dorigo, P. Vischia, S. Abbas, T. Adewumi, L. Alkhaled, L. Arsini, M. Awais, M. Borisyak, A. Bóta, F. Buryet al., “On the codesign of scientific experiments and industrial systems,”arXiv preprint arXiv:2603.26613, 2026

work page arXiv 2026

[36] [36]

Compute optimal scaling of skills: Knowledge vs reasoning,

N. Roberts, N. S. Chatterji, S. Narang, M. Lewis, and D. Hupkes, “Compute optimal scaling of skills: Knowledge vs reasoning,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13 295–13 316. [Online]. Ava...

work page 2025

[37] [37]

On the limitations of large language models (LLMs): False attribution,

T. Adewumi, N. Habib, L. Alkhaled, and E. Barney, “On the limitations of large language models (LLMs): False attribution,” inProceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, G. Angelova, M. Kunilovskaya, M. Escribe, and R. Mitkov, Eds. Varna, Bulgaria...

work page 2025

[38] [38]

Ai must not be fully autonomous,

T. Adewumi, L. Alkhaled, F. Imbert, H. Han, N. Habib, and K. Löwenmark, “Ai must not be fully autonomous,”arXiv preprint arXiv:2507.23330, 2025. 18

work page arXiv 2025

[39] [39]

Judging the judges: A systematic study of position bias in llm-as-a-judge,

L. Shi, C. Ma, W. Liang, X. Diao, W. Ma, and S. V osoughi, “Judging the judges: A systematic study of position bias in llm-as-a-judge,” inProceedings of the 14th International Joint Conference on Natural Lan- guage Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025, pp. 292–314

work page 2025

[40] [40]

The javascript object notation (json) data interchange format,

T. Bray, “The javascript object notation (json) data interchange format,” https://www.rfc-editor.org/rfc/ rfc8259, 2017, rFC 8259

work page 2017

[41] [41]

Interoperability for provenance-aware databases using {PROV} and {JSON},

X. Niu, B. Glavic, D. Gawlick, Z. H. Liu, V . Krishnaswamy, and V . Radhakrishnan, “Interoperability for provenance-aware databases using {PROV} and {JSON},” in7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15), 2015

work page 2015

[42] [42]

Ai-assisted json schema creation and mapping,

F. Neubauer, B. Uekermann, and J. Pleiss, “Ai-assisted json schema creation and mapping,” in2025 ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems Compan- ion (MODELS-C). IEEE, 2025, pp. 79–83

work page 2025

[43] [43]

Survey on json data modelling,

T. Lv, P. Yan, and W. He, “Survey on json data modelling,” inJournal of physics: conference series, vol. 1069, no. 1. IOP Publishing, 2018, p. 012101

work page 2018

[44] [44]

Jindex: Json and index search system for plant germplasm database,

T. Whairit, B. Phadermrod, and V . Attasena, “Jindex: Json and index search system for plant germplasm database,”Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 8, p. 101701, 2023. 19 20

work page 2023