pith. machine review for the scientific record.

arxiv: 2604.24544 · v1 · submitted 2026-04-27 · 💻 cs.AI · cs.CL

Recognition: unknown

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords synthetic data generation · LLM benchmarking · automated evaluation · multilingual datasets · Self-Instruct method · LLM-as-a-judge · domain-specific testing

The pith

STELLAR-E creates synthetic datasets for LLM evaluation on which LLM-as-a-judge scores average only 5.7 percent higher than on real benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for automatically generating synthetic datasets to test large language models in specific domains and languages, addressing the difficulty of collecting real data under privacy constraints, regulatory restrictions, and the cost of manual curation. The approach uses a modified Self-Instruct method to produce the data and then applies both statistical measures and LLM-as-a-judge scoring to verify its quality. Results indicate the synthetic sets support comparable evaluation: LLM-as-a-judge scores on them average 5.7 percent higher than on existing benchmarks, with the real data remaining slightly more challenging. This offers a scalable way to create tailored evaluation suites for assessing both large and small models without relying on pre-existing data.

Core claim

STELLAR-E is a fully automated two-stage system that modifies the TGRT Self-Instruct framework to generate custom synthetic datasets of any size and then uses an evaluation pipeline with statistical and LLM-as-a-judge metrics to confirm their quality; these datasets show an average difference of +5.7% in LLM-as-a-judge scores compared to language-specific benchmarks, indicating they are suitable for comprehensive LLM application assessment.
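
A minimal sketch of how an average judge-score difference of this kind is computed from paired per-benchmark results; the benchmark names and score values below are invented for illustration and are not figures from the paper.

    # Paired LLM-as-a-judge scores (0-100 scale): model performance on the synthetic
    # dataset vs. on the matching real benchmark. Placeholder numbers, not the paper's data.
    paired_scores = {
        "benchmark_en": {"synthetic": 82.0, "real": 77.1},
        "benchmark_it": {"synthetic": 79.5, "real": 73.0},
    }

    # Signed difference per benchmark (synthetic minus real), averaged across benchmarks.
    diffs = [s["synthetic"] - s["real"] for s in paired_scores.values()]
    avg_diff = sum(diffs) / len(diffs)
    print(f"average judge-score difference: {avg_diff:+.1f} points")  # prints +5.7 for these placeholders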

What carries the argument

The two-stage structure: a synthetic data engine based on modified Self-Instruct for controllable generation and an evaluation pipeline that combines statistical metrics with LLM-based scoring to assess dataset applicability.
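
A compact sketch of that two-stage shape; the function and method names (generate_topics, generate_instruction, judge.score, and the stats helpers) are placeholders assumed for illustration, not interfaces taken from the paper.

    # Stage 1 sketch: Self-Instruct-style synthetic data engine built around placeholder LLM calls.
    def build_synthetic_dataset(llm, domain, language, size):
        topics = llm.generate_topics(domain, language)        # seed topics for the target domain
        dataset = []
        while len(dataset) < size:
            instruction = llm.generate_instruction(topics, language)
            answer = llm.generate_answer(instruction)
            dataset.append({"instruction": instruction, "answer": answer})
        return dataset

    # Stage 2 sketch: evaluation pipeline combining statistical metrics with LLM-as-a-judge scoring.
    def evaluate_dataset(dataset, judge, stats):
        return {
            "diversity": stats.diversity(dataset),            # statistical metrics (placeholders)
            "length_profile": stats.length_distribution(dataset),
            "judge_score": sum(judge.score(ex) for ex in dataset) / len(dataset),
        }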

Load-bearing premise

LLM-as-a-judge scores, together with the chosen statistical metrics, provide a reliable indication of the synthetic datasets' quality for real-world use and do not introduce biases absent from human-curated data.

What would settle it

A study showing that LLM rankings or performance assessments differ substantially when using the synthetic datasets versus traditional benchmarks, or human review revealing quality issues missed by the automated metrics.

Figures

Figures reproduced from arXiv: 2604.24544 by Alessio Sordo, Evgeny Bogdanov, Lingxiao Du, Maxim Romanovsky, Meeka-Hanna Lenisa.

Figure 1
Figure 1: Overview of the generation pipeline. Topic generation is followed by a topic filtering phase; a random subset of the filtered topics is then selected, and for each topic set, j instructions are generated using prompts designed to maximize both diversity and coverage. Subsequent quality-improvement and difficulty-enhancement steps further refine the instruction set and its corresponding answer set. view at source ↗
Figure 2
Figure 2: Quality Improvement. A feedback-loop approach [18, 12], also implemented in the pipeline, in which instances deemed low-quality under the evaluation metrics are given to an LLM that provides feedback to the generation LLM to regenerate the instance; the process repeats. A sketch of this loop follows the figure list. view at source ↗
Figure 3
Figure 3: Language Datasets Diagram. To evaluate the system's ability to generate language-specific datasets, the professionally translated Mintaka dataset is used as the ground-truth benchmark [25], with experiments on datasets in both English and Italian. view at source ↗
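
The generation flow summarized in Figures 1 and 2 can be sketched as below; every helper on the llm object (topic generation, relevance filtering, quality scoring, critique, regeneration) is a placeholder assumed for illustration, and the subset size, threshold, and round count are arbitrary.

    import random

    def generate_instances(llm, question_type, n_instructions, quality_threshold=0.7, max_rounds=3):
        # Figure 1: topic generation, topic filtering, random subset selection, instruction generation.
        topics = llm.generate_topics(question_type)
        topics = [t for t in topics if llm.topic_is_relevant(t, question_type)]
        topic_subset = random.sample(topics, k=min(5, len(topics)))
        instances = llm.generate_instructions(topic_subset, n_instructions)

        # Figure 2: feedback loop; instances scored below the threshold are critiqued and regenerated.
        for _ in range(max_rounds):
            scored = [(inst, llm.quality_score(inst)) for inst in instances]
            kept = [inst for inst, score in scored if score >= quality_threshold]
            low = [inst for inst, score in scored if score < quality_threshold]
            if not low:
                break
            regenerated = [llm.regenerate(inst, llm.critique(inst)) for inst in low]
            instances = kept + regenerated
        return instances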
read the original abstract

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost for manual creation. Existing automated benchmarking methods are often limited by relying on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E - a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human inputs without depending on existing datasets. The system is structured in two stages: (1) We modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in terms of LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of big and small LLMs. While real datasets remain slightly more challenging for LLMs especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents STELLAR-E, a two-stage automated framework for generating custom-sized synthetic evaluation datasets for LLMs without relying on existing data. Stage 1 modifies the TGRT Self-Instruct pipeline for controllable, domain- and language-adaptable generation; Stage 2 applies an evaluation pipeline combining statistical metrics and LLM-as-a-judge scoring. The central empirical claim is that the resulting synthetic datasets achieve an average +5.7% higher LLM-as-a-judge score than existing language-specific benchmarks, supporting their use for comprehensive assessment of both large and small LLMs while noting that real datasets remain slightly more challenging.

Significance. If the core claim can be externally validated, the work would offer a practical, scalable alternative to labor-intensive manual benchmark creation, directly addressing privacy, regulatory, and cost barriers in multilingual and domain-specific LLM evaluation. The emphasis on minimal human input and end-to-end automation is a strength for rapid iteration in application-specific testing.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Results): The central +5.7% average difference in LLM-as-a-judge scores is reported without sample sizes, variance measures, statistical significance tests, or error bars. This omission makes it impossible to determine whether the difference is robust or merely an artifact of the chosen judge model and prompt.
  2. [§3.2, §4.1] §3.2 and §4.1: The evaluation pipeline relies on LLM-as-a-judge scores both to validate generated data and to compute the primary quality metric. Because generation itself uses a modified Self-Instruct LLM pipeline, this creates a closed loop; no external human judgment or downstream task correlation is provided to test whether the synthetic data reproduces the same difficulty ordering or error patterns as human-curated benchmarks.
  3. [§4.2] §4.2: The claim that synthetic datasets are of “comparable quality for comprehensive assessment of big and small LLMs” rests on the untested assumption that LLM-as-a-judge scores detect no new distributional biases relative to real data. No ablation or comparison is shown that measures how model rankings or per-category performance shift when switching from real to synthetic test sets.
minor comments (3)
  1. [Abstract] The abstract and introduction would benefit from explicit enumeration of the languages and domains used in the reported experiments.
  2. [§3.3] Notation for the statistical metrics in the evaluation pipeline should be defined more precisely, including formulas or pseudocode for how they are aggregated with the LLM judge scores.
  3. [§4] Figure captions and axis labels in the results section should indicate the exact judge model and prompt template used for scoring.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the presentation and strengthen the empirical claims in our work on STELLAR-E. We address each major comment point by point below, committing to revisions that improve statistical rigor and transparency while preserving the core automated framework.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Results): The central +5.7% average difference in LLM-as-a-judge scores is reported without sample sizes, variance measures, statistical significance tests, or error bars. This omission makes it impossible to determine whether the difference is robust or merely an artifact of the chosen judge model and prompt.

    Authors: We agree that the current reporting of the +5.7% average improvement lacks sufficient statistical detail. In the revised manuscript we will explicitly state the number of synthetic datasets generated per language/domain, the number of evaluation runs, standard deviations, and error bars. We will also add paired statistical significance tests (e.g., Wilcoxon signed-rank or t-tests) across the language-specific benchmarks to demonstrate that the observed difference is robust rather than an artifact of a single judge model or prompt. revision: yes

  2. Referee: [§3.2, §4.1] §3.2 and §4.1: The evaluation pipeline relies on LLM-as-a-judge scores both to validate generated data and to compute the primary quality metric. Because generation itself uses a modified Self-Instruct LLM pipeline, this creates a closed loop; no external human judgment or downstream task correlation is provided to test whether the synthetic data reproduces the same difficulty ordering or error patterns as human-curated benchmarks.

    Authors: We acknowledge the risk of circularity when LLM-based methods are used for both data generation and quality scoring. The generation stage modifies TGRT Self-Instruct for controllable, domain-adaptable output, while the evaluation stage combines statistical metrics (diversity, coherence, length distribution) with LLM-as-a-judge scoring. To mitigate the concern, we will expand §4.1 to include a limited human validation study on a random subset of synthetic examples and report Spearman correlations between LLM-as-a-judge scores and human ratings. We will also add a direct comparison of difficulty ordering by running the same suite of LLMs on both synthetic and real benchmarks and tabulating agreement in per-model error patterns. revision: partial

  3. Referee: [§4.2] §4.2: The claim that synthetic datasets are of “comparable quality for comprehensive assessment of big and small LLMs” rests on the untested assumption that LLM-as-a-judge scores detect no new distributional biases relative to real data. No ablation or comparison is shown that measures how model rankings or per-category performance shift when switching from real to synthetic test sets.

    Authors: The manuscript already evaluates both large and small LLMs on the synthetic datasets and notes that real data remain slightly more challenging, particularly for smaller models. To directly address the potential for undetected biases, we will add an ablation subsection in §4.2 that reports model rankings and per-category accuracy shifts when the same models are tested on matched real versus synthetic sets. This will quantify any re-ranking or category-specific divergence introduced by the synthetic data. revision: yes
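
The checks promised in these responses map onto standard statistical tools; the sketch below uses invented score arrays purely for illustration, and the choice of the Wilcoxon signed-rank test, Kendall's tau, and Spearman correlation follows the rebuttal's own suggestions rather than anything reported in the paper.

    # (1) Paired significance test of judge scores on synthetic vs. real benchmarks,
    # (2) agreement between model rankings on the two test sets,
    # (3) correlation between LLM-judge scores and human ratings on a validation subset.
    from scipy.stats import wilcoxon, kendalltau, spearmanr

    synthetic_scores = [82.0, 79.5, 75.2, 88.1]   # per-model judge scores on synthetic data (assumed)
    real_scores      = [77.1, 73.0, 74.8, 85.9]   # same models on the real benchmark (assumed)

    stat, p = wilcoxon(synthetic_scores, real_scores)
    print(f"Wilcoxon signed-rank: stat={stat:.2f}, p={p:.3f}")

    tau, tau_p = kendalltau(synthetic_scores, real_scores)   # do the two sets rank models alike?
    print(f"Kendall tau between model rankings: {tau:.2f} (p={tau_p:.3f})")

    judge_scores  = [4, 5, 3, 4, 2, 5]            # per-example LLM-judge ratings (assumed)
    human_ratings = [4, 4, 3, 5, 2, 5]            # human ratings on the same subset (assumed)
    rho, rho_p = spearmanr(judge_scores, human_ratings)
    print(f"Spearman judge-human correlation: {rho:.2f} (p={rho_p:.3f})")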

Circularity Check

0 steps flagged

No significant circularity in evaluation of synthetic dataset quality

full rationale

The paper's core claim is an empirical measurement: synthetic datasets produced via a modified Self-Instruct pipeline exhibit an average +5.7% difference in LLM-as-a-judge scores relative to existing language-specific benchmarks. This difference is obtained by applying the judge model and statistical metrics to both the generated data and the reference benchmarks, yielding a direct, non-tautological comparison rather than any definitional equivalence or fitted input renamed as a prediction. No equations, uniqueness theorems, or self-citations from overlapping authors are invoked to force the quality conclusion; the evaluation pipeline remains independent of the generation process in its reported metrics. The methodology is therefore self-contained as an automated, scalable benchmarking procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unproven premise that LLM-as-a-judge scores serve as a faithful proxy for dataset quality and that the modified Self-Instruct process produces representative data without new biases.

axioms (1)
  • domain assumption: LLM-as-a-judge scores are a reliable proxy for the quality and applicability of synthetic datasets in LLM evaluations
    Invoked to interpret the +5.7% difference as evidence of comparable quality.

pith-pipeline@v0.9.0 · 5555 in / 1401 out tokens · 51244 ms · 2026-05-08T03:34:41.805719+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

91 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1] Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. BenchAgents: Automated benchmark creation with agent interaction, 2024. URL https://arxiv.org/abs/2410.22584

  2. [2] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models, 2023. URL https://arxiv.org/abs/2307.03109

  3. [3] Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation, 2025. URL https://arxiv.org/abs/2502.17521

  4. [4] Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, and Yu Meng. Do LLM evaluators prefer themselves for a reason?, 2025. URL https://arxiv.org/abs/2504.03846

  5. [5] Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Monica Maria Lucia Sebillo, and Giandomenico Solimando. Augmenting anonymized data with AI: Exploring the feasibility and limitations of large language models in data enrichment, 2025. URL https://arxiv.org/abs/2504.03778

  6. [6] Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. Bias and unfairness in information retrieval systems: New challenges in the LLM era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, pages 6437–6447. ACM, August 2024. doi: 10.1145/3637528.3671458. URL http://dx.doi.org/10.1145/3637528.3671458

  7. [7] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. In Nikolaos Aletras and Orphee De Clercq, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta, March 2024.

  8. [8] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Association for Computational Linguistics. doi: 10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024.eacl-demo.16/

  9. [9] Caoyun Fan, Jidong Tian, Yitian Li, Wenqing Chen, Hao He, and Yaohui Jin. Chain-of-thought tuning: Masked language models can also think step by step in natural language understanding, 2023. URL https://arxiv.org/abs/2310.11721

  10. [10] Pnina Fichman. A comparative assessment of answer quality on four question answering sites. Journal of Information Science, 37(5):476–486, August 2011. ISSN 1741-6485. doi: 10.1177/0165551511415584. URL http://dx.doi.org/10.1177/0165551511415584

  11. [11] Martin Gellerstam. Translationese in Swedish novels translated from English, 1986. URL https://api.semanticscholar.org/CorpusID:59685951

  12. [12] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2023. URL http://dx.doi.org/10.18653/v1/2023.acl-long.806

  13. [13] Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, and Lichao Sun. DataGen: Unified synthetic dataset generation via large language models, 2025. URL https://arxiv.org/abs/2406.18966

  14. [14] Kiseung Kim and Jay-Yoon Lee. RE-RAG: Improving open-domain QA performance and interpretability with relevance estimator in retrieval-augmented generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22149–22161, Miami, Florida, USA, November…

  15. [15] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579

  16. [16] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. URL https://arxiv.org/abs/2107.13586

  17. [17] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023. URL https://arxiv.org/abs/2303.16634

  18. [18] Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. Datasets for large language models: A comprehensive survey, 2024. URL https://arxiv.org/abs/2402.18041

  19. [19] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey, 2024. URL https://arxiv.org/abs/2406.15126

  20. [20] Mihai Nadăș, Laura Dioșan, and Andreea Tomescu. Synthetic data generation using large language models: Advances in text and code. IEEE Access, 13:134615–134633, 2025. ISSN 2169-3536. doi: 10.1109/access.2025.3589503. URL http://dx.doi.org/10.1109/ACCESS.2025.3589503

  21. [21] Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of ChatGPT in code generation. ACM Transactions on Software Engineering and Methodology, 34(2):1–28, January 2025. ISSN 1557-7392. doi: 10.1145/3697010. URL http://dx.doi.org/10.1145/3697010

  22. [22] Saurabh Pahune and Zahid Akhtar. Transitioning from MLOps to LLMOps: Navigating the unique challenges of large language models. Information, 16(2):87, 2025

  23. [23] Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076

  24. [24] Arkil Patel, Siva Reddy, and Dzmitry Bahdanau. How to get your LLM to generate challenging problems for evaluation, 2025. URL https://arxiv.org/abs/2502.14678

  25. [25] Mubashar Raza, Zarmina Jahangir, Muhammad Bilal Riaz, Muhammad Jasim Saeed, and Muhammad Awais Sattar. Industrial applications of large language models. Scientific Reports, 15(1):13755, April 2025. ISSN 2045-2322. doi: 10.1038/s41598-025-98483-1. URL http://dx.doi.org/10.1038/s41598-025-98483-1

  26. [26] Priyanka Sen, Alham Fikri Aji, and Amir Saffari. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering, 2022. URL https://arxiv.org/abs/2210.01613

  27. [27] Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskia, Thomas Wolf, Gokhan Tur, and Dilek Hakkani-Tür. YourBench: Easy custom evaluation sets for everyone, 2025. URL https://arxiv.org/abs/2504.01833

  28. [28] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, …

  29. [29] Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators, 2024. URL https://arxiv.org/abs/2405.01724

  30. [30] Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023. URL https://arxiv.org/abs/2305.03047

  31. [31] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  32. [32] Shuting Wang, Jiejun Tan, Zhicheng Dou, and Ji-Rong Wen. OmniEval: An omnidirectional and automatic RAG evaluation benchmark in financial domain, 2025. URL https://arxiv.org/abs/2412.13018

  33. [33] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions, 2023. URL https://arxiv.org/abs/2212.10560

  34. [34] Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, and Tomas Pfister. CodecLM: Aligning language models with tailored synthetic data,

  35. [35] URL https://arxiv.org/abs/2404.05875

  36. [36] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682

  37. [37] Vikas Yadav, Hyuk Joon Kwon, Vijay Srinivasan, and Hongxia Jin. Explicit diversity conditions for effective question answer generation with large language models, 2024. URL https://arxiv.org/abs/2406.17990

  38. [38] April Yang, Jordan Tab, Parth Shah, and Paul Kotchavong. On adversarial robustness and out-of-distribution robustness of large language models, 2024. URL https://arxiv.org/abs/2412.10535

  39. [39] Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, and Maosong Sun. Revisiting out-of-distribution robustness in NLP: Benchmark, analysis, and LLMs evaluations, 2023. URL https://arxiv.org/abs/2306.04618

  40. [40] Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don't make your LLM an evaluation benchmark cheater. URL https://arxiv.org/abs/2311.01964
