STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3
The pith
STELLAR-E generates synthetic LLM-evaluation datasets whose LLM-as-a-judge scores differ from real language-specific benchmarks by an average of +5.7%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STELLAR-E is a fully automated two-stage system: it modifies the TGRT Self-Instruct framework to generate custom synthetic datasets of any size, then applies an evaluation pipeline of statistical and LLM-as-a-judge metrics to confirm their quality. The resulting datasets show an average difference of +5.7% in LLM-as-a-judge scores against language-specific benchmarks, indicating they are suitable for comprehensive LLM application assessment.
What carries the argument
The two-stage structure: a synthetic data engine based on modified Self-Instruct for controllable generation and an evaluation pipeline that combines statistical metrics with LLM-based scoring to assess dataset applicability.
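A minimal sketch of that two-stage shape may help fix ideas; every function name and metric below is hypothetical, a stand-in for the paper's actual engine and pipeline rather than its API:

```python
# Illustrative sketch of the two-stage structure described above.
# All names, the unique-topic "diversity" metric, and the constant
# judge score are invented placeholders, not STELLAR-E's interface.

def synthetic_data_engine(domain, language, size):
    """Stage 1: controllable generation (modified Self-Instruct, sketched)."""
    topics = [f"{domain} topic {i}" for i in range(size)]   # topic seeding
    return [{"topic": t, "instruction": f"Explain {t} in {language}"}
            for t in topics]                                # instruction synthesis

def evaluation_pipeline(dataset, judge=lambda ex: 0.8):
    """Stage 2: statistical metrics plus an LLM-as-a-judge placeholder."""
    diversity = len({ex["topic"] for ex in dataset}) / len(dataset)
    scores = [judge(ex) for ex in dataset]  # stand-in for judge-model calls
    return {"diversity": diversity, "judge_mean": sum(scores) / len(scores)}

dataset = synthetic_data_engine("finance", "German", size=20)
report = evaluation_pipeline(dataset)
```

The point of the sketch is only that generation and evaluation are separable stages: the engine is parameterized by domain, language, and size, and the pipeline consumes its output without feeding back into it.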
Load-bearing premise
LLM-as-a-judge scores, together with the chosen statistical metrics, reliably indicate the synthetic datasets' quality for real-world use and introduce no biases absent from human-curated data.
What would settle it
A study showing that LLM rankings or performance assessments differ substantially when using the synthetic datasets versus traditional benchmarks, or human review revealing quality issues missed by the automated metrics.
Original abstract
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost for manual creation. Existing automated benchmarking methods are often limited by relying on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E - a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human inputs without depending on existing datasets. The system is structured in two stages: (1) We modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in terms of LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of big and small LLMs. While real datasets remain slightly more challenging for LLMs especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents STELLAR-E, a two-stage automated framework for generating custom-sized synthetic evaluation datasets for LLMs without relying on existing data. Stage 1 modifies the TGRT Self-Instruct pipeline for controllable, domain- and language-adaptable generation; Stage 2 applies an evaluation pipeline combining statistical metrics and LLM-as-a-judge scoring. The central empirical claim is that the resulting synthetic datasets achieve an average +5.7% higher LLM-as-a-judge score than existing language-specific benchmarks, supporting their use for comprehensive assessment of both large and small LLMs while noting that real datasets remain slightly more challenging.
Significance. If the core claim can be externally validated, the work would offer a practical, scalable alternative to labor-intensive manual benchmark creation, directly addressing privacy, regulatory, and cost barriers in multilingual and domain-specific LLM evaluation. The emphasis on minimal human input and end-to-end automation is a strength for rapid iteration in application-specific testing.
major comments (3)
- [Abstract, §4] Abstract and §4 (Results): The central +5.7% average difference in LLM-as-a-judge scores is reported without sample sizes, variance measures, statistical significance tests, or error bars. This omission makes it impossible to determine whether the difference is robust or merely an artifact of the chosen judge model and prompt.
- [§3.2, §4.1] §3.2 and §4.1: The evaluation pipeline relies on LLM-as-a-judge scores both to validate generated data and to compute the primary quality metric. Because generation itself uses a modified Self-Instruct LLM pipeline, this creates a closed loop; no external human judgment or downstream task correlation is provided to test whether the synthetic data reproduces the same difficulty ordering or error patterns as human-curated benchmarks.
- [§4.2] §4.2: The claim that synthetic datasets are of “comparable quality for comprehensive assessment of big and small LLMs” rests on the untested assumption that LLM-as-a-judge scores detect no new distributional biases relative to real data. No ablation or comparison is shown that measures how model rankings or per-category performance shift when switching from real to synthetic test sets.
minor comments (3)
- [Abstract] The abstract and introduction would benefit from explicit enumeration of the languages and domains used in the reported experiments.
- [§3.3] Notation for the statistical metrics in the evaluation pipeline should be defined more precisely, including formulas or pseudocode for how they are aggregated with the LLM judge scores.
- [§4] Figure captions and axis labels in the results section should indicate the exact judge model and prompt template used for scoring.
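As an illustration of the kind of explicit formula the §3.3 comment requests, one plausible aggregation rule is a weighted blend; the 0.5 weight and the metric names here are entirely hypothetical, not the paper's:

```python
# Hypothetical aggregation of statistical metrics with a judge score.
# The weight and metric names are illustrative only.

def aggregate_quality(stat_metrics, judge_score, w_judge=0.5):
    """Blend normalized statistical metrics (each in [0, 1]) with an
    LLM-as-a-judge score (also in [0, 1]) into one quality number."""
    stat_part = sum(stat_metrics.values()) / len(stat_metrics)
    return w_judge * judge_score + (1 - w_judge) * stat_part

q = aggregate_quality({"diversity": 0.9, "coherence": 0.8}, judge_score=0.84)
```

Even a two-line rule like this, stated in the paper, would let readers reproduce the reported scores and probe sensitivity to the weighting.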
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify the presentation and strengthen the empirical claims in our work on STELLAR-E. We address each major comment point by point below, committing to revisions that improve statistical rigor and transparency while preserving the core automated framework.
Point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Results): The central +5.7% average difference in LLM-as-a-judge scores is reported without sample sizes, variance measures, statistical significance tests, or error bars. This omission makes it impossible to determine whether the difference is robust or merely an artifact of the chosen judge model and prompt.
Authors: We agree that the current reporting of the +5.7% average improvement lacks sufficient statistical detail. In the revised manuscript we will explicitly state the number of synthetic datasets generated per language/domain, the number of evaluation runs, standard deviations, and error bars. We will also add paired statistical significance tests (e.g., Wilcoxon signed-rank or t-tests) across the language-specific benchmarks to demonstrate that the observed difference is robust rather than an artifact of a single judge model or prompt. revision: yes
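The committed paired test is straightforward to run. A self-contained Wilcoxon signed-rank test with a normal approximation might look like the sketch below; the paired judge scores are invented placeholders, not the paper's data:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test, normal approximation.
    Returns (W+, two-sided p). Ties in |d| receive average ranks.
    For very few pairs, an exact-distribution test is preferable."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to tie groups
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Invented per-benchmark judge scores: synthetic vs. real.
w_plus, p_value = wilcoxon_signed_rank(
    [0.82, 0.79, 0.88, 0.75, 0.91, 0.84],
    [0.77, 0.76, 0.80, 0.74, 0.85, 0.78])
```

With only a handful of language-specific benchmarks, the exact null distribution (or a permutation test) would give more trustworthy p-values than the normal approximation sketched here.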
-
Referee: [§3.2, §4.1] §3.2 and §4.1: The evaluation pipeline relies on LLM-as-a-judge scores both to validate generated data and to compute the primary quality metric. Because generation itself uses a modified Self-Instruct LLM pipeline, this creates a closed loop; no external human judgment or downstream task correlation is provided to test whether the synthetic data reproduces the same difficulty ordering or error patterns as human-curated benchmarks.
Authors: We acknowledge the risk of circularity when LLM-based methods are used for both data generation and quality scoring. The generation stage modifies TGRT Self-Instruct for controllable, domain-adaptable output, while the evaluation stage combines statistical metrics (diversity, coherence, length distribution) with LLM-as-a-judge scoring. To mitigate the concern, we will expand §4.1 to include a limited human validation study on a random subset of synthetic examples and report Spearman correlations between LLM-as-a-judge scores and human ratings. We will also add a direct comparison of difficulty ordering by running the same suite of LLMs on both synthetic and real benchmarks and tabulating agreement in per-model error patterns. revision: partial
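The promised agreement check reduces to a rank correlation between judge scores and human ratings. A minimal pure-Python Spearman correlation (the ratings are invented for illustration) could be:

```python
import math

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation on average ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):              # average ranks over tie groups
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented ratings: LLM-judge scores vs. human scores on the same items.
rho = spearman_rho([4, 5, 3, 5, 2, 4], [4, 5, 2, 4, 1, 3])
```

A high rho on a random human-rated subset would be direct evidence against the circularity concern; a low one would indicate the judge and humans disagree on what counts as a good example.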
-
Referee: [§4.2] §4.2: The claim that synthetic datasets are of “comparable quality for comprehensive assessment of big and small LLMs” rests on the untested assumption that LLM-as-a-judge scores detect no new distributional biases relative to real data. No ablation or comparison is shown that measures how model rankings or per-category performance shift when switching from real to synthetic test sets.
Authors: The manuscript already evaluates both large and small LLMs on the synthetic datasets and notes that real data remain slightly more challenging, particularly for smaller models. To directly address the potential for undetected biases, we will add an ablation subsection in §4.2 that reports model rankings and per-category accuracy shifts when the same models are tested on matched real versus synthetic sets. This will quantify any re-ranking or category-specific divergence introduced by the synthetic data. revision: yes
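The ranking-shift ablation can be quantified with a rank-agreement statistic over the same models scored on both test sets. A no-ties Kendall tau sketch, with invented per-model accuracies:

```python
def kendall_tau(x, y):
    """Kendall's tau-a between two paired score lists (assumes no ties):
    +1 means identical model ordering, -1 a fully reversed ordering."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented accuracies for five models on a real vs. a synthetic test set;
# one adjacent pair of models swaps rank on the synthetic set.
real = [0.81, 0.74, 0.69, 0.62, 0.55]
synthetic = [0.86, 0.72, 0.80, 0.70, 0.61]
tau = kendall_tau(real, synthetic)
```

Reporting tau (or the number of pairwise rank inversions) per category would directly quantify the re-ranking risk the referee raises.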
Circularity Check
No significant circularity in evaluation of synthetic dataset quality
Full rationale
The paper's core claim is an empirical measurement: synthetic datasets produced via a modified Self-Instruct pipeline exhibit an average +5.7% difference in LLM-as-a-judge scores relative to existing language-specific benchmarks. This difference is obtained by applying the judge model and statistical metrics to both the generated data and the reference benchmarks, yielding a direct, non-tautological comparison rather than any definitional equivalence or fitted input renamed as a prediction. No equations, uniqueness theorems, or self-citations from overlapping authors are invoked to force the quality conclusion; the evaluation pipeline remains independent of the generation process in its reported metrics. The methodology is therefore self-contained as an automated, scalable benchmarking procedure.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-as-a-judge scores are a reliable proxy for the quality and applicability of synthetic datasets in LLM evaluations
Reference graph
Works this paper leans on
-
[1]
Benchagents: Automated benchmark creation with agent interaction, 2024
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, and Vidhisha Balachandran. Benchagents: Automated benchmark creation with agent interaction, 2024. URL https://arxiv.org/abs/2410.22584
-
[2]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models, 2023. URL https://arxiv.org/abs/2307.03109
-
[3]
Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation, 2025
Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Recent advances in large language model benchmarks against data contamination: From static to dynamic evaluation, 2025. URL https://arxiv.org/abs/2502.17521
-
[4]
Do llm evaluators prefer themselves for a reason?, 2025
Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, and Yu Meng. Do llm evaluators prefer themselves for a reason?, 2025. URL https://arxiv.org/abs/2504.03846
-
[5]
Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Monica Maria Lucia Sebillo, and Giandomenico Solimando. Augmenting anonymized data with ai: Exploring the feasibility and limitations of large language models in data enrichment, 2025. URL https://arxiv.org/abs/2504.03778
-
[6]
Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. Bias and unfairness in information retrieval systems: New challenges in the llm era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, pages 6437–6447. ACM, August 2024. doi: 10.1145/3637528.3671458. URL http://dx.doi.org/10.1145/3637528.3671458
-
[7]
RAGAs: Automated evaluation of retrieval augmented generation
Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. In Nikolaos Aletras and Orphee De Clercq, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julians, Malta, March 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.eacl-demo.16. URL https://aclanthology.org/2024.eacl-demo.16/
-
[9]
Caoyun Fan, Jidong Tian, Yitian Li, Wenqing Chen, Hao He, and Yaohui Jin. Chain-of-thought tuning: Masked language models can also think step by step in natural language understanding, 2023. URL https://arxiv.org/abs/2310.11721
-
[10]
Pnina Fichman. A comparative assessment of answer quality on four question answering sites. Journal of Information Science, 37(5):476–486, August 2011. ISSN 1741-6485. doi: 10.1177/0165551511415584. URL http://dx.doi.org/10.1177/0165551511415584
-
[11]
Translationese in swedish novels translated from english
Martin Gellerstam. Translationese in swedish novels translated from english, 1986. URL https://api.semanticscholar.org/CorpusID:59685951
-
[12]
Unnatural instructions: Tuning language models with (almost) no human labor, 2023
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor, 2023. URL http://dx.doi.org/10.18653/v1/2023.acl-long.806
-
[13]
Datagen: Unified synthetic dataset generation via large language models, 2025
Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, and Lichao Sun. Datagen: Unified synthetic dataset generation via large language models, 2025. URL https://arxiv.org/abs/2406.18966
-
[14]
Kiseung Kim and Jay-Yoon Lee. RE-RAG: Improving open-domain QA performance and interpretability with relevance estimator in retrieval-augmented generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22149–22161, Miami, Florida, USA, November 2024
-
[15]
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024. URL https://arxiv.org/abs/2412.05579
-
[16]
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. URL https://arxiv.org/abs/2107.13586
-
[17]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. URL https://arxiv.org/abs/2303.16634
-
[18]
Datasets for large language models: A comprehensive survey, 2024
Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. Datasets for large language models: A comprehensive survey, 2024. URL https://arxiv.org/abs/2402.18041
-
[19]
On llms-driven synthetic data generation, curation, and evaluation: A survey, 2024
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On llms-driven synthetic data generation, curation, and evaluation: A survey, 2024. URL https://arxiv.org/abs/2406.15126
-
[20]
Mihai Nadăș, Laura Dioșan, and Andreea Tomescu. Synthetic data generation using large language models: Advances in text and code. IEEE Access, 13:134615–134633, 2025. ISSN 2169-3536. doi: 10.1109/access.2025.3589503. URL http://dx.doi.org/10.1109/ACCESS.2025.3589503
-
[21]
An empirical study of the non-determinism of chatgpt in code generation, 2025
Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of chatgpt in code generation. ACM Transactions on Software Engineering and Methodology, 34(2):1–28, January 2025. ISSN 1557-7392. doi: 10.1145/3697010. URL http://dx.doi.org/10.1145/3697010
-
[22]
Transitioning from mlops to llmops: Navigating the unique challenges of large language models, 2025
Saurabh Pahune and Zahid Akhtar. Transitioning from mlops to llmops: Navigating the unique challenges of large language models. Information, 16(2):87, 2025
-
[23]
Llm evaluators recognize and favor their own generations, 2024
Arjun Panickssery, Samuel R. Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076
-
[24]
How to get your llm to generate challenging problems for evaluation, 2025
Arkil Patel, Siva Reddy, and Dzmitry Bahdanau. How to get your llm to generate challenging problems for evaluation, 2025. URL https://arxiv.org/abs/2502.14678
-
[25]
Industrial applications of large language models, 2025
Mubashar Raza, Zarmina Jahangir, Muhammad Bilal Riaz, Muhammad Jasim Saeed, and Muhammad Awais Sattar. Industrial applications of large language models. Scientific Reports, 15(1):13755, April 2025. ISSN 2045-2322. doi: 10.1038/s41598-025-98483-1. URL http://dx.doi.org/10.1038/s41598-025-98483-1
-
[26]
Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering, 2022
Priyanka Sen, Alham Fikri Aji, and Amir Saffari. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering, 2022. URL https://arxiv.org/abs/2210.01613
-
[27]
Yourbench: Easy custom evaluation sets for everyone, 2025
Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskaya, Thomas Wolf, Gokhan Tur, and Dilek Hakkani-Tür. Yourbench: Easy custom evaluation sets for everyone, 2025. URL https://arxiv.org/abs/2504.01833
-
[28]
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, André F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, ...
-
[29]
Large language models are inconsistent and biased evaluators, 2024
Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are inconsistent and biased evaluators, 2024. URL https://arxiv.org/abs/2405.01724
-
[30]
Principle-driven self-alignment of language models from scratch with minimal human supervision
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023. URL https://arxiv.org/abs/2305.03047
-
[31]
Stanford alpaca: An instruction-following llama model, 2023
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023
-
[32]
Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025
Shuting Wang, Jiejun Tan, Zhicheng Dou, and Ji-Rong Wen. Omnieval: An omnidirectional and automatic rag evaluation benchmark in financial domain, 2025. URL https://arxiv.org/abs/2412.13018
-
[33]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. URL https://arxiv.org/abs/2212.10560
-
[34]
Codeclm: Aligning language models with tailored synthetic data, 2024
Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, and Tomas Pfister. Codeclm: Aligning language models with tailored synthetic data, 2024
-
[36]
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682
-
[37]
Vikas Yadav, Hyuk Joon Kwon, Vijay Srinivasan, and Hongxia Jin. Explicit diversity conditions for effective question answer generation with large language models, 2024. URL https://arxiv.org/abs/2406.17990
-
[38]
On adversarial robustness and out-of-distribution robustness of large language models, 2024
April Yang, Jordan Tab, Parth Shah, and Paul Kotchavong. On adversarial robustness and out-of-distribution robustness of large language models, 2024. URL https://arxiv.org/abs/2412.10535
-
[39]
Revisiting out-of-distribution robustness in nlp: Benchmark, analysis, and llms evaluations, 2023
Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, and Maosong Sun. Revisiting out-of-distribution robustness in nlp: Benchmark, analysis, and llms evaluations, 2023. URL https://arxiv.org/abs/2306.04618
-
[40]
Don't make your llm an evaluation benchmark cheater, 2023
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don't make your llm an evaluation benchmark cheater, 2023. URL https://arxiv.org/abs/2311.01964
Appendix A: G-Eval prompts (recovered from the paper)
A.1 Prompt for topics generation. A Llama-style chat template instructing the model to generate 20 diverse topics for a specific question type, with these requirements:
- Try not to repeat the words for each topic to maximize diversity
- Each topic must contain three words maximum
- Topics must be 20 in total for each question type, always
- Each topic should be a noun phrase, and its first word should be capitalized
- The topics should be closely related to the given question type
- Output the answer in JSON: {"topics": a JSON list of 20 topics related to the given question type, following the requirements}
- The list of topics must be a JSON list, not surrounded by quotes, just by square brackets
A.2 Auxiliary prompt for topic evaluation when filtering topics. The same template, asking for a single topic:
- The topic must contain three words maximum
- Topics are not questions, just general topics
- A topic should be a noun phrase, and its first word should be capitalized
- The topic should be closely related to the given question type
- Output the answer in JSON: {"topic": one topic related to the given question type, following the requirements}
A.3–A.7 Prompts for instruction generation and difficulty improvement. The prompt supplies {topics} and {instruction_domain} to help brainstorm or improve the instructions, with these generation requirements:
- Try not to repeat the words for each instruction to maximize diversity
- The language used for the instructions should be diverse; for example, combine questions with imperative instructions
- The set should include diverse types of instructions, such as: {instruction_types}
- Each instruction should be short and concise, as a single sentence; either an imperative sentence or a question is permitted
- The instruction should be in {instruction_language}
- Every quote inside each instruction should be single-quoted, not double-quoted; do not escape single quotes inside the instruction
- Output the answer in JSON: {"instructions": a JSON list of {number_of_instructions} instructions related to the given topics, following the requirements}
The difficulty of the instructions should be improved in one or more of the following ways:
- Paraphrase the instructions to make them more complex or challenging
- Introduce ambiguity or multiple interpretations to the instructions to make them more difficult
- When the instructions have multiple choices, paraphrase the choices to make them more complex or challenging, or add a new plausible choice that is not the correct answer
- Do not change the content or language of the instructions; just improve their difficulty
- Do not output more instructions than the provided ones; just improve the provided ones
A.8 Prompt for single answer generation:
- The answer must be semantically correct for the given instruction
- The answer must be syntactically correct for the given instruction
- In case the instruction asks about something personal, simply state that you don't know the answer
discussion (0)