SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Pith reviewed 2026-05-22 06:21 UTC · model grok-4.3
The pith
SynAE shows that synthetic data for tool-calling agents needs checks across validity, fidelity, and diversity rather than any single metric.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SynAE assesses the validity, fidelity, and diversity of synthetic data for multi-turn tool-calling agents across four metric categories: task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation. When tested on recent agent benchmarks and data produced by realistic controlled generation schemes that simulate common failure modes, SynAE detects fine-grained variations in these dimensions and demonstrates that no single metric fully characterizes synthetic data quality, motivating multi-axis evaluation instead.
What carries the argument
The SynAE evaluation framework, which applies separate metric sets for validity, fidelity, and diversity within each of four categories: instructions, tool calls, outputs, and downstream results.
If this is right
- Synthetic data generators can be adjusted to fix specific weaknesses in validity or diversity once SynAE identifies them.
- Agent evaluations become more reliable when the test data passes checks on all four metric categories.
- Different synthetic data methods can be compared directly by their scores on the same multi-axis set.
- Practitioners can decide whether a given synthetic dataset is sufficient for pre-deployment testing by inspecting its profile across the axes.
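The idea of comparing generators by their profile across axes can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not SynAE's actual API: `QualityProfile`, `weakest`, and `compare` are hypothetical names, and the sketch assumes every metric has already been normalized to a score in [0, 1] where higher is better.

```python
from dataclasses import dataclass, field

# Dimension and category names follow the abstract; the identifiers are assumptions.
DIMENSIONS = ("validity", "fidelity", "diversity")
CATEGORIES = (
    "instructions_and_responses",
    "tool_calls",
    "final_outputs",
    "downstream_evaluation",
)


@dataclass
class QualityProfile:
    """Hypothetical container: scores keyed by (dimension, category)."""
    name: str
    scores: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject cells that are not part of the dimension x category grid.
        for dim, cat in self.scores:
            if dim not in DIMENSIONS or cat not in CATEGORIES:
                raise ValueError(f"unknown cell: {(dim, cat)}")

    def weakest(self, k=1):
        """Return the k lowest-scoring (cell, score) pairs."""
        return sorted(self.scores.items(), key=lambda kv: kv[1])[:k]


def compare(a, b):
    """Per-cell score difference (a minus b) over cells both profiles report."""
    shared = a.scores.keys() & b.scores.keys()
    return {cell: round(a.scores[cell] - b.scores[cell], 3) for cell in sorted(shared)}
```

Keeping the cells separate, rather than averaging them into one number, mirrors the review's point that a single aggregate can hide a weakness on one axis.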
Where Pith is reading between the lines
- The same multi-axis approach could be adapted to judge synthetic data for other agent types such as web navigation or code generation agents.
- Widespread adoption of SynAE-style checks might reduce the volume of real user data needed for testing and thereby improve privacy protections during agent development.
- Automated pipelines could optimize new synthetic data generators to maximize scores on the full set of SynAE metrics.
Load-bearing premise
The controlled and realistic generation schemes used to create test synthetic data accurately represent the common failure modes that occur when practitioners generate synthetic data for tool-calling agent evaluations in production settings.
What would settle it
A collection of synthetic datasets in which one metric correlates perfectly with all other validity, fidelity, diversity, and downstream performance measures would falsify the claim that multiple axes are required.
Original abstract
Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.
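To make the notion of tool-call validity concrete, here is one possible reading sketched in code. The abstract does not define the metric at this level of detail, so everything here is an assumption: `ToolCall`, `is_valid_call`, `validity_rate`, and the registry format (tool name mapped to required argument names and Python types) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation taken from an execution trace (hypothetical shape)."""
    name: str
    arguments: dict


def is_valid_call(call, registry):
    """Check a call against a registry of {tool name: {argument name: type}}."""
    spec = registry.get(call.name)
    if spec is None:
        return False  # the tool does not exist in the registry
    if set(call.arguments) != set(spec):
        return False  # an argument is missing or unexpected
    # Every argument must have the declared Python type.
    return all(isinstance(call.arguments[k], t) for k, t in spec.items())


def validity_rate(calls, registry):
    """Fraction of calls that pass is_valid_call; 0.0 for an empty trace."""
    calls = list(calls)
    if not calls:
        return 0.0
    return sum(is_valid_call(c, registry) for c in calls) / len(calls)
```

This checks only structural validity (known tool, expected arguments, expected types). It says nothing about fidelity or diversity, which is why a dataset can score well here and still be a poor stand-in for real traces.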
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SynAE, a framework for assessing synthetic data quality for multi-turn tool-calling agent evaluations. It defines metrics for validity, fidelity, and diversity across four categories—task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation—and evaluates the framework on recent agent benchmarks by applying controlled generation schemes that inject common failure modes. The central result is that SynAE detects fine-grained variations in these dimensions and that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation.
Significance. If the evaluation holds, SynAE offers a practical, multi-dimensional tool for practitioners who must rely on synthetic data when real execution traces are sparse or sensitive. The public demo and GitHub code are explicit strengths that support reproducibility and adoption. The work directly addresses a growing need in agent benchmarking and could influence how synthetic datasets are validated before deployment.
major comments (1)
- [§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.
minor comments (2)
- [Abstract] Abstract: The phrase 'recent agent benchmarks' is used without naming the specific datasets or citations; adding the exact benchmark names would improve traceability.
- [Results] Figure 3 (or equivalent results figure): The color scale and legend for the multi-metric heatmaps are difficult to read at standard print size; increasing font size or adding a supplementary table of raw values would aid interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of SynAE. We agree that additional evidence supporting the realism of the controlled generation schemes would strengthen the practical implications of the multi-axis evaluation recommendation. We will incorporate the suggested comparison in the revised manuscript.
Referee: [§4] §4 (Evaluation and Experiments): The claim that the controlled generation schemes are 'realistic' and capture common failure modes is load-bearing for the recommendation of multi-axis evaluation in practice. The manuscript does not report a side-by-side statistical comparison (e.g., distribution of tool-call validity rates, response diversity scores) between the injected schemes and synthetic data produced by standard LLM pipelines with typical prompting and temperature settings. Without this, it remains unclear whether SynAE’s sensitivity generalizes beyond the experimental construction.
Authors: We acknowledge the value of a direct statistical comparison to demonstrate that the injected failure modes align with those arising from standard LLM-based synthetic data generation. The schemes in the current manuscript were constructed from failure modes documented in prior agent evaluation literature and observed in our preliminary experiments with production-style traces. To address the concern, we will add a new analysis in the revised Section 4 (or an appendix) that generates parallel synthetic datasets using common LLM pipelines (e.g., zero-shot prompting with temperature 0.7 and 1.0) on the same underlying tasks. We will then report side-by-side distributions for key metrics such as tool-call validity rates and response diversity scores, allowing readers to assess how closely the controlled schemes match typical synthetic outputs.
Revision: yes
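A side-by-side comparison of this kind could be summarized with a two-sample distance between the metric distributions. The sketch below computes the two-sample Kolmogorov-Smirnov statistic with the standard library only. It is an illustration, not the analysis the authors promise: `ks_statistic` is a hypothetical helper, and the choice of the KS statistic over other distances is an assumption.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of xs and ys,
    a value in [0, 1]; 0 means the samples have identical empirical CDFs.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    # The empirical CDFs only change at sample points, so it is enough
    # to evaluate the gap there.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap
```

In use, `xs` would hold per-dataset tool-call validity rates from the controlled schemes and `ys` the same metric from a standard LLM pipeline; a small statistic would suggest the controlled schemes resemble typical synthetic outputs on that metric.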
Circularity Check
SynAE metrics defined independently; no reduction to inputs by construction
The paper defines SynAE as a multi-category metric framework (task instructions/responses, tool calls, final outputs, downstream evaluation) for validity/fidelity/diversity and applies it empirically to synthetic data created with injected failure modes via controlled generation schemes on existing agent benchmarks. The central result—that fine-grained variations are detected and no single metric suffices—is an observation from these experiments rather than a quantity fitted from the test data or derived tautologically from the same definitions. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation; the framework functions as an external assessment tool whose outputs are not forced by its own construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Real production datasets for tool-calling agents are often insufficient or unusable due to sensitivity or sparsity.
- domain assumption: Validity, fidelity, and diversity are the appropriate high-level dimensions for characterizing synthetic data quality in multi-turn tool-calling scenarios.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.