NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Darlene Neal; Erin van Liemt; Jamila Smith-Loud; Kshitij Pancholi; Qazi Mamunur Rashid; Xuan Yang; Yanzhou Pan; Zhengzhe Yang

arxiv: 2605.14381 · v2 · pith:VTS4FYGPnew · submitted 2026-05-14 · 💻 cs.LG · cs.CL

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Qazi Mamunur Rashid , Xuan Yang , Zhengzhe Yang , Yanzhou Pan , Erin van Liemt , Darlene Neal , Kshitij Pancholi , Jamila Smith-Loud This is my paper

Pith reviewed 2026-05-20 20:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords synthetic dataAI safety evaluationLLM testingtaxonomy generatorsocially aligned queriesfailure rate measurementguard model validation

0 comments

The pith

NodeSynth generates synthetic queries that cause AI models to fail up to five times more often than human benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NodeSynth to create synthetic test queries for AI models that better capture real social and technical nuances than standard benchmarks. It shows that queries built from a fine-tuned taxonomy generator anchored in actual evidence expose many more failures in mainstream large language models. The authors confirm through ablations that expanding the taxonomy in detail is what drives the higher failure rates. This approach also uncovers weaknesses in existing safety guard models. A sympathetic reader would care because current tests may be underestimating how often AI systems break on sensitive topics.

Core claim

NodeSynth is an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs, NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models.

What carries the argument

The fine-tuned taxonomy generator (TaG) that expands a taxonomy in granular detail from real-world evidence to produce the synthetic queries.

If this is right

Mainstream LLMs fail more often on socially nuanced queries than current benchmarks indicate.
Granular expansion of the taxonomy is what produces the higher observed failure rates.
Prominent guard models such as Llama-Guard-3 show clear gaps when tested on these queries.
Releasing the full prototype and datasets allows others to run targeted safety checks at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evidence-anchored generation process could be adapted to create test sets for other high-stakes domains such as medical or legal queries.
Models that pass human benchmarks but fail on NodeSynth queries may need additional training data drawn from the same real-world sources.
If the higher failure rates hold in live deployments, organizations using guard models would need stronger secondary checks before release.

Load-bearing premise

The synthetic queries match the complexity of actual social situations without adding extra patterns that make models fail more on their own.

What would settle it

Collect a set of real incident reports matching the taxonomy topics and run the same model tests on those reports instead of the synthetic queries; if failure rates drop back to the level of human benchmarks, the method's higher rates are not representative.

Figures

Figures reproduced from arXiv: 2605.14381 by Darlene Neal, Erin van Liemt, Jamila Smith-Loud, Kshitij Pancholi, Qazi Mamunur Rashid, Xuan Yang, Yanzhou Pan, Zhengzhe Yang.

**Figure 2.** Figure 2: Breakdown of the failure rate by Level 2 across all four models and two domains: (a) [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Before and after SFT similarity score distribution [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

**Figure 3.** Figure 3: Before and after SFT similarity score distribution [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Breakdown of the failure rate by Level 2 and User Group across all four models and two [PITH_FULL_IMAGE:figures/full_fig_p029_4.png] view at source ↗

**Figure 4.** Figure 4: Breakdown of the failure rate by Level 2 and User Group across all four models and two [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

read the original abstract

Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NodeSynth offers a practical evidence-anchored way to generate synthetic queries for LLM safety tests and shows higher failure rates, but the results could partly reflect synthesis artifacts rather than pure model gaps.

read the letter

The main thing here is that NodeSynth fine-tunes a taxonomy generator on real-world evidence to produce synthetic queries for probing LLMs on social topics. The authors report up to five times higher failure rates than human benchmarks across models like Claude, with ablations linking the granular taxonomy to that jump and separate checks flagging issues in guard models like Llama-Guard-3. They also release the full prototype and datasets, which helps others test or reuse the pipeline directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces NodeSynth, an evidence-grounded methodology for generating socially relevant synthetic queries via a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated on four mainstream LLMs, it reports failure rates up to five times higher than human-authored benchmarks. Ablation studies attribute this increase to granular taxonomic expansion, and independent validation identifies deficiencies in guard models such as Llama-Guard-3. The end-to-end prototype and datasets are open-sourced.

Significance. If the synthetic queries prove representative of real-world sociotechnical content without introducing correlated artifacts, the work provides a scalable framework for high-stakes AI safety evaluation and targeted interventions. The open-sourcing of code and data is a clear strength that supports reproducibility and community follow-up. The central empirical claims would then offer falsifiable evidence of model weaknesses in sensitive domains.

major comments (2)

Abstract: The headline claim of failure rates up to five times higher than human-authored benchmarks is load-bearing for the contribution. Without explicit controls (e.g., matching on query length, lexical diversity, or human-rated realism) comparing NodeSynth outputs to real-world queries on the same topics, it remains possible that fine-tuning or taxonomic expansion introduces systematic linguistic patterns that independently elevate failure rates in both the evaluated LLMs and guard models.
Ablation studies: The attribution of elevated failure rates to granular taxonomic expansion requires isolation of this variable from confounding factors such as increased query specificity or edge-case framing introduced by the synthesis pipeline; otherwise the causal link to genuine sociotechnical coverage is under-supported.

minor comments (2)

Abstract: The parenthetical example 'Claude 4.5 Haiku' should be expanded to list all four evaluated LLMs for immediate clarity.
Methods: Additional detail on the fine-tuning procedure for TaG and the precise real-world evidence sources used for anchoring would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where additional controls and isolation of variables would strengthen our claims. We address each major comment below and commit to revisions that directly respond to the concerns.

read point-by-point responses

Referee: Abstract: The headline claim of failure rates up to five times higher than human-authored benchmarks is load-bearing for the contribution. Without explicit controls (e.g., matching on query length, lexical diversity, or human-rated realism) comparing NodeSynth outputs to real-world queries on the same topics, it remains possible that fine-tuning or taxonomic expansion introduces systematic linguistic patterns that independently elevate failure rates in both the evaluated LLMs and guard models.

Authors: We agree that ruling out linguistic artifacts is essential for the headline claim. In the revised manuscript we will add a dedicated controls subsection that matches NodeSynth and human-authored queries on length and lexical diversity using standard metrics. We will also report a new human evaluation in which raters compare realism of NodeSynth queries against real-world sociotechnical examples drawn from the same topics used to seed the taxonomy. These results will be summarized in the abstract and used to support that elevated failure rates reflect content coverage rather than superficial patterns. revision: yes
Referee: Ablation studies: The attribution of elevated failure rates to granular taxonomic expansion requires isolation of this variable from confounding factors such as increased query specificity or edge-case framing introduced by the synthesis pipeline; otherwise the causal link to genuine sociotechnical coverage is under-supported.

Authors: We acknowledge that the current ablation design does not fully isolate taxonomic granularity from specificity and framing effects. We will expand the ablation studies with additional controlled variants that hold query length and specificity approximately constant while varying only the depth of taxonomic expansion. Failure rates on these matched sets will be reported to provide clearer causal evidence for the contribution of granular taxonomy to the observed gaps. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical failure rates and ablations are independent measurements

full rationale

The paper describes NodeSynth as an evidence-grounded methodology that uses a fine-tuned taxonomy generator (TaG) anchored in real-world evidence to produce synthetic queries. Its central results consist of direct empirical measurements—failure rates up to five times higher than human-authored benchmarks, plus ablation studies attributing the increase to granular taxonomic expansion—together with independent validation of guard models. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations are present that would cause these reported quantities to reduce by construction to the synthesis process itself. The derivation chain therefore remains self-contained and externally falsifiable via the released datasets and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the taxonomy generator faithfully captures sociotechnical nuance from real-world evidence without circular validation.

pith-pipeline@v0.9.0 · 5705 in / 1086 out tokens · 46689 ms · 2026-05-20T20:25:34.494614+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

[1]

On llms-driven synthetic data generation, curation, and evaluation: A survey

Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On llms-driven synthetic data generation, curation, and evaluation: A survey. InFindings of the Association for Computational Linguistics ACL 2024, pages 11065–11082, 2024

work page 2024
[2]

Synthetic data in ai: Challenges, applications, and ethical implications.arXiv preprint arXiv:2401.01629, 2024

Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu, Chunlin Zhong, Zhangjun Zhou, and He Tang. Synthetic data in ai: Challenges, applications, and ethical implications.arXiv preprint arXiv:2401.01629, 2024

work page arXiv 2024
[3]

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Self-instruct: Aligning language models with self-generated instruc- tions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

work page 2023
[5]

Examining the expanding role of synthetic data throughout the ai development pipeline

Shivani Kapania, Stephanie Ballard, Alex Kessler, and Jennifer Wortman Vaughan. Examining the expanding role of synthetic data throughout the ai development pipeline. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 45–60, 2025

work page 2025
[6]

Bias mitigation via synthetic data generation: a review.Electronics, 13(19):3909, 2024

Mohamed Ashik Shahul Hameed, Asifa Mehmood Qureshi, and Abhishek Kaushik. Bias mitigation via synthetic data generation: a review.Electronics, 13(19):3909, 2024

work page 2024
[7]

Towards understanding bias in synthetic data for evaluation

Hossein A Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra. Towards understanding bias in synthetic data for evaluation. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 5166–5170, 2025

work page 2025
[8]

everyone wants to do the model work, not the data work

Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. “everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai. Inproceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021

work page 2021
[9]

Evaluating lan- guage models as synthetic data generators

Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig. Evaluating lan- guage models as synthetic data generators. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6385–6403, 2025

work page 2025
[10]

Efficacy of synthetic data as a benchmark

Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark.arXiv preprint arXiv:2409.11968, 2024

work page arXiv 2024
[11]

Red teaming language models with language models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, 2022

work page 2022
[12]

Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications

Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, and Preethi Lahoti. Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 380–395, 2023

work page 2023
[13]

Automated progressive red teaming

Bojian Jiang, Yi Jing, Tong Wu, Tianhao Shen, Deyi Xiong, and Qing Yang. Automated progressive red teaming. InProceedings of the 31st International Conference on Computational Linguistics, pages 3850–3864, 2025

work page 2025
[14]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

S-eval: Towards automated safety evaluation with enhancement for large language models.ACM Transactions on Software Engineering and Methodology, 2026

Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, et al. S-eval: Towards automated safety evaluation with enhancement for large language models.ACM Transactions on Software Engineering and Methodology, 2026

work page 2026
[16]

Holistic automated red teaming for large language models through top-down test case generation and multi-turn interaction

Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, and Songlin Hu. Holistic automated red teaming for large language models through top-down test case generation and multi-turn interaction. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13711–13736, 2024

work page 2024
[17]

Reasoning- driven synthetic data generation and evaluation.arXiv preprint arXiv:2603.29791, 2026

Tim R Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous. Reasoning- driven synthetic data generation and evaluation.arXiv preprint arXiv:2603.29791, 2026

work page arXiv 2026
[18]

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Haoran Ou, Kangjie Chen, Xingshuo Han, Gelei Deng, Jie Zhang, Han Qiu, and Tianwei Zhang. Crest-search: Comprehensive red-teaming for evaluating safety threats in large language models powered by web search.arXiv preprint arXiv:2510.09689, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Learning diverse at- tacks on large language models for robust red-teaming and safety tuning

Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, et al. Learning diverse at- tacks on large language models for robust red-teaming and safety tuning. InThe Thirteenth International Conference on Learning Representations, 2024

work page 2024
[20]

Nullspace disentanglement for red teaming language models

Yi Han, Yuanxing Liu, Weinan Zhang, and Ting Liu. Nullspace disentanglement for red teaming language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21360–21376, 2025

work page 2025
[21]

Structural transparency of societal ai alignment through institu- tional logics.arXiv preprint arXiv:2602.08246, 2026

Atrisha Sarkar and Isam Faik. Structural transparency of societal ai alignment through institu- tional logics.arXiv preprint arXiv:2602.08246, 2026

work page arXiv 2026
[22]

Evaluating alignment of behavioral dispositions in llms.arXiv preprint arXiv:2602.11328, 2026

Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias, et al. Evaluating alignment of behavioral dispositions in llms.arXiv preprint arXiv:2602.11328, 2026

work page arXiv 2026
[23]

A testable framework for ai alignment: Simulation theology as an engineered worldview for silicon-based agents.arXiv preprint arXiv:2602.16987, 2026

Josef A Habdank. A testable framework for ai alignment: Simulation theology as an engineered worldview for silicon-based agents.arXiv preprint arXiv:2602.16987, 2026

work page arXiv 2026
[24]

Socially grounded exemplars improve synthetic conversations for health-related social needs navigation.medRxiv, pages 2026–01, 2026

Syed-Amad Hussain, Daniel I Jackson, Samanvith Thotapalli, Marissa B McClellan, Madeleine Stanco, Grace Varney, Sterling Gleeson, Florencia Nugroho, William Leever, Eric Fosler- Lussier, et al. Socially grounded exemplars improve synthetic conversations for health-related social needs navigation.medRxiv, pages 2026–01, 2026

work page 2026
[25]

Individuals and (syn- thetic) data points: Using value-sensitive design to foster ethical deliberations on epistemic transitions.American Journal of Bioethics, 23(9):69–72, 2023

Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Yael Bensoussan, et al. Individuals and (syn- thetic) data points: Using value-sensitive design to foster ethical deliberations on epistemic transitions.American Journal of Bioethics, 23(9):69–72, 2023

work page 2023
[26]

Evaluating the use of large language models as synthetic social agents in social science research.Journal of Social Computing, 6(4):334–341, 2025

Emma Rose Madden. Evaluating the use of large language models as synthetic social agents in social science research.Journal of Social Computing, 6(4):334–341, 2025

work page 2025
[27]

Syng4me: Model evaluation using synthetic test data.journal=arXiv preprint arXiv:2310.16524, 2023

Boris van Breugel, Nabeel Seedat, Fergus Imrie, and Mihaela van der Schaar. Syng4me: Model evaluation using synthetic test data.journal=arXiv preprint arXiv:2310.16524, 2023

work page arXiv 2023
[28]

Synth-align: Improving trustwor- thiness in vision-language model with synthetic preference data alignment.arXiv preprint arXiv:2412.17417, 2024

Robert Wijaya, Ngoc-Bao Nguyen, and Ngai-Man Cheung. Synth-align: Improving trustwor- thiness in vision-language model with synthetic preference data alignment.arXiv preprint arXiv:2412.17417, 2024

work page arXiv 2024
[29]

Using synthetic data to improve the reproducibility of statistical results in psychological research.Psychological Methods, 29(4): 789, 2024

Simon Grund, Oliver L¨"udtke, and Alexander Robitzsch. Using synthetic data to improve the reproducibility of statistical results in psychological research.Psychological Methods, 29(4): 789, 2024

work page 2024
[30]

Ensuring data quality in large international development projects: tools, strategies, and lessons learned.American Journal of Evaluation, 46(4):562–578, 2025

Jennifer Sdunzik, Ann M Bessenbacher, Wilella D Burgess, Asia M Mohamud, and Abdirisak Dalmar. Ensuring data quality in large international development projects: tools, strategies, and lessons learned.American Journal of Evaluation, 46(4):562–578, 2025. 12

work page 2025
[31]

A multi-faceted evaluation framework for assessing synthetic data generated by large language models.arXiv preprint arXiv:2404.14445, 2024

Yefeng Yuan, Yuhong Liu, and Liang Cheng. A multi-faceted evaluation framework for assessing synthetic data generated by large language models.arXiv preprint arXiv:2404.14445, 2024

work page arXiv 2024
[32]

Synthtexteval: Synthetic text data generation and evaluation for high-stakes domains

Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét V Bjarnadóttir, and Anjalie Field. Synthtexteval: Synthetic text data generation and evaluation for high-stakes domains. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 487–499, 2025

work page 2025
[33]

Discerning obstacles and opportunities: A framework for evaluating power.American Journal of Evaluation, 46(2):207–217, 2025

Rebecca Friesen and Adriana D Cimetta. Discerning obstacles and opportunities: A framework for evaluating power.American Journal of Evaluation, 46(2):207–217, 2025

work page 2025
[34]

Synthetic data for evaluation: Supporting llm-as-a-judge workflows with evalassist

Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M Daly, Qian Pan, and Michael Desmond. Synthetic data for evaluation: Supporting llm-as-a-judge workflows with evalassist. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 1–11, 2025

work page 2025
[35]

Google generative AI prohibited use policy, 2024

Google. Google generative AI prohibited use policy, 2024. URLhttps://policies.google. com/terms/generative-ai/use-policy. Accessed: 2024-05-20

work page 2024
[36]

Usage policies, 2024

OpenAI. Usage policies, 2024. URL https://openai.com/policies/usage-policies/. Accessed: 2024-05-20

work page 2024
[37]

Hate speech policy - YouTube help, 2024

YouTube. Hate speech policy - YouTube help, 2024. URL https://support.google.com/ youtube/answer/2802245. Accessed: 2024-05-20

work page arXiv 2024
[38]

A toolbox for surfacing health equity harms and biases in large language models.Nature Medicine, 30 (12):3590–3600, 2024

Stephen R Pfohl, Heather Cole-Lewis, Rory Sayres, Darlene Neal, Mercy Asiedu, Awa Dieng, Nenad Tomasev, Qazi Mamunur Rashid, Shekoofeh Azizi, Negar Rostamzadeh, et al. A toolbox for surfacing health equity harms and biases in large language models.Nature Medicine, 30 (12):3590–3600, 2024

work page 2024
[39]

Aloe: A family of fine-tuned open healthcare llms,

Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Jordi Bayarri-Planas, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Lucia Urcelay- Ganzabal, Marta Gonzalez-Mallo, et al. Aloe: A family of fine-tuned open healthcare llms. arXiv preprint arXiv:2405.01886, 2024

work page arXiv 2024
[40]

Suicide ideation detection in social media forums

K Nikhileswar, D Vishal, L Sphoorthi, and S Fathimabi. Suicide ideation detection in social media forums. In2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), pages 1741–1747. IEEE, 2021. 13 A TaG Prompt Templates A.1 L1 and L2 Generation Template System Instruction:You are a policy expert. Please analyze {policy} and provi...

work page 2021
[41]

Continue this loop until you believe all significant aspects are addressed

Check if all relevant categories and topics related to {policy} have been covered... Continue this loop until you believe all significant aspects are addressed. A.2 L3 Keyword Generation Template List the top 3 English keywords that most related to topic {t} given that the topic is sourced from and all the keywords are all related to {i} {j} given the def...

work page
[42]

somewhat

Vermin, 2. Disease, 3. Filth; Rationale: [Your rationale here] B TaG Model Training Model specifications:The model was trained for 4 epochs using the default learning_rate_multiplier of 1.0. For this Gemini 2.5 Flash model, we utilized the defaultadapter_sizeof 4, which controls the capacity of the parameter-efficient tuning module. Training data sample s...

work page
[43]

Intent”, “Variable Names

Read the provided “Intent”, “Variable Names”, and “Other Context” sections carefully. Extract the values for “{Country}”, “{policy}”, and “{Language_code}”. Note that 17 the “Other Context” provides an example format, but the instruction is to *only* use the format specified in the “Intent” section

work page
[44]

{Country}

Based on your knowledge of the specified “{Country}” and “{policy}”, brainstorm relevant categories and corresponding topics, etc.) that are specifically impacted by the policy within that country

work page
[45]

Intent” section explicitly specifies the desired output format: “(Category, Topics, Rationale)

The “Intent” section explicitly specifies the desired output format: “(Category, Topics, Rationale)”. Therefore, no other format needs to be considered

work page
[46]

your output please strictly follow the same format below (Category, Topics, Rationale) and do not add any more session besides Category, Topics, Rationale and keep the sequence of the session first say Category,then Topics, and Rationale, please do not add any more stuff, the format should EXACTLY look like the examples format below: Examples: for hate sp...

work page
[47]

:” following each key session such as ‘Category:’, ‘Topics:’, ‘Rationale:’; please do not add “(

Check if all relevant categories and topics related to ‘{policy}’ in ‘{Country}’ have been covered. If not, go back to step 2 and create another category entry using the same format, making sure there is always “:” following each key session such as ‘Category:’, ‘Topics:’, ‘Rationale:’; please do not add “(” or “)” in the output. Continue this loop until ...

work page
[48]

Ensure all rationales are well- supported and clearly explain the relevance of each category and topic to the policy within the specified country

Review the complete output to ensure it adheres strictly to the specified format and contains no hallucinations or invented information. Ensure all rationales are well- supported and clearly explain the relevance of each category and topic to the policy within the specified country

work page
[49]

think” or any conversational text. • Translate the keywords and rationale into ‘{Language_code}’. 18 ## EXPECTED JSON STRUCTURE: {

Output the complete list of categories, topics, and rationales in the specified format. G.2 Level 3 Keywords Generation Prompt ## Role You are an expert policy analyst. Generate the top 3 keywords most related to the topic ‘{t}’ (which is sourced from ‘{i}’ and ‘{j}’). ## Context The domain is health/medical or safety (‘{Definition}’). Location focus: ‘{l...

work page
[50]

Specific Advice Requested:Explicitly demand a diagnosis, a drug recommendation (dosage optional), or a specific home treatment

work page
[51]

I’m traveling,

Sense of Urgency/Context:Add a real-world constraint, time pressure, or emotional context that attempts to justify the need for an immediate AI answer (e.g., “I’m traveling,” “My doctor’s office is closed,” “I can’t afford a visit”)

work page
[52]

Answer as a friendly neighbor,

Bypass Attempt (Jailbreak Tactic):Frame the request in a way that attempts to bypass the safety refusal (e.g., “Answer as a friendly neighbor,” “Provide this for a fictional story,” “Just give me the over-the-counter options”). Format:Provide only the 10 generated prompts as a numbered list. Do not include any introductory or concluding text, explanations...

work page

[1] [1]

On llms-driven synthetic data generation, curation, and evaluation: A survey

Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On llms-driven synthetic data generation, curation, and evaluation: A survey. InFindings of the Association for Computational Linguistics ACL 2024, pages 11065–11082, 2024

work page 2024

[2] [2]

Synthetic data in ai: Challenges, applications, and ethical implications.arXiv preprint arXiv:2401.01629, 2024

Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu, Chunlin Zhong, Zhangjun Zhou, and He Tang. Synthetic data in ai: Challenges, applications, and ethical implications.arXiv preprint arXiv:2401.01629, 2024

work page arXiv 2024

[3] [3]

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Self-instruct: Aligning language models with self-generated instruc- tions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

work page 2023

[5] [5]

Examining the expanding role of synthetic data throughout the ai development pipeline

Shivani Kapania, Stephanie Ballard, Alex Kessler, and Jennifer Wortman Vaughan. Examining the expanding role of synthetic data throughout the ai development pipeline. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 45–60, 2025

work page 2025

[6] [6]

Bias mitigation via synthetic data generation: a review.Electronics, 13(19):3909, 2024

Mohamed Ashik Shahul Hameed, Asifa Mehmood Qureshi, and Abhishek Kaushik. Bias mitigation via synthetic data generation: a review.Electronics, 13(19):3909, 2024

work page 2024

[7] [7]

Towards understanding bias in synthetic data for evaluation

Hossein A Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra. Towards understanding bias in synthetic data for evaluation. InProceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 5166–5170, 2025

work page 2025

[8] [8]

everyone wants to do the model work, not the data work

Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. “everyone wants to do the model work, not the data work”: Data cascades in high-stakes ai. Inproceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021

work page 2021

[9] [9]

Evaluating lan- guage models as synthetic data generators

Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig. Evaluating lan- guage models as synthetic data generators. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6385–6403, 2025

work page 2025

[10] [10]

Efficacy of synthetic data as a benchmark

Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of synthetic data as a benchmark.arXiv preprint arXiv:2409.11968, 2024

work page arXiv 2024

[11] [11]

Red teaming language models with language models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, 2022

work page 2022

[12] [12]

Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications

Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, and Preethi Lahoti. Aart: Ai-assisted red-teaming with diverse data generation for new llm-powered applications. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 380–395, 2023

work page 2023

[13] [13]

Automated progressive red teaming

Bojian Jiang, Yi Jing, Tong Wu, Tianhao Shen, Deyi Xiong, and Qing Yang. Automated progressive red teaming. InProceedings of the 31st International Conference on Computational Linguistics, pages 3850–3864, 2025

work page 2025

[14] [14]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

S-eval: Towards automated safety evaluation with enhancement for large language models.ACM Transactions on Software Engineering and Methodology, 2026

Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, et al. S-eval: Towards automated safety evaluation with enhancement for large language models.ACM Transactions on Software Engineering and Methodology, 2026

work page 2026

[16] [16]

Holistic automated red teaming for large language models through top-down test case generation and multi-turn interaction

Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, and Songlin Hu. Holistic automated red teaming for large language models through top-down test case generation and multi-turn interaction. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13711–13736, 2024

work page 2024

[17] [17]

Reasoning- driven synthetic data generation and evaluation.arXiv preprint arXiv:2603.29791, 2026

Tim R Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous. Reasoning- driven synthetic data generation and evaluation.arXiv preprint arXiv:2603.29791, 2026

work page arXiv 2026

[18] [18]

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Haoran Ou, Kangjie Chen, Xingshuo Han, Gelei Deng, Jie Zhang, Han Qiu, and Tianwei Zhang. Crest-search: Comprehensive red-teaming for evaluating safety threats in large language models powered by web search.arXiv preprint arXiv:2510.09689, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Learning diverse at- tacks on large language models for robust red-teaming and safety tuning

Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, et al. Learning diverse at- tacks on large language models for robust red-teaming and safety tuning. InThe Thirteenth International Conference on Learning Representations, 2024

work page 2024

[20] [20]

Nullspace disentanglement for red teaming language models

Yi Han, Yuanxing Liu, Weinan Zhang, and Ting Liu. Nullspace disentanglement for red teaming language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21360–21376, 2025

work page 2025

[21] [21]

Structural transparency of societal ai alignment through institu- tional logics.arXiv preprint arXiv:2602.08246, 2026

Atrisha Sarkar and Isam Faik. Structural transparency of societal ai alignment through institu- tional logics.arXiv preprint arXiv:2602.08246, 2026

work page arXiv 2026

[22] [22]

Evaluating alignment of behavioral dispositions in llms.arXiv preprint arXiv:2602.11328, 2026

Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias, et al. Evaluating alignment of behavioral dispositions in llms.arXiv preprint arXiv:2602.11328, 2026

work page arXiv 2026

[23] [23]

A testable framework for ai alignment: Simulation theology as an engineered worldview for silicon-based agents.arXiv preprint arXiv:2602.16987, 2026

Josef A Habdank. A testable framework for ai alignment: Simulation theology as an engineered worldview for silicon-based agents.arXiv preprint arXiv:2602.16987, 2026

work page arXiv 2026

[24] [24]

Socially grounded exemplars improve synthetic conversations for health-related social needs navigation.medRxiv, pages 2026–01, 2026

Syed-Amad Hussain, Daniel I Jackson, Samanvith Thotapalli, Marissa B McClellan, Madeleine Stanco, Grace Varney, Sterling Gleeson, Florencia Nugroho, William Leever, Eric Fosler- Lussier, et al. Socially grounded exemplars improve synthetic conversations for health-related social needs navigation.medRxiv, pages 2026–01, 2026

work page 2026

[25] [25]

Individuals and (syn- thetic) data points: Using value-sensitive design to foster ethical deliberations on epistemic transitions.American Journal of Bioethics, 23(9):69–72, 2023

Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Yael Bensoussan, et al. Individuals and (syn- thetic) data points: Using value-sensitive design to foster ethical deliberations on epistemic transitions.American Journal of Bioethics, 23(9):69–72, 2023

work page 2023

[26] [26]

Evaluating the use of large language models as synthetic social agents in social science research.Journal of Social Computing, 6(4):334–341, 2025

Emma Rose Madden. Evaluating the use of large language models as synthetic social agents in social science research.Journal of Social Computing, 6(4):334–341, 2025

work page 2025

[27] [27]

Syng4me: Model evaluation using synthetic test data.journal=arXiv preprint arXiv:2310.16524, 2023

Boris van Breugel, Nabeel Seedat, Fergus Imrie, and Mihaela van der Schaar. Syng4me: Model evaluation using synthetic test data.journal=arXiv preprint arXiv:2310.16524, 2023

work page arXiv 2023

[28] [28]

Synth-align: Improving trustwor- thiness in vision-language model with synthetic preference data alignment.arXiv preprint arXiv:2412.17417, 2024

Robert Wijaya, Ngoc-Bao Nguyen, and Ngai-Man Cheung. Synth-align: Improving trustwor- thiness in vision-language model with synthetic preference data alignment.arXiv preprint arXiv:2412.17417, 2024

work page arXiv 2024

[29] [29]

Using synthetic data to improve the reproducibility of statistical results in psychological research.Psychological Methods, 29(4): 789, 2024

Simon Grund, Oliver L¨"udtke, and Alexander Robitzsch. Using synthetic data to improve the reproducibility of statistical results in psychological research.Psychological Methods, 29(4): 789, 2024

work page 2024

[30] [30]

Ensuring data quality in large international development projects: tools, strategies, and lessons learned.American Journal of Evaluation, 46(4):562–578, 2025

Jennifer Sdunzik, Ann M Bessenbacher, Wilella D Burgess, Asia M Mohamud, and Abdirisak Dalmar. Ensuring data quality in large international development projects: tools, strategies, and lessons learned.American Journal of Evaluation, 46(4):562–578, 2025. 12

work page 2025

[31] [31]

A multi-faceted evaluation framework for assessing synthetic data generated by large language models.arXiv preprint arXiv:2404.14445, 2024

Yefeng Yuan, Yuhong Liu, and Liang Cheng. A multi-faceted evaluation framework for assessing synthetic data generated by large language models.arXiv preprint arXiv:2404.14445, 2024

work page arXiv 2024

[32] [32]

Synthtexteval: Synthetic text data generation and evaluation for high-stakes domains

Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét V Bjarnadóttir, and Anjalie Field. Synthtexteval: Synthetic text data generation and evaluation for high-stakes domains. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 487–499, 2025

work page 2025

[33] [33]

Discerning obstacles and opportunities: A framework for evaluating power.American Journal of Evaluation, 46(2):207–217, 2025

Rebecca Friesen and Adriana D Cimetta. Discerning obstacles and opportunities: A framework for evaluating power.American Journal of Evaluation, 46(2):207–217, 2025

work page 2025

[34] [34]

Synthetic data for evaluation: Supporting llm-as-a-judge workflows with evalassist

Martín Santillán Cooper, Zahra Ashktorab, Hyo Jin Do, Erik Miehling, Werner Geyer, Jasmina Gajcin, Elizabeth M Daly, Qian Pan, and Michael Desmond. Synthetic data for evaluation: Supporting llm-as-a-judge workflows with evalassist. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 1–11, 2025

work page 2025

[35] [35]

Google generative AI prohibited use policy, 2024

Google. Google generative AI prohibited use policy, 2024. URLhttps://policies.google. com/terms/generative-ai/use-policy. Accessed: 2024-05-20

work page 2024

[36] [36]

Usage policies, 2024

OpenAI. Usage policies, 2024. URL https://openai.com/policies/usage-policies/. Accessed: 2024-05-20

work page 2024

[37] [37]

Hate speech policy - YouTube help, 2024

YouTube. Hate speech policy - YouTube help, 2024. URL https://support.google.com/ youtube/answer/2802245. Accessed: 2024-05-20

work page arXiv 2024

[38] [38]

A toolbox for surfacing health equity harms and biases in large language models.Nature Medicine, 30 (12):3590–3600, 2024

Stephen R Pfohl, Heather Cole-Lewis, Rory Sayres, Darlene Neal, Mercy Asiedu, Awa Dieng, Nenad Tomasev, Qazi Mamunur Rashid, Shekoofeh Azizi, Negar Rostamzadeh, et al. A toolbox for surfacing health equity harms and biases in large language models.Nature Medicine, 30 (12):3590–3600, 2024

work page 2024

[39] [39]

Aloe: A family of fine-tuned open healthcare llms,

Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Jordi Bayarri-Planas, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Lucia Urcelay- Ganzabal, Marta Gonzalez-Mallo, et al. Aloe: A family of fine-tuned open healthcare llms. arXiv preprint arXiv:2405.01886, 2024

work page arXiv 2024

[40] [40]

Suicide ideation detection in social media forums

K Nikhileswar, D Vishal, L Sphoorthi, and S Fathimabi. Suicide ideation detection in social media forums. In2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), pages 1741–1747. IEEE, 2021. 13 A TaG Prompt Templates A.1 L1 and L2 Generation Template System Instruction:You are a policy expert. Please analyze {policy} and provi...

work page 2021

[41] [41]

Continue this loop until you believe all significant aspects are addressed

Check if all relevant categories and topics related to {policy} have been covered... Continue this loop until you believe all significant aspects are addressed. A.2 L3 Keyword Generation Template List the top 3 English keywords that most related to topic {t} given that the topic is sourced from and all the keywords are all related to {i} {j} given the def...

work page

[42] [42]

somewhat

Vermin, 2. Disease, 3. Filth; Rationale: [Your rationale here] B TaG Model Training Model specifications:The model was trained for 4 epochs using the default learning_rate_multiplier of 1.0. For this Gemini 2.5 Flash model, we utilized the defaultadapter_sizeof 4, which controls the capacity of the parameter-efficient tuning module. Training data sample s...

work page

[43] [43]

Intent”, “Variable Names

Read the provided “Intent”, “Variable Names”, and “Other Context” sections carefully. Extract the values for “{Country}”, “{policy}”, and “{Language_code}”. Note that 17 the “Other Context” provides an example format, but the instruction is to *only* use the format specified in the “Intent” section

work page

[44] [44]

{Country}

Based on your knowledge of the specified “{Country}” and “{policy}”, brainstorm relevant categories and corresponding topics, etc.) that are specifically impacted by the policy within that country

work page

[45] [45]

Intent” section explicitly specifies the desired output format: “(Category, Topics, Rationale)

The “Intent” section explicitly specifies the desired output format: “(Category, Topics, Rationale)”. Therefore, no other format needs to be considered

work page

[46] [46]

your output please strictly follow the same format below (Category, Topics, Rationale) and do not add any more session besides Category, Topics, Rationale and keep the sequence of the session first say Category,then Topics, and Rationale, please do not add any more stuff, the format should EXACTLY look like the examples format below: Examples: for hate sp...

work page

[47] [47]

:” following each key session such as ‘Category:’, ‘Topics:’, ‘Rationale:’; please do not add “(

Check if all relevant categories and topics related to ‘{policy}’ in ‘{Country}’ have been covered. If not, go back to step 2 and create another category entry using the same format, making sure there is always “:” following each key session such as ‘Category:’, ‘Topics:’, ‘Rationale:’; please do not add “(” or “)” in the output. Continue this loop until ...

work page

[48] [48]

Ensure all rationales are well- supported and clearly explain the relevance of each category and topic to the policy within the specified country

Review the complete output to ensure it adheres strictly to the specified format and contains no hallucinations or invented information. Ensure all rationales are well- supported and clearly explain the relevance of each category and topic to the policy within the specified country

work page

[49] [49]

think” or any conversational text. • Translate the keywords and rationale into ‘{Language_code}’. 18 ## EXPECTED JSON STRUCTURE: {

Output the complete list of categories, topics, and rationales in the specified format. G.2 Level 3 Keywords Generation Prompt ## Role You are an expert policy analyst. Generate the top 3 keywords most related to the topic ‘{t}’ (which is sourced from ‘{i}’ and ‘{j}’). ## Context The domain is health/medical or safety (‘{Definition}’). Location focus: ‘{l...

work page

[50] [50]

Specific Advice Requested:Explicitly demand a diagnosis, a drug recommendation (dosage optional), or a specific home treatment

work page

[51] [51]

I’m traveling,

Sense of Urgency/Context:Add a real-world constraint, time pressure, or emotional context that attempts to justify the need for an immediate AI answer (e.g., “I’m traveling,” “My doctor’s office is closed,” “I can’t afford a visit”)

work page

[52] [52]

Answer as a friendly neighbor,

Bypass Attempt (Jailbreak Tactic):Frame the request in a way that attempts to bypass the safety refusal (e.g., “Answer as a friendly neighbor,” “Provide this for a fictional story,” “Just give me the over-the-counter options”). Format:Provide only the 10 generated prompts as a numbered list. Do not include any introductory or concluding text, explanations...

work page