Best practices and lessons learned on synthetic data

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M · 2024 · arXiv 2404.07503

12 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

cs.IR · 2026-05-29 · unverdicted · novelty 6.0

SCALR generates synthetic cross-domain events to augment recommendation training data and reports statistically significant improvements in industrial A/B tests.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection

cs.CR · 2025-06-06 · unverdicted · novelty 5.0

PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.

How Far Are We from Generating Missing Modalities with Foundation Models?

cs.MM · 2025-06-04 · unverdicted · novelty 5.0

Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

cs.CL · 2024-06-17 · unverdicted · novelty 5.0

Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.

Multi-Model Synthetic Training for Mission-Critical Small Language Models

cs.CL · 2025-09-16 · unverdicted · novelty 4.0

Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

cs.LG · 2025-01-03 · unverdicted · novelty 3.0

CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

cs.AI · 2026-06-30 · unverdicted · none · ref 20

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

cs.IR · 2026-05-29 · unverdicted · none · ref 5

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · none · ref 42
