Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.
hub
Best practices and lessons learned on synthetic data
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 12roles
background 2polarities
background 2representative citing papers
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
SCALR generates synthetic cross-domain events to augment recommendation training data and reports statistically significant improvements in industrial A/B tests.
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
citing papers explorer
-
Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA
Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.
-
Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems
SCALR generates synthetic cross-domain events to augment recommendation training data and reports statistically significant improvements in industrial A/B tests.
-
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.