hub

Best practices and lessons learned on synthetic data,

· 2024 · arXiv 2404.07503

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection

cs.CR · 2025-06-06 · unverdicted · novelty 5.0

PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.

How Far Are We from Generating Missing Modalities with Foundation Models?

cs.MM · 2025-06-04 · unverdicted · novelty 5.0

Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

cs.CL · 2024-06-17 · unverdicted · novelty 5.0

Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.

Multi-Model Synthetic Training for Mission-Critical Small Language Models

cs.CL · 2025-09-16 · unverdicted · novelty 4.0

Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

cs.LG · 2025-01-03 · unverdicted · novelty 3.0

CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

citing papers explorer

Showing 10 of 10 citing papers.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing cs.CL · 2024-06-12 · unverdicted · none · ref 128
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models cs.MM · 2026-04-25 · unverdicted · none · ref 42
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 112
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection cs.CR · 2025-06-06 · unverdicted · none · ref 43
PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.
How Far Are We from Generating Missing Modalities with Foundation Models? cs.MM · 2025-06-04 · unverdicted · none · ref 66
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression cs.CL · 2024-06-17 · unverdicted · none · ref 36
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
Multi-Model Synthetic Training for Mission-Critical Small Language Models cs.CL · 2025-09-16 · unverdicted · none · ref 6
Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 72
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation cs.LG · 2025-01-03 · unverdicted · none · ref 32
CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 169
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

Best practices and lessons learned on synthetic data,

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer