Dynamic Context Evolution for Scalable Synthetic Data Generation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3
The pith
Dynamic Context Evolution eliminates cross-batch mode collapse in LLM synthetic data generation through self-filtering, memory, and prompt adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dynamic Context Evolution prevents the progressive loss of output diversity in repeated prompting by maintaining a dynamic memory of prior generations and evolving the context accordingly. It consists of three integrated mechanisms: verbalized tail sampling, which has the model rate its own ideas for obviousness and discards the obvious ones; semantic memory, which uses embeddings to reject duplicates across batches; and adaptive prompt evolution, which rebuilds prompts from the memory state and rotating strategies. The result is a zero collapse rate and reliably more distinct conceptual clusters than standard approaches.
What carries the argument
Dynamic Context Evolution (DCE), a framework that integrates model self-assessment for filtering, persistent semantic indexing for deduplication, and state-dependent prompt reconstruction to sustain diversity over multiple generation batches.
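The reviewed material does not include reference code, so the following is a minimal sketch of how the three mechanisms might compose within a single batch, assuming caller-supplied LLM calls. The class name, prompt wording, and default thresholds are illustrative assumptions; only the roles of the three mechanisms, the all-MiniLM-L6-v2 embedder, and the existence of thresholds tau and delta come from the abstract.

```python
# Illustrative DCE-style loop: VTS filter + semantic memory + adaptive prompt evolution.
# generate() and rate_obviousness() are hypothetical stand-ins for the LLM API calls.
from typing import Callable

import numpy as np
from sentence_transformers import SentenceTransformer


class DCESketch:
    def __init__(self, generate: Callable[[str], list[str]],
                 rate_obviousness: Callable[[str], float],
                 tau: float = 0.7, delta: float = 0.85):
        self.generate = generate                  # caller-supplied LLM call returning candidate ideas
        self.rate_obviousness = rate_obviousness  # caller-supplied self-assessment call in [0, 1]
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedder named in the paper
        self.memory_texts: list[str] = []         # persistent semantic memory across batches
        self.memory_embs: np.ndarray | None = None
        self.tau, self.delta = tau, delta         # VTS and dedup thresholds (illustrative defaults)

    def run_batch(self, base_task: str, strategy: str) -> list[str]:
        # (3) adaptive prompt evolution: rebuild the prompt from memory state plus a rotating strategy
        prompt = (f"{base_task}\nStrategy for this batch: {strategy}\n"
                  f"Already covered, avoid near-repeats of: {'; '.join(self.memory_texts[-20:])}")
        kept = []
        for idea in self.generate(prompt):
            # (1) verbalized tail sampling: drop ideas the model itself rates as too obvious
            if self.rate_obviousness(idea) >= self.tau:
                continue
            # (2) semantic memory: reject near-duplicates of anything kept in prior batches
            emb = self.embedder.encode([idea], normalize_embeddings=True)
            if self.memory_embs is not None and float((self.memory_embs @ emb.T).max()) >= self.delta:
                continue
            kept.append(idea)
            self.memory_texts.append(idea)
            self.memory_embs = emb if self.memory_embs is None else np.vstack([self.memory_embs, emb])
        return kept
```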
If this is right
- Deduplication via semantic memory and adaptive prompt evolution must be used together to achieve low collapse rates.
- The method works without fine-tuning or specialized architectures, relying only on standard API calls.
- Consistent conceptual richness is observed across different domains and model families.
- Results remain stable across variations in the tail-sampling threshold and deduplication threshold.
Where Pith is reading between the lines
- The technique could be generalized to maintain diversity in other multi-turn or batch-based AI generation tasks beyond synthetic data.
- Reliance on an independent embedding model for validation suggests that the diversity gains are not artifacts of the generation process itself.
- Longer-term application might allow for even greater accumulation of unique ideas over extended generation runs.
- Combining DCE with other sampling methods like temperature adjustment could yield additive benefits.
Load-bearing premise
The model's self-judgment of how obvious an idea is accurately reflects its likelihood of being generated repeatedly, and the HDBSCAN clustering on embeddings from an independent model captures genuine conceptual differences without being skewed by how the data was produced.
What would settle it
Conducting a controlled generation run producing thousands of outputs with and without DCE, followed by clustering with multiple different embedding models and algorithms, and checking if the diversity gap persists or disappears.
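A minimal sketch of that check, assuming the sentence-transformers and hdbscan libraries: the second embedding model and the min_cluster_size default are arbitrary placeholders, since the paper's clustering settings are not reported in the material reviewed here.

```python
# Re-cluster the same outputs under several embedding models and compare cluster counts;
# the DCE-vs-naive gap should persist if it reflects genuine conceptual diversity.
import hdbscan
from sentence_transformers import SentenceTransformer

EMBEDDING_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # second model is an arbitrary alternative


def cluster_counts(ideas: list[str], min_cluster_size: int = 5) -> dict[str, int]:
    counts = {}
    for name in EMBEDDING_MODELS:
        embs = SentenceTransformer(name).encode(ideas, normalize_embeddings=True)
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                 metric="euclidean").fit_predict(embs)
        counts[name] = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label (-1)
    return counts

# Usage: compare cluster_counts(dce_outputs) against cluster_counts(naive_outputs) per seed.
```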
Original abstract
Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dynamic Context Evolution (DCE), a prompting framework to mitigate cross-batch mode collapse in LLMs for synthetic data generation. DCE combines verbalized tail sampling (model self-labels and discards obvious ideas), semantic memory (persistent embedding index for cross-batch deduplication), and adaptive prompt evolution (reconstructs prompts using memory and rotating strategies). Across three domains and two model families, component ablations (2-3 seeds) report 0.0% collapse for DCE versus 5.6% for naive prompting, with stable 17-18 HDBSCAN clusters versus naive's volatile range, validated via independent all-MiniLM-L6-v2 embeddings and sensitivity sweeps on thresholds tau and delta; the approach requires only standard API calls at ~$0.50 per 1,000 candidates.
Significance. If the results hold, DCE supplies a practical, zero-training method for scalable diverse synthetic data that directly addresses a widespread LLM limitation. The component ablation, multi-seed reporting, sensitivity analysis on free parameters, and use of an independent embedding model for validation provide concrete empirical grounding and make the framework immediately usable by practitioners.
major comments (2)
- [Experiments] Experiments section: the diversity claim (17-18 stable HDBSCAN clusters) risks circularity because semantic memory explicitly maintains an embedding index to reject near-duplicates; post-hoc evaluation applies the same all-MiniLM-L6-v2 embeddings plus HDBSCAN, so higher cluster counts may be a direct consequence of the deduplication step spreading points in embedding space rather than evidence of independent conceptual richness. A control that disables semantic memory while keeping VTS and prompt evolution would isolate the effect.
- [Methods and Experiments] Methods and Experiments: the exact operational definition of the collapse metric (reported as 0.0 +/- 0.0%), the full prompt templates used for verbalized tail sampling and adaptive prompt evolution, and the HDBSCAN hyperparameters (e.g., min_cluster_size, metric) are not provided. These details are load-bearing for reproducing the ablation that shows DCE superiority and for confirming that the independent embedding validation is unaffected by implementation choices.
minor comments (1)
- [Abstract] Abstract: the cost figure of approximately $0.50 per 1,000 candidates should include the precise token counts, model pricing, and batch sizes used in the calculation for transparency.
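One way to make that figure auditable is to report the inputs to a simple per-candidate decomposition; the sketch below uses placeholder arguments only, since none of the underlying token counts, batch sizes, or prices appear in the reviewed material.

```python
# Hypothetical decomposition of a per-1,000-candidate cost figure; every argument is a
# placeholder to be filled with the paper's actual token counts, batch sizes, and pricing.
def cost_per_1000_candidates(input_tokens_per_batch: float, output_tokens_per_batch: float,
                             candidates_per_batch: float,
                             price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    cost_per_batch = (input_tokens_per_batch * price_in_per_mtok +
                      output_tokens_per_batch * price_out_per_mtok) / 1e6
    return 1000.0 * cost_per_batch / candidates_per_batch
```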
Simulated Author's Rebuttal
We are grateful to the referee for the careful reading and constructive suggestions. Below we respond to each major comment in turn. We have made revisions to the manuscript to address the concerns raised regarding experimental controls and missing implementation details.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the diversity claim (17-18 stable HDBSCAN clusters) risks circularity because semantic memory explicitly maintains an embedding index to reject near-duplicates; post-hoc evaluation applies the same all-MiniLM-L6-v2 embeddings plus HDBSCAN, so higher cluster counts may be a direct consequence of the deduplication step spreading points in embedding space rather than evidence of independent conceptual richness. A control that disables semantic memory while keeping VTS and prompt evolution would isolate the effect.
Authors: We agree that isolating whether the stable cluster counts arise independently of the deduplication mechanism is important for strengthening the diversity claims. While the post-hoc evaluation explicitly uses an independent embedding model (all-MiniLM-L6-v2) for validation, as stated in the manuscript, we will add the suggested control experiment that disables semantic memory while retaining verbalized tail sampling and adaptive prompt evolution. This ablation will be reported in the revised Experiments section alongside the existing component ablations (which already indicate that deduplication and prompt evolution are individually insufficient but jointly effective) to demonstrate the synergistic contribution of all three mechanisms. revision: yes
-
Referee: [Methods and Experiments] Methods and Experiments: the exact operational definition of the collapse metric (reported as 0.0 +/- 0.0%), the full prompt templates used for verbalized tail sampling and adaptive prompt evolution, and the HDBSCAN hyperparameters (e.g., min_cluster_size, metric) are not provided. These details are load-bearing for reproducing the ablation that shows DCE superiority and for confirming that the independent embedding validation is unaffected by implementation choices.
Authors: We fully agree that these implementation details are essential for reproducibility. In the revised manuscript we will add: (1) the precise operational definition of the collapse metric (the percentage of generated ideas whose embedding similarity to any prior-batch idea exceeds the deduplication threshold delta, averaged across seeds); (2) the complete prompt templates for verbalized tail sampling (including the self-assessment instruction) and adaptive prompt evolution (including memory integration and rotating strategies) in a new Appendix; and (3) the HDBSCAN hyperparameters (min_cluster_size, metric, and cluster selection method) together with the sensitivity analysis protocol. These additions will enable exact replication of the reported ablations and independent validation. revision: yes
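Given the operational definition committed to in the response above, a minimal reference computation might look like the sketch below; the embedding model and delta = 0.85 follow the abstract, while the function itself is an assumption and per-seed averaging would be done outside it.

```python
# Sketch of the collapse metric: the percentage of ideas whose maximum cosine similarity
# to any prior-batch idea exceeds the deduplication threshold delta.
import numpy as np
from sentence_transformers import SentenceTransformer


def collapse_rate(batches: list[list[str]], delta: float = 0.85,
                  model_name: str = "all-MiniLM-L6-v2") -> float:
    embedder = SentenceTransformer(model_name)
    prior: np.ndarray | None = None
    collapsed, total = 0, 0
    for batch in batches:
        embs = embedder.encode(batch, normalize_embeddings=True)
        if prior is not None:
            sims = embs @ prior.T                       # cosine similarity via normalized dot products
            collapsed += int((sims.max(axis=1) > delta).sum())
        total += len(batch)
        prior = embs if prior is None else np.vstack([prior, embs])
    return 100.0 * collapsed / max(total, 1)
```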
Circularity Check
No circularity: empirical results rely on external post-hoc metrics
Full rationale
The paper introduces DCE as a set of mechanisms (verbalized tail sampling, semantic memory via embedding index, adaptive prompt evolution) and supports its claims solely through ablation experiments reporting collapse rates and HDBSCAN cluster counts. These metrics are computed after generation using an explicitly independent embedding model (all-MiniLM-L6-v2) and HDBSCAN, with no equations, fitted parameters, or self-citations that define the reported outcomes by construction. The diversity counts are measured outcomes rather than quantities forced by the method's internal state or prior author results. The work is self-contained as an empirical engineering contribution with external validation tools.
Axiom & Free-Parameter Ledger
free parameters (2)
- VTS threshold tau
- dedup threshold delta
axioms (3)
- Domain assumption: Large language models can reliably self-assess the obviousness of their own generated ideas when prompted to verbalize a guess (see the illustrative sketch after this list).
- Domain assumption: Embedding vectors from models such as all-MiniLM-L6-v2 capture semantic similarity well enough for effective deduplication and diversity measurement.
- Domain assumption: HDBSCAN clustering on embeddings yields a stable and meaningful indicator of conceptual richness.
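Because the paper's actual prompt templates are not published in the reviewed material (a point raised in the referee report above), the following verbalized-tail-sampling sketch is purely illustrative: the prompt wording, the JSON response format, and the default tau are all assumptions.

```python
# Illustrative VTS prompt and filter; not the paper's template.
import json


def vts_prompt(ideas: list[str]) -> str:
    # Ask the model to rate how obvious each idea is, as a probability in [0, 1].
    listing = "\n".join(f"- {idea}" for idea in ideas)
    return ("For each idea below, estimate how obvious it is as a probability in [0, 1], "
            "where 1.0 means almost any model would propose it first. "
            'Reply as a JSON list of {"idea": ..., "obviousness": ...} objects.\n\n' + listing)


def filter_obvious(ideas: list[str], model_reply: str, tau: float = 0.7) -> list[str]:
    # Keep only ideas the model itself rated below the VTS threshold tau (illustrative default).
    ratings = {r["idea"]: float(r["obviousness"]) for r in json.loads(model_reply)}
    return [idea for idea in ideas if ratings.get(idea, 0.0) < tau]
```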
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation: washburn_uniqueness_aczel (tag: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "verbalized tail sampling (model labels each idea with a guess about how obvious it is... P≥τ discarded); semantic memory... persistent embedding index... cosine similarity... δ=0.85; adaptive prompt evolution... rotating diversity strategies"
- IndisputableMonolith/Foundation/DimensionForcing: reality_from_one_distinction (tag: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "HDBSCAN cluster counts... 17-18 vs naive's volatile 2-17"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160-172. Springer, 2013.
- [2] Chroma. Chroma: The open-source embedding database. https://www.trychroma.com/, 2023.
- [3] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021.
- [4] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020.
- [5] Martin Josifoski, Marija Sakota, Maxime Peyrard, and Robert West. Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1555-1574, 2023.
- [6] Haoran Li et al. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064, 2024.
- [7] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018.
- [8] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training, 2022.
- [9] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3982-3992, 2019.
- [10] Burr Settles. Active Learning Literature Survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
- [11] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. AI models collapse when trained on recursively generated data. Nature, 631:755-759, 2024.
- [12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012.
- [13] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023.
- [14] Tongshuang Wu, Haiyi Zhu, Maya Albayrak, Alexis Axon, Amanda Bertsch, Wenxing Deng, Ziqi Ding, Bill Guo, Sireesh Gururaja, Tzu-Sheng Kuo, et al. LLMs as workers in human-computational algorithms? Replicating crowdsourcing pipelines with LLMs. arXiv preprint arXiv:2307.10168, 2023.
- [15] Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. Large language model as attributed training data generator: A tale of diversity and bias. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [16] Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. arXiv preprint arXiv:2510.01171, 2025. https://www.verbalized-sampling.com/