TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Pith reviewed 2026-05-22 18:58 UTC · model grok-4.3
The pith
Three million moral fables can be generated by instruction-tuned models no larger than 8 billion parameters through a fixed six-part template.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a combinatorial prompt engine applied to instruction-tuned models no larger than 8B parameters can produce three million coherent moral fables, each following a character-trait-setting-conflict-resolution-moral scaffold. They further claim that a panel of open-weight judges from different model families can score these fables reliably enough to select high-quality outputs, and they benchmark generation cost at roughly $0.135 per 1,000 fables.
What carries the argument
A combinatorial prompt engine that fills a six-slot scaffold of character, trait, setting, conflict, resolution, and moral to enforce genre consistency across a broad range of themes.
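To make the mechanism concrete, here is a minimal sketch of a six-slot combinatorial prompt engine. The slot vocabularies, the `TEMPLATE` wording, and the function names are all hypothetical placeholders; the paper's actual lists and prompt text are not reproduced in this summary.

```python
import itertools
import random

# Hypothetical slot vocabularies (illustrative only, not the paper's lists).
SLOTS = {
    "character": ["fox", "tortoise", "crow"],
    "trait": ["greedy", "patient"],
    "setting": ["a riverbank", "a mountain village"],
    "conflict": ["a shortage of food", "a broken promise"],
    "resolution": ["sharing what remains", "an honest apology"],
    "moral": ["Generosity outlasts greed.", "Trust is earned slowly."],
}

# Hypothetical prompt wording that names every slot exactly once.
TEMPLATE = (
    "Write a short fable about a {trait} {character} in {setting}. "
    "The conflict is {conflict}; it is resolved by {resolution}. "
    "End with the moral: {moral}"
)


def all_combinations(slots):
    """Yield one dict per point in the Cartesian product of the slot lists."""
    names = list(slots)
    for values in itertools.product(*(slots[name] for name in names)):
        yield dict(zip(names, values))


def sample_prompts(slots, k, seed=0):
    """Draw k distinct slot combinations and render each one as a prompt."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    combos = list(all_combinations(slots))
    return [TEMPLATE.format(**combo) for combo in rng.sample(combos, k)]


total = sum(1 for _ in all_combinations(SLOTS))  # size of the prompt space
prompts = sample_prompts(SLOTS, k=5)
```

The size of the prompt space is the product of the slot list lengths, which is why modest vocabularies can cover millions of combinations. Materializing the full product in a list, as this sketch does, is only reasonable for small vocabularies; a corpus-scale run would more plausibly sample slot values independently.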
If this is right
- Small open models can now serve as practical generators of large moral story corpora without proprietary infrastructure.
- The released evaluation scripts provide a reproducible baseline for measuring narrative quality and ethical content in synthetic text.
- The dataset supplies ready training material for studying how language models acquire instruction following and value alignment.
- The reported cost of roughly $0.135 per 1,000 fables indicates that consumer-grade hardware suffices for producing structured educational stories at scale.
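As a back-of-the-envelope check on the cost claim, the sketch below scales the quoted rate to the full corpus. It assumes the $0.135 per 1,000 fables rate holds uniformly across all three million fables, which this summary does not confirm; the function name is illustrative.

```python
from decimal import Decimal

COST_PER_1000 = Decimal("0.135")  # USD per 1,000 fables, as quoted in the abstract
N_FABLES = 3_000_000              # corpus size, as quoted in the abstract


def generation_cost(n_fables, cost_per_1000=COST_PER_1000):
    """Return the total cost in USD, assuming a flat per-1,000 rate."""
    return cost_per_1000 * Decimal(n_fables) / Decimal(1000)


total_cost = generation_cost(N_FABLES)  # 3,000 batches of 1,000 at $0.135 each
```

Under that flat-rate assumption the whole corpus comes to about $405 of inference, excluding judge scoring and any discarded generations.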
Where Pith is reading between the lines
- The same scaffold method might be reused to generate moral stories in additional languages or for specialized ethical domains.
- Fine-tuning small models on this corpus could be tested for downstream gains in coherent storytelling or ethical reasoning benchmarks.
- The dataset offers a controlled starting point for experiments that mix synthetic moral fables with human-written examples.
Load-bearing premise
That scores from open LLM judges drawn from different families reliably match what humans would say about moral clarity and template adherence in the generated fables.
What would settle it
A human rating study on a random sample of the fables that shows low correlation between the LLM judge scores for moral clarity and the ratings given by people.
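A minimal sketch of that check follows: compute the Pearson correlation between LLM-judge scores and human ratings for the same fables. The score lists are invented placeholders, not results from the paper, and the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data: one moral-clarity score per fable from each source.
judge_scores = [4.5, 3.0, 2.0, 5.0, 3.5]
human_scores = [4.0, 3.5, 2.5, 4.5, 3.0]

r = pearson(judge_scores, human_scores)
```

A value of `r` near zero on a real random sample would undercut the load-bearing premise; a rank-based coefficient such as Spearman's could be substituted if the rating scales are ordinal.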
Original abstract
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We present TF1-EN-3M, to our knowledge the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A fully reproducible evaluation pipeline employs a panel of open-weight LLM judges from distinct model families, scoring grammar, creativity, moral clarity, and template adherence, complemented by reference-free diversity and readability metrics. Among ten open-weight generator candidates, an 8B-parameter Llama-3 variant delivers the best quality-cost trade-off, producing high-scoring fables on consumer hardware at approximately $0.135 per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI -- demonstrating that large-scale moral storytelling requires neither proprietary giant models nor proprietary evaluation infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TF1-EN-3M, a dataset of three million English-language moral fables generated exclusively by instruction-tuned models no larger than 8B parameters. Generation uses a combinatorial prompt engine enforcing a fixed six-slot scaffold (character, trait, setting, conflict, resolution, moral) to ensure genre fidelity and broad thematic coverage. A reproducible evaluation pipeline applies open-weight LLM judges from distinct families to score grammar, creativity, moral clarity, and template adherence, supplemented by reference-free diversity and readability metrics. The authors identify an 8B Llama-3 variant as providing the best quality-cost trade-off at roughly $0.135 per 1,000 fables on consumer hardware and release the full dataset, generation code, evaluation scripts, and metadata under a permissive license.
Significance. If the quality and diversity claims are substantiated, TF1-EN-3M would constitute a valuable open resource for research on narrative generation, instruction following, value alignment, and child-friendly educational AI. The exclusive use of small open-weight models, fully reproducible pipeline, and public release of all components (including cost benchmarking) strengthen its potential impact by lowering barriers to entry and enabling exact replication without proprietary infrastructure.
major comments (2)
- [Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.
- [Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.
minor comments (3)
- [Generation section] Clarify the exact procedure used to sample the 3M combinations from the combinatorial engine and report any deduplication or diversity-enforcement steps beyond the aggregate metrics.
- [Evaluation section] The readability and diversity metrics would benefit from explicit comparison to a small set of human-written fables or established literary corpora to provide context for the reported values.
- Ensure every table and figure is referenced in the main text and that axis labels and legends are fully legible without reference to the caption.
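The deduplication and diversity points in the minor comments can be illustrated with a short sketch. Distinct-n is one common reference-free diversity measure and exact-match hashing is one simple deduplication step; whether the paper uses either in this form is not stated here, and the function names and whitespace tokenization are assumptions.

```python
import hashlib


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def deduplicate(texts):
    """Keep the first occurrence of each text, ignoring case and spacing."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


fables = [
    "The fox shared his food.",
    "the fox  shared his food.",  # duplicate after normalization
    "The crow kept her word.",
]
unique_fables = deduplicate(fables)
bigram_diversity = distinct_n(unique_fables, n=2)
```

Exact-match hashing only catches literal repeats; near-duplicate fables that differ by a few words would need a fuzzier method, which is part of what the referee is asking the authors to document.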
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the evaluation methodology and results. We address each major point below, acknowledging the limitations of relying solely on LLM judges while proposing targeted revisions to improve robustness without altering the core contribution of the open, reproducible dataset and pipeline.
- Referee: [Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.
  Authors: We agree that the lack of human validation represents a genuine limitation for claims involving narrative depth, ethical coherence, and downstream applications in value alignment or education. The combinatorial scaffold ensures structural fidelity by design, but we did not claim it guarantees depth. Our pipeline uses multiple open-weight judges from distinct families precisely to increase robustness and reduce single-model bias, with supplementary reference-free metrics. To address this, we will add a dedicated Limitations subsection in the revised Evaluation section and include a small-scale human correlation study on a random sample of 200 fables (rated by three independent human annotators for moral clarity and creativity), reporting Pearson correlations with the LLM-judge scores.
  Revision: partial
- Referee: [Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.
  Authors: The 8B Llama-3 variant was selected based on aggregated LLM-judge scores because these enable fully reproducible, scalable comparison across all ten candidates on consumer hardware. We recognize that this makes the trade-off claim dependent on the validity of the judges as proxies. In the revision we will qualify the Results section accordingly, present the quality-cost figures with explicit caveats, and incorporate the correlation results from the planned human pilot study to provide supporting evidence for the selection while noting it remains a proxy measure.
  Revision: partial
Circularity Check
No circularity: dataset construction paper with no derivations or self-referential predictions
The paper describes the generation of a 3M fable dataset via combinatorial prompts on small instruction-tuned models and an evaluation pipeline using open-weight LLM judges from distinct families. No equations, fitted parameters, predictions, or uniqueness theorems appear. Central claims (first open dataset of this scale, quality-cost trade-off, reproducibility) rest on the released code, data, and external model outputs rather than reducing to the paper's own inputs by construction. LLM-judge scores are a methodological choice whose validity can be checked externally; they do not create a definitional loop.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Instruction-tuned language models no larger than 8B parameters can generate coherent, genre-faithful moral fables when guided by a combinatorial six-slot prompt scaffold.
- Domain assumption: A panel of open-weight LLM judges from distinct model families can produce reliable scores for grammar, creativity, moral clarity, and template adherence.
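The second assumption can be probed directly from the judge outputs. The sketch below averages each criterion across a panel and reports the spread between judges as a crude disagreement signal. The judge names, the 1-10 scale, and the scores are hypothetical; the paper's actual aggregation rule is not described in this summary.

```python
from statistics import mean

# The four criteria named in the abstract.
CRITERIA = ("grammar", "creativity", "moral_clarity", "template_adherence")


def aggregate_panel(scores_by_judge):
    """Average each criterion across judges and report the max-min spread."""
    summary = {}
    for criterion in CRITERIA:
        values = [scores[criterion] for scores in scores_by_judge.values()]
        summary[criterion] = {
            "mean": mean(values),
            "spread": max(values) - min(values),  # large spread = disagreement
        }
    return summary


# Placeholder scores for a single fable on an assumed 1-10 scale.
panel = {
    "judge_a": {"grammar": 9, "creativity": 6, "moral_clarity": 8, "template_adherence": 10},
    "judge_b": {"grammar": 8, "creativity": 7, "moral_clarity": 6, "template_adherence": 10},
    "judge_c": {"grammar": 10, "creativity": 5, "moral_clarity": 7, "template_adherence": 10},
}

summary = aggregate_panel(panel)
```

Low spread across judges from different families is evidence of consistency, not of validity: the judges could agree with one another and still diverge from human ratings.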
Appendix A: Hardware and Environment Configurations
We benchmarked inference for the TF1-EN-3M dataset under several GPU configurations using Llama-3.1-8B-Instruct with identical prompts and decoding settings. All experiments were executed on Hugging Face Inference Endpoints; the hourly tariffs advertised by Hugging Face in April 2025 we...