
arXiv: 2504.20605 · v2 · submitted 2025-04-29 · cs.CL · cs.AI · cs.DL · cs.LG

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Pith reviewed 2026-05-22 18:58 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.DL, cs.LG
keywords: synthetic moral fables, small language models, dataset creation, instruction-tuned models, LLM judges, value alignment, narrative generation, open source data

The pith

Three million moral fables can be generated by instruction-tuned models no larger than 8 billion parameters through a fixed six-part template.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a large open dataset to fill the gap in structured moral narratives that pair stories with explicit ethical lessons. It uses a combinatorial system to produce three million English fables, each built from character, trait, setting, conflict, resolution, and moral slots. Quality is measured by a panel of small open models that score grammar, creativity, moral clarity, and template fit, with one 8B Llama variant emerging as the strongest performer at low cost. The authors release the full dataset, generation code, and evaluation scripts so others can reproduce the work exactly. This setup lets researchers explore narrative generation and value alignment using only accessible open tools instead of large closed systems.

Core claim

The authors establish that a combinatorial prompt engine applied to instruction-tuned models no larger than 8B parameters can produce three million coherent moral fables, each following a character-trait-setting-conflict-resolution-moral scaffold, and that a panel of open-weight judges from different families can score these fables reliably enough to select high-quality outputs and benchmark costs at roughly $0.135 per thousand stories.

What carries the argument

A combinatorial prompt engine that fills a six-slot scaffold of character, trait, setting, conflict, resolution, and moral to enforce genre consistency across a broad range of themes.
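
A minimal sketch of how such an engine could work is given below. The slot values and the prompt wording are hypothetical and are not taken from the paper's released code; the sketch only illustrates the idea of enumerating every combination of the six slots and rendering each one into a generation prompt.

```python
from itertools import product

# Hypothetical slot values; the paper's actual lists are not reproduced here.
SLOTS = {
    "character": ["fox", "tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["forest", "riverbank"],
    "conflict": ["a dispute over food"],
    "resolution": ["sharing fairly"],
    "moral": ["Patience is rewarded."],
}

# Hypothetical prompt wording, for illustration only.
TEMPLATE = (
    "Write a short fable about a {trait} {character} in a {setting}. "
    "The conflict is {conflict}; it is resolved by {resolution}. "
    "End with the moral: {moral}"
)


def build_prompts(slots=SLOTS, template=TEMPLATE):
    """Yield one prompt per combination of slot values."""
    names = list(slots)
    for combo in product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


prompts = list(build_prompts())
print(len(prompts))  # 2 * 2 * 2 * 1 * 1 * 1 = 8 combinations
print(prompts[0])
```

Because the number of prompts is the product of the slot list sizes, modest lists per slot are enough to reach millions of distinct prompts.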

If this is right

  • Small open models can now serve as practical generators of large moral story corpora without proprietary infrastructure.
  • The released evaluation scripts provide a reproducible baseline for measuring narrative quality and ethical content in synthetic text.
  • The dataset supplies ready training material for studying how language models acquire instruction following and value alignment.
  • The reported cost of roughly $0.135 per 1,000 fables suggests that consumer-grade hardware suffices for producing structured educational stories at scale.
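
As a rough worked example of what the reported rate implies, the sketch below extrapolates the $0.135 per 1,000 fables figure to the full three million stories. It assumes the cost scales linearly with the number of fables, which this summary does not confirm for the complete generation run.

```python
COST_PER_1000_FABLES_USD = 0.135  # rate reported in the abstract
DATASET_SIZE = 3_000_000          # number of fables in TF1-EN-3M


def generation_cost(n_fables, cost_per_1000=COST_PER_1000_FABLES_USD):
    """Linear extrapolation of generation cost in USD (an assumption)."""
    return n_fables / 1000 * cost_per_1000


total = generation_cost(DATASET_SIZE)
print(f"Estimated cost for {DATASET_SIZE:,} fables: ${total:,.2f}")
```

Under that linearity assumption, the whole corpus comes to about $405 of generation cost.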

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaffold method might be reused to generate moral stories in additional languages or for specialized ethical domains.
  • Fine-tuning small models on this corpus could be tested for downstream gains in coherent storytelling or ethical reasoning benchmarks.
  • The dataset offers a controlled starting point for experiments that mix synthetic moral fables with human-written examples.

Load-bearing premise

The argument assumes that scores from open LLM judges drawn from different families reliably match human judgments of moral clarity and template adherence in the generated fables.

What would settle it

A human rating study on a random sample of the fables that shows low correlation between the LLM judge scores for moral clarity and the ratings given by people.
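
A check of this kind could be sketched as follows. The scores are made-up placeholders, not data from the paper, and Pearson correlation is only one reasonable choice of agreement statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up moral-clarity scores for five fables, for illustration only.
judge_moral_clarity = [8.0, 6.5, 9.0, 4.0, 7.5]
human_moral_clarity = [7.0, 6.0, 9.5, 3.5, 8.0]

r = pearson(judge_moral_clarity, human_moral_clarity)
print(f"Pearson r = {r:.3f}")
```

A value of r near zero on a properly sampled set would undercut the premise, while a high value would support using the judges as a proxy.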

Figures

Figures reproduced from arXiv: 2504.20605 by Andreea Tomescu, Andrei Piscoran, Laura Diosan, Mihai Nadas.

Figure 1: Full pipeline for generating TF1-EN-3M.

  • Adherence to the classic fable format, composed of six core elements: character, trait, setting, conflict, resolution, and moral.

It also defined five distinct age groups (A–E) used later during evaluation to assess target audience alignment. These audience categories helped our LLM-based critic assign each story to an appropriate demographic bracket. The system me…
Original abstract

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We present TF1-EN-3M, to our knowledge the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A fully reproducible evaluation pipeline employs a panel of open-weight LLM judges from distinct model families, scoring grammar, creativity, moral clarity, and template adherence, complemented by reference-free diversity and readability metrics. Among ten open-weight generator candidates, an 8B-parameter Llama-3 variant delivers the best quality-cost trade-off, producing high-scoring fables on consumer hardware at approximately $0.135 per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI -- demonstrating that large-scale moral storytelling requires neither proprietary giant models nor proprietary evaluation infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents TF1-EN-3M, a dataset of three million English-language moral fables generated exclusively by instruction-tuned models no larger than 8B parameters. Generation uses a combinatorial prompt engine enforcing a fixed six-slot scaffold (character, trait, setting, conflict, resolution, moral) to ensure genre fidelity and broad thematic coverage. A reproducible evaluation pipeline applies open-weight LLM judges from distinct families to score grammar, creativity, moral clarity, and template adherence, supplemented by reference-free diversity and readability metrics. The authors identify an 8B Llama-3 variant as providing the best quality-cost trade-off at roughly $0.135 per 1,000 fables on consumer hardware and release the full dataset, generation code, evaluation scripts, and metadata under a permissive license.

Significance. If the quality and diversity claims are substantiated, TF1-EN-3M would constitute a valuable open resource for research on narrative generation, instruction following, value alignment, and child-friendly educational AI. The exclusive use of small open-weight models, fully reproducible pipeline, and public release of all components (including cost benchmarking) strengthen its potential impact by lowering barriers to entry and enabling exact replication without proprietary infrastructure.

major comments (2)
  1. [Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.
  2. [Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.
minor comments (3)
  1. [Generation section] Clarify the exact procedure used to sample the 3M combinations from the combinatorial engine and report any deduplication or diversity-enforcement steps beyond the aggregate metrics.
  2. [Evaluation section] The readability and diversity metrics would benefit from explicit comparison to a small set of human-written fables or established literary corpora to provide context for the reported values.
  3. Ensure every table and figure is referenced in the main text and that axis labels and legends are fully legible without reference to the caption.
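
For context on the first minor comment, one way to draw distinct slot combinations without materializing the full combinatorial space is sketched below. This is not the authors' procedure, which the report says is unspecified; the function name and slot values are hypothetical.

```python
import random


def sample_unique_combinations(slot_values, n, seed=0):
    """Draw n distinct slot combinations without building the full product."""
    total = 1
    for values in slot_values:
        total *= len(values)
    if n > total:
        raise ValueError(f"requested {n} combinations, only {total} exist")
    rng = random.Random(seed)
    # Sampling distinct integers guarantees distinct combinations.
    indices = rng.sample(range(total), n)
    combos = []
    for index in indices:
        combo = []
        # Decode the integer as a mixed-radix number, one digit per slot.
        for values in reversed(slot_values):
            index, digit = divmod(index, len(values))
            combo.append(values[digit])
        combos.append(tuple(reversed(combo)))
    return combos


# Hypothetical slot lists: 3 * 2 * 2 = 12 possible combinations.
slots = [["fox", "crow", "ant"], ["vain", "wise"], ["forest", "meadow"]]
picked = sample_unique_combinations(slots, 5, seed=42)
print(picked)
```

Sampling by index rules out duplicate prompts by construction, although two different prompts can still yield near-identical stories, so output-level deduplication would remain a separate step.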

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation methodology and results. We address each major point below, acknowledging the limitations of relying solely on LLM judges while proposing targeted revisions to improve robustness without altering the core contribution of the open, reproducible dataset and pipeline.

Point-by-point responses
  1. Referee: [Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.

    Authors: We agree that the lack of human validation represents a genuine limitation for claims involving narrative depth, ethical coherence, and downstream applications in value alignment or education. The combinatorial scaffold ensures structural fidelity by design, but we did not claim it guarantees depth. Our pipeline uses multiple open-weight judges from distinct families precisely to increase robustness and reduce single-model bias, with supplementary reference-free metrics. To address this, we will add a dedicated Limitations subsection in the revised Evaluation section and include a small-scale human correlation study on a random sample of 200 fables (rated by three independent human annotators for moral clarity and creativity), reporting Pearson correlations with the LLM-judge scores.

    Revision: partial

  2. Referee: [Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.

    Authors: The 8B Llama-3 variant was selected based on aggregated LLM-judge scores because these enable fully reproducible, scalable comparison across all ten candidates on consumer hardware. We recognize that this makes the trade-off claim dependent on the validity of the judges as proxies. In the revision we will qualify the Results section accordingly, present the quality-cost figures with explicit caveats, and incorporate the correlation results from the planned human pilot study to provide supporting evidence for the selection while noting it remains a proxy measure.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations or self-referential predictions

Full rationale

The paper describes the generation of a 3M fable dataset via combinatorial prompts on small instruction-tuned models and an evaluation pipeline using open-weight LLM judges from distinct families. No equations, fitted parameters, predictions, or uniqueness theorems appear. Central claims (first open dataset of this scale, quality-cost trade-off, reproducibility) rest on the released code, data, and external model outputs rather than reducing to the paper's own inputs by construction. LLM-judge scores are a methodological choice whose validity can be checked externally; they do not create a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that small instruction-tuned models can reliably produce genre-faithful moral fables and that LLM-as-judge scoring is a sufficient proxy for quality; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: Instruction-tuned language models no larger than 8B parameters can generate coherent, genre-faithful moral fables when guided by a combinatorial six-slot prompt scaffold.
    This premise underpins the entire generation pipeline described in the abstract.
  • Domain assumption: A panel of open-weight LLM judges from distinct model families can produce reliable scores for grammar, creativity, moral clarity, and template adherence.
    This premise supports the evaluation pipeline and quality claims.
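
To make the second assumption concrete, the sketch below shows one way a panel's scores could be combined. The judge names, the 1-10 scale, and the unweighted mean are all assumptions made for illustration; the summary above does not state how the paper aggregates judge scores.

```python
from statistics import mean

CRITERIA = ("grammar", "creativity", "moral_clarity", "template_adherence")


def aggregate_panel(scores_by_judge):
    """Average each criterion across judges, then average the criteria."""
    per_criterion = {
        criterion: mean(scores[criterion] for scores in scores_by_judge.values())
        for criterion in CRITERIA
    }
    per_criterion["overall"] = mean(per_criterion[c] for c in CRITERIA)
    return per_criterion


# Hypothetical judges and scores on an assumed 1-10 scale.
panel = {
    "judge_family_a": {
        "grammar": 9, "creativity": 6, "moral_clarity": 8, "template_adherence": 9,
    },
    "judge_family_b": {
        "grammar": 8, "creativity": 7, "moral_clarity": 7, "template_adherence": 10,
    },
}

result = aggregate_panel(panel)
print(result)
```

Averaging across model families dampens the bias of any single judge, but it cannot correct a bias that all judges share, which is why the premise still needs checking against human ratings.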

pith-pipeline@v0.9.0 · 5785 in / 1546 out tokens · 52790 ms · 2026-05-22T18:58:00.258043+00:00 · methodology


Appendix A. Hardware and Environment Configurations

We benchmarked inference for the TF1-EN-3M dataset under several GPU configurations using Llama-3.1-8B-Instruct with identical prompts and decoding settings. All experiments were executed on Hugging Face Inference Endpoints; the hourly tariffs advertised by Hugging Face in April 2025 we...