TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Andreea Tomescu; Andrei Piscoran; Laura Diosan; Mihai Nadas

arXiv: 2504.20605 · v2 · submitted 2025-04-29 · cs.CL · cs.AI · cs.DL · cs.LG

Pith reviewed 2026-05-22 18:58 UTC · model grok-4.3

keywords: synthetic moral fables · small language models · dataset creation · instruction-tuned models · LLM judges · value alignment · narrative generation · open source data

The pith

Three million moral fables can be generated by instruction-tuned models no larger than 8 billion parameters through a fixed six-part template.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a large open dataset to fill the gap in structured moral narratives that pair stories with explicit ethical lessons. It uses a combinatorial system to produce three million English fables, each built from character, trait, setting, conflict, resolution, and moral slots. Quality is measured by a panel of small open models that score grammar, creativity, moral clarity, and template fit, with one 8B Llama variant emerging as the strongest performer at low cost. The authors release the full dataset, generation code, and evaluation scripts so others can reproduce the work exactly. This setup lets researchers explore narrative generation and value alignment using only accessible open tools instead of large closed systems.

Core claim

The authors establish that a combinatorial prompt engine applied to instruction-tuned models no larger than 8B parameters can produce three million coherent moral fables, each following a character-trait-setting-conflict-resolution-moral scaffold, and that a panel of open-weight judges from different families can score these fables reliably enough to select high-quality outputs and benchmark costs at roughly $0.135 per thousand stories.

What carries the argument

A combinatorial prompt engine that fills a six-slot scaffold of character, trait, setting, conflict, resolution, and moral to enforce genre consistency across a broad range of themes.
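A minimal sketch of how a six-slot combinatorial prompt engine of this kind could work. The slot vocabularies, the template wording, and the names `SLOTS`, `TEMPLATE`, `all_combinations`, and `build_prompt` are all hypothetical; the authors' actual lists and prompt text live in their released generation code and may differ.

```python
import itertools

# Hypothetical slot vocabularies (assumption: the real lists are far larger).
SLOTS = {
    "character": ["fox", "tortoise"],
    "trait": ["greedy", "patient"],
    "setting": ["a forest", "a riverbank"],
    "conflict": ["a shortage of food", "a race against a rival"],
    "resolution": ["sharing what was found", "asking for help"],
    "moral": ["Greed leaves everyone poorer.", "Slow and steady wins the race."],
}

# Hypothetical prompt template that mentions every slot exactly once.
TEMPLATE = (
    "Write a short fable about a {trait} {character} in {setting}. "
    "The conflict is {conflict}; it is resolved by {resolution}. "
    "End with the moral: {moral}"
)


def all_combinations(slots):
    """Yield one dict per point in the Cartesian product of the slot lists."""
    names = list(slots)
    for values in itertools.product(*(slots[name] for name in names)):
        yield dict(zip(names, values))


def build_prompt(combo):
    """Fill the six-slot scaffold with one concrete combination."""
    return TEMPLATE.format(**combo)


combos = list(all_combinations(SLOTS))
prompts = [build_prompt(combo) for combo in combos]
print(len(prompts))  # 2**6 combinations for this toy vocabulary
print(prompts[0])
```

Because every prompt is built from the same template, the structure of the request is fixed while the content varies; whether the generated story then honors all six slots is a separate question that the judges score as template adherence.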

If this is right

Small open models can now serve as practical generators of large moral story corpora without proprietary infrastructure.
The released evaluation scripts provide a reproducible baseline for measuring narrative quality and ethical content in synthetic text.
The dataset supplies ready training material for studying how language models acquire instruction following and value alignment.
Cost figures demonstrate that consumer-grade hardware suffices for producing thousands of structured educational stories at scale.
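The cost figure can be turned into a back-of-the-envelope estimate for the whole corpus. This sketch assumes the reported rate of about $0.135 per 1,000 fables applies uniformly to all three million stories and covers generation only; both are assumptions, and the function name is illustrative.

```python
COST_PER_1000_FABLES_USD = 0.135  # rate reported in the abstract
DATASET_SIZE = 3_000_000          # number of fables in TF1-EN-3M


def generation_cost(n_fables, cost_per_1000=COST_PER_1000_FABLES_USD):
    """Linear extrapolation of the per-thousand rate to n_fables stories."""
    return n_fables / 1000 * cost_per_1000


total = generation_cost(DATASET_SIZE)
print(f"Estimated generation cost: ${total:.2f}")
```

Under these assumptions the full corpus works out to roughly $405 of generation cost, before any judge-model evaluation is counted.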

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scaffold method might be reused to generate moral stories in additional languages or for specialized ethical domains.
Fine-tuning small models on this corpus could be tested for downstream gains in coherent storytelling or ethical reasoning benchmarks.
The dataset offers a controlled starting point for experiments that mix synthetic moral fables with human-written examples.

Load-bearing premise

That scores from open LLM judges drawn from different families reliably match what humans would say about moral clarity and template adherence in the generated fables.

What would settle it

A human rating study on a random sample of the fables that shows low correlation between the LLM judge scores for moral clarity and the ratings given by people.

Figures

Figures reproduced from arXiv: 2504.20605 by Andreea Tomescu, Andrei Piscoran, Laura Diosan, Mihai Nadas.

**Figure 1.** Full pipeline for generating TF1-EN-3M. • Adherence to the classic fable format, composed of six core elements: character, trait, setting, conflict, resolution, and moral. It also defined five distinct age groups (A–E) used later during evaluation to assess target audience alignment. These audience categories helped our LLM-based critic assign each story to an appropriate demographic bracket. The system me…

Original abstract

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We present TF1-EN-3M, to our knowledge the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A fully reproducible evaluation pipeline employs a panel of open-weight LLM judges from distinct model families, scoring grammar, creativity, moral clarity, and template adherence, complemented by reference-free diversity and readability metrics. Among ten open-weight generator candidates, an 8B-parameter Llama-3 variant delivers the best quality-cost trade-off, producing high-scoring fables on consumer hardware at approximately $0.135 per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI -- demonstrating that large-scale moral storytelling requires neither proprietary giant models nor proprietary evaluation infrastructure.
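The abstract mentions reference-free diversity metrics without naming them on this page. As an illustration only, distinct-n (the ratio of unique to total n-grams across a set of texts) is one widely used measure of this kind. The sketch below assumes lowercase whitespace tokenization and is not taken from the released evaluation scripts.

```python
def distinct_n(texts, n=2):
    """Return unique n-grams divided by total n-grams over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Sliding window of length n over the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


fables = [
    "the fox shared the grapes",
    "the fox shared the water",
]
print(distinct_n(fables, n=1))
print(distinct_n(fables, n=2))
```

Values near 1 indicate little repeated phrasing; values near 0 indicate that the same n-grams recur across stories, which is a known risk when many prompts share one template.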

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This paper ships a 3M open moral-fable dataset from small models with full code and cost numbers, but the quality claims depend on LLM judges that lack human checks.

The main point is a new release of three million English moral fables, each built from a fixed six-part template and generated only by open models of 8B parameters or less. The authors used a combinatorial prompt engine to vary characters, traits, settings, and morals while keeping the fable form intact, then scored the output with a panel of other open LLMs on grammar, creativity, moral clarity, and template fit. They report that one Llama-3 8B variant gave the best balance and could run cheaply on ordinary hardware, with explicit numbers like $0.135 per thousand stories. Everything is released with generation scripts, evaluation code, and metadata under a permissive license.

Referee Report

2 major / 3 minor

Summary. The manuscript presents TF1-EN-3M, a dataset of three million English-language moral fables generated exclusively by instruction-tuned models no larger than 8B parameters. Generation uses a combinatorial prompt engine enforcing a fixed six-slot scaffold (character, trait, setting, conflict, resolution, moral) to ensure genre fidelity and broad thematic coverage. A reproducible evaluation pipeline applies open-weight LLM judges from distinct families to score grammar, creativity, moral clarity, and template adherence, supplemented by reference-free diversity and readability metrics. The authors identify an 8B Llama-3 variant as providing the best quality-cost trade-off at roughly $0.135 per 1,000 fables on consumer hardware and release the full dataset, generation code, evaluation scripts, and metadata under a permissive license.
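To make the judge-panel idea concrete, here is a small sketch of how per-criterion scores from several judges might be combined. The 1 to 10 scale, the unweighted mean, the judge names, and the `aggregate` function are assumptions made for illustration; the paper's actual aggregation rule is not described on this page.

```python
from statistics import mean

# The four criteria named in the summary above.
CRITERIA = ("grammar", "creativity", "moral_clarity", "template_adherence")


def aggregate(judge_scores):
    """Average each criterion across judges, then average the criteria.

    judge_scores maps a judge name to a {criterion: score} dict.
    """
    result = {c: mean(s[c] for s in judge_scores.values()) for c in CRITERIA}
    result["overall"] = mean(result[c] for c in CRITERIA)
    return result


# Hypothetical scores from two judges of different model families.
scores = {
    "judge_a": {"grammar": 8, "creativity": 6, "moral_clarity": 9, "template_adherence": 10},
    "judge_b": {"grammar": 6, "creativity": 8, "moral_clarity": 7, "template_adherence": 10},
}
print(aggregate(scores))
```

Averaging across model families dampens the bias of any single judge, but it cannot remove a bias that all judges share, which is the concern behind the major comments that follow.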

Significance. If the quality and diversity claims are substantiated, TF1-EN-3M would constitute a valuable open resource for research on narrative generation, instruction following, value alignment, and child-friendly educational AI. The exclusive use of small open-weight models, fully reproducible pipeline, and public release of all components (including cost benchmarking) strengthen its potential impact by lowering barriers to entry and enabling exact replication without proprietary infrastructure.

major comments (2)

[Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.
[Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.

minor comments (3)

[Generation section] Clarify the exact procedure used to sample the 3M combinations from the combinatorial engine and report any deduplication or diversity-enforcement steps beyond the aggregate metrics.
[Evaluation section] The readability and diversity metrics would benefit from explicit comparison to a small set of human-written fables or established literary corpora to provide context for the reported values.
Ensure every table and figure is referenced in the main text and that axis labels and legends are fully legible without reference to the caption.
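The first minor comment asks how the 3M combinations were drawn. One possible answer, offered as an assumption-laden sketch rather than the authors' procedure, is to sample distinct indices from the full combination space and decode each index into slot values, so that duplicates are ruled out by construction. The function name and the toy slot lists are hypothetical.

```python
import math
import random


def sample_unique_combinations(slots, k, seed=0):
    """Draw k distinct slot combinations without building the full product."""
    names = list(slots)
    sizes = [len(slots[name]) for name in names]
    total = math.prod(sizes)
    if k > total:
        raise ValueError("k exceeds the number of distinct combinations")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    combos = []
    for index in rng.sample(range(total), k):
        combo = {}
        # Decode the index as a mixed-radix number, one digit per slot.
        for name, size in zip(names, sizes):
            index, position = divmod(index, size)
            combo[name] = slots[name][position]
        combos.append(combo)
    return combos


toy_slots = {
    "character": ["fox", "crow"],
    "trait": ["vain", "kind", "sly"],
    "setting": ["a meadow", "a village"],
}
for combo in sample_unique_combinations(toy_slots, k=5, seed=42):
    print(combo)
```

Sampling indices from a `range` keeps memory use proportional to k, which matters when the combination space is much larger than the number of stories wanted.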

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation methodology and results. We address each major point below, acknowledging the limitations of relying solely on LLM judges while proposing targeted revisions to improve robustness without altering the core contribution of the open, reproducible dataset and pipeline.

Point-by-point responses

Referee: [Evaluation section] The central quality claims for moral clarity and template adherence rest exclusively on scores produced by the panel of open-weight LLM judges (Evaluation section). No human correlation study, inter-annotator agreement benchmark, or validation against human raters is reported. Because the combinatorial scaffold guarantees only structural fidelity and not narrative depth or ethical coherence, the absence of such validation is load-bearing for the asserted utility in value alignment and educational applications.

Authors: We agree that the lack of human validation represents a genuine limitation for claims involving narrative depth, ethical coherence, and downstream applications in value alignment or education. The combinatorial scaffold ensures structural fidelity by design, but we did not claim it guarantees depth. Our pipeline uses multiple open-weight judges from distinct families precisely to increase robustness and reduce single-model bias, with supplementary reference-free metrics. To address this, we will add a dedicated Limitations subsection in the revised Evaluation section and include a small-scale human correlation study on a random sample of 200 fables (rated by three independent human annotators for moral clarity and creativity), reporting Pearson correlations with the LLM-judge scores.

Revision: partial

Referee: [Results section] The selection of the 8B Llama-3 variant as the best quality-cost trade-off (Results section) is determined solely by the LLM-judge scores. Without evidence that these scores correlate with human judgments of fable quality, the headline trade-off and downstream suitability claims cannot be considered robust.

Authors: The 8B Llama-3 variant was selected based on aggregated LLM-judge scores because these enable fully reproducible, scalable comparison across all ten candidates on consumer hardware. We recognize that this makes the trade-off claim dependent on the validity of the judges as proxies. In the revision we will qualify the Results section accordingly, present the quality-cost figures with explicit caveats, and incorporate the correlation results from the planned human pilot study to provide supporting evidence for the selection while noting it remains a proxy measure.

Revision: partial
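The human pilot study described in the responses could be analyzed along the following lines. This is a sketch under stated assumptions: ratings from the three annotators are averaged per fable before being correlated with the judge score, and the function names and toy numbers are illustrative, not data from the paper.

```python
from math import sqrt
from statistics import mean


def mean_human_rating(ratings_per_annotator):
    """Average the annotators' ratings fable by fable.

    ratings_per_annotator is a list of equally long lists, one per annotator.
    """
    return [mean(ratings) for ratings in zip(*ratings_per_annotator)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy example: three annotators and one judge score for each of four fables.
human = mean_human_rating([[4, 2, 5, 3], [5, 2, 4, 3], [4, 3, 5, 2]])
judge = [8, 5, 9, 6]
print(round(pearson(human, judge), 3))
```

A coefficient close to 1 would support using the judges as a proxy for human ratings on that criterion; a value near 0 would undercut the quality claims that rest on them.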

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations or self-referential predictions

full rationale

The paper describes the generation of a 3M fable dataset via combinatorial prompts on small instruction-tuned models and an evaluation pipeline using open-weight LLM judges from distinct families. No equations, fitted parameters, predictions, or uniqueness theorems appear. Central claims (first open dataset of this scale, quality-cost trade-off, reproducibility) rest on the released code, data, and external model outputs rather than reducing to the paper's own inputs by construction. LLM-judge scores are a methodological choice whose validity can be checked externally; they do not create a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that small instruction-tuned models can reliably produce genre-faithful moral fables and that LLM-as-judge scoring is a sufficient proxy for quality; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Instruction-tuned language models no larger than 8B parameters can generate coherent, genre-faithful moral fables when guided by a combinatorial six-slot prompt scaffold.
This premise underpins the entire generation pipeline described in the abstract.
Domain assumption: A panel of open-weight LLM judges from distinct model families can produce reliable scores for grammar, creativity, moral clarity, and template adherence.
This premise supports the evaluation pipeline and quality claims.

A. Hardware and Environment Configurations

We benchmarked inference for the TF1-EN-3M dataset under several GPU configurations using Llama-3.1-8B-Instruct with identical prompts and decoding settings. All experiments were executed on Hugging Face Inference Endpoints; the hourly tariffs advertised by Hugging Face in April 2025 we...

[1] [1]

A Corpus for Understanding and Generating Moral Stories, April 2022

Jian Guan, Ziqi Liu, and Minlie Huang. A Corpus for Understanding and Generating Moral Stories, April 2022. arXiv:2204.09438 [cs]

work page arXiv 2022

[2] [2]

Aesop’s Fables

Aesop. Aesop’s Fables. OUP Oxford, July 2002. Google-Books-ID: n2LlrCeYl7gC

work page 2002

[3] [3]

Luccioni, A

Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey, June 2024. arXiv:2406.15126 [cs]

work page arXiv 2024

[4] [4]

Tinystories: How small can language models be and still speak coherent english? ArXiv abs/2305.07759 (2023)

Ronen Eldan and Yuanzhi Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, May 2023. arXiv:2305.07759 [cs]

work page arXiv 2023

[5] [5]

Textbooks Are All You Need

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks Are All You Need, October 2023. arXiv:2306...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Synthetic Data Generation Using Large Language Models: Advances in Text and Code, March 2025

Mihai Nadas, Laura Diosan, and Andreea Tomescu. Synthetic Data Generation Using Large Language Models: Advances in Text and Code, March 2025. arXiv:2503.14023 [cs]

work page arXiv 2025

[7] [7]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as You Desire, February 2023. arXiv:2302.04166 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Texygen: A Bench- marking Platform for Text Generation Models

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A Bench- marking Platform for Text Generation Models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, pages 1097–1100, New York, NY , USA, June 2018. Association for Computing Machinery

work page 2018

[10] [10]

A Diversity-Promoting Objective Function for Neural Conversation Models

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A Diversity-Promoting Objective Function for Neural Conversation Models, June 2016. arXiv:1510.03055 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

J. P. Kincaid and And Others. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical report, National Technical Information Service, Springfield, Virginia 22151 (AD-A006 655/5GA, MF $2, February 1975. ERIC Number: ED108134

work page 1975

[12] [12]

klusai/ds-tf1-en-3m · Datasets at Hugging Face

work page

[13] [13]

What Makes Good In-Context Examples for GPT-$3$?, January 2021

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What Makes Good In-Context Examples for GPT-$3$?, January 2021. arXiv:2101.06804 [cs]

work page arXiv 2021

[14] [14]

Reframing Instructional Prompts to GPTk’s Language, March 2022

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing Instructional Prompts to GPTk’s Language, March 2022. arXiv:2109.07830 [cs]

work page arXiv 2022

[15] [15]

Hierarchical Neural Story Generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation, May 2018. arXiv:1805.04833 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Plan-and-Write: Towards Better Automatic Storytelling

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. Plan-and-Write: Towards Better Automatic Storytelling. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):7378–7385, July 2019. Number: 01

work page 2019

[17] [17]

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023...

work page 2023

[18] [18]

Are large language models good evaluators for abstractive summarization? arXiv preprint arXiv:2305.13091, 2023

Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, and Lidong Bing. Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization, October 2023. arXiv:2305.13091 [cs]

work page arXiv 2023

[19] [19]

Short Fiction, Flash Fiction, Microfiction

Angela Naimou. Short Fiction, Flash Fiction, Microfiction. In Joshua Miller, editor,The Cambridge Companion to Twenty-First Century American Fiction, Cambridge Companions to Literature, pages 21–42. Cambridge University Press, Cambridge, 2021

work page 2021

[20] [20]

Trading Off Diversity and Quality in Natural Language Generation

Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. Trading Off Diversity and Quality in Natural Language Generation. ArXiv, April 2020

work page 2020

[21] [21]

HuggingFaceTB/SmolLM2-1.7B-Instruct · Hugging Face, February 2025

work page 2025

[22] [22]

CohereForAI/aya-23-8B · Hugging Face, March 2025

work page 2025

[23] [23]

13 A PREPRINT - APRIL 30, 2025

meta-llama/Llama-3.2-1B-Instruct · Hugging Face, December 2024. 13 A PREPRINT - APRIL 30, 2025

work page 2024

[24] [24]

meta-llama/Llama-3.1-8B-Instruct · Hugging Face, December 2024

work page 2024

[25] [25]

mistralai/Mistral-7B-Instruct-v0.3 · Hugging Face

work page

[26] [26]

Qwen/Qwen2.5-7B-Instruct · Hugging Face, February 2025

work page 2025

[27] [27]

deepseek-ai/deepseek-llm-7b-chat · Hugging Face, August 2024

work page 2024

[28] [28]

microsoft/Phi-3-mini-4k-instruct · Hugging Face, January 2025

work page 2025

[29] [29]

tiiuae/Falcon3-7B-Instruct · Hugging Face, February 2025

work page 2025

[30] [30]

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations, October 2023. arXiv:2310.07849 [cs]

work page arXiv 2023

[31] [31]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling Synthetic Data Creation with 1,000,000,000 Personas, September 2024. arXiv:2406.20094 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

work page 2023

[33] [33]

M. O. Riedl and R. M. Young. Narrative Planning: Balancing Plot and Character. Journal of Artificial Intelligence Research, 39:217–268, September 2010

work page 2010

[34] [34]

A Comparative Study of Quality Evaluation Methods for Text Summarization

Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, and Junhua Ding. A Comparative Study of Quality Evaluation Methods for Text Summarization, June 2024. arXiv:2407.00747 [cs]

work page arXiv 2024

[35] [35]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models, January 2020. arXiv:2001.08361 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2020

[36] [36]

Datasheets for datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, December 2021

work page 2021

[37] [37]

klusai/tinyfabulist

klusai/tinyfabulist, April 2025. original-date: 2025-01-07T20:20:42Z

work page 2025

[38] [38]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Com...

work page 2016

[40] [40]

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A. Raffel. Scaling Data-Constrained Language Models. Advances in Neural Information Processing Systems, 36:50358–50376, December 2023

work page 2023

[41] [41]

Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences

Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, and Yejin Choi. Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences, December 2020. arXiv:2012.15738 [cs]

work page arXiv 2020

[42] [42]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023. arXiv:2306.05685 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

AI Chip - AWS Inferentia - AWS

work page

[44] [44]

NVIDIA A100 GPUs Power the Modern Data Center

work page

[45] [45]

NVIDIA L4 Tensor Core GPU.

A Hardware and Environment Configurations

We benchmarked inference for the TF1-EN-3M dataset under several GPU configurations using Llama-3.1-8B-Instruct with identical prompts and decoding settings. All experiments were executed on Hugging Face Inference Endpoints; the hourly tariffs advertised by Hugging Face in April 2025 we...

work page 2025