Texygen: A Benchmarking Platform for Text Generation Models

Jiaxian Guo; Jun Wang; Lei Zheng; Sidi Lu; Weinan Zhang; Yaoming Zhu; Yong Yu

Not yet reviewed by Pith; the record is open.

arxiv 1802.01886 v1 pith:IY32JLVD submitted 2018-02-06 cs.CL cs.IR cs.LG

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, Yong Yu

classification cs.CL cs.IR cs.LG

keywords generation, text, texygen, models, platform, research, benchmarking, help

Abstract

We introduce Texygen, a benchmarking platform to support research on open-domain text generation models. Texygen has not only implemented a majority of text generation models, but also covered a set of metrics that evaluate the diversity, the quality and the consistency of the generated texts. The Texygen platform could help standardize the research on text generation and facilitate the sharing of fine-tuned open-source implementations among researchers for their work. As a consequence, this would help in improving the reproducibility and reliability of future research work in text generation.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Before and After Temperature: A Distributional View of Creative LLM Generation
cs.CL 2026-05 unverdicted novelty 7.0
A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 ...

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
cs.CL 2026-06 unverdicted novelty 6.0
Decan (D_Ca_n = C × a_n) measures text diversity as progressive conditional surprise from base LM log-probabilities, scoring 0.846 OCA on McDiv benchmark and detecting monotonic diversity drop across base→SFT→DPO→RLVR stages.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

Evaluating Computational Language Models with Scaling Properties of Natural Language
cs.CL 2019-06 unverdicted novelty 5.0
Only gated RNN language models reproduce the long-range correlation scaling of natural language among tested models, with Taylor's law exponent serving as a quality indicator.