BioGen: Automated Biography Generation

Ayush Garg; Heer Ambavi; Jayesh Choudhari; Mayank Singh; Mridul Sharma; Nitiksha; Rohit Sharma

arxiv: 1906.11405 · v1 · pith:KUE6N5PWnew · submitted 2019-06-27 · 💻 cs.DL

Pith reviewed 2026-05-25 14:20 UTC · model grok-4.3

keywords: biography generation · automatic text generation · Wikipedia · natural language processing · event clustering · biographical sentences · encyclopedic content

The pith

BioGen generates short collections of biographical sentences clustered by life events; the paper's evaluation reports that these are significantly closer to manually written Wikipedia biographies, which are themselves subject to curation delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BioGen as an automatic framework for creating biographies that describe education, work, relationships, and other life events. It addresses the delays inherent in manual Wikipedia curation by clustering generated sentences into multiple event categories. The central result, as stated, is that these automated outputs are significantly closer to manually written Wikipedia biographies; the abstract does not say what they are closer than. A working implementation is provided online for direct use. This matters because it offers a scalable way to expand and update encyclopedic coverage without relying solely on human editors.

Core claim

BioGen is an automatic biography generation framework that produces a short collection of biographical sentences clustered into multiple events of life, and evaluation results show that biographies generated by BioGen are significantly closer to manually written biographies in Wikipedia.

What carries the argument

BioGen, the automatic biography generation framework that clusters biographical sentences into life-event groups.
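
The abstract does not say how BioGen forms its life-event groups, so the following is only a minimal sketch of the general idea of bucketing sentences by event. The cue-word lists, the `EVENT_CUES` table, and the `group_by_event` helper are all hypothetical stand-ins, not the authors' method.

```python
from collections import defaultdict

# Hypothetical cue words per life event; these are NOT taken from the paper.
EVENT_CUES = {
    "education": {"studied", "graduated", "degree", "university", "school"},
    "work": {"worked", "joined", "career", "appointed", "founded"},
    "relationships": {"married", "wife", "husband", "son", "daughter"},
}


def group_by_event(sentences):
    """Assign each sentence to the event whose cue words it matches most."""
    groups = defaultdict(list)
    for sentence in sentences:
        # Lowercase tokens with trailing punctuation stripped.
        tokens = {t.strip(".,;:").lower() for t in sentence.split()}
        scores = {event: len(tokens & cues) for event, cues in EVENT_CUES.items()}
        best = max(scores, key=scores.get)
        # Sentences that match no cue word go to a catch-all bucket.
        groups[best if scores[best] > 0 else "other"].append(sentence)
    return dict(groups)


sentences = [
    "She graduated from the university in 1990.",
    "She joined the company and founded a research lab.",
    "She married in 1995.",
    "She enjoys long walks.",
]
groups = group_by_event(sentences)
```

A real system would more plausibly learn the groups from data than rely on fixed keyword lists; the sketch only illustrates what "clustered into multiple events of life" could look like as output.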

If this is right

Wikipedia-style biographies for newly prominent individuals could be produced without waiting for manual curation.
Existing biographies could be extended with new life events using the same clustering process.
The framework could reduce the overall backlog of uncovered notable people in encyclopedias.
Automated updates become feasible whenever new verifiable events occur.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The event-clustering step might transfer to generating other structured narrative texts such as timelines or career summaries.
Pairing BioGen with live data feeds could keep biographies current in near real time.
Similar sentence-clustering techniques could apply to domains outside biography, such as company histories or scientific career overviews.

Load-bearing premise

The comparison metrics and data sources used in the evaluation accurately capture closeness to Wikipedia biographies without introducing bias from the choice of baselines or test cases.

What would settle it

A controlled study in which human raters consistently judge BioGen outputs as less similar in style, accuracy, or completeness to Wikipedia biographies than the paper's automatic metrics indicate would falsify the closeness claim.

Figures

Figures reproduced from arXiv: 1906.11405 by Ayush Garg, Heer Ambavi, Jayesh Choudhari, Mayank Singh, Mridul Sharma, Nitiksha, Rohit Sharma.

**Figure 4.** Change in ROUGE score with changing ratio of lengths of BioGen generated and Wikipedia biographies.
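
To make the length dependence in Figure 4 concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) on whitespace tokens. It is an illustration under simplifying assumptions (no stemming, no stopword handling), and the page does not state which ROUGE variant or implementation the paper used. The `rouge_1` helper and the example sentences are invented for the illustration.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example texts: one candidate shorter and one longer than the reference.
reference = "he was born in 1942 and studied science before joining films"
short_bio = "he was born in 1942"
long_bio = reference + " and later he entered politics for a short time"

short_scores = rouge_1(short_bio, reference)
long_scores = rouge_1(long_bio, reference)
```

In this toy case the shorter candidate has perfect precision but low recall, while the longer one has perfect recall but lower precision, which is one reason a ROUGE score can shift as the ratio of generated to reference length changes.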

Original abstract

A biography of a person is the detailed description of several life events including his education, work, relationships, and death. Wikipedia, the free web-based encyclopedia, consists of millions of manually curated biographies of eminent politicians, film and sports personalities, etc. However, manual curation efforts, even though efficient, suffers from significant delays. In this work, we propose an automatic biography generation framework BioGen. BioGen generates a short collection of biographical sentences clustered into multiple events of life. Evaluation results show that biographies generated by BioGen are significantly closer to manually written biographies in Wikipedia. A working model of this framework is available at nlpbiogen.herokuapp.com/home/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BioGen names a biography generator and claims closer Wikipedia matches, but the abstract (and thus the paper's core) supplies no method, metrics, or baselines, leaving the result uncheckable.

The paper's main contribution is a named system, BioGen, that turns out short biographical sentences grouped by life events like education or work. It points to a live demo. That's the concrete part. Everything else is missing from the abstract and the stress-test note flags the same gap: no description of how the sentences are produced, what similarity measure is used, what baselines are compared against, or how the test set was chosen to avoid leakage from Wikipedia itself. The claim that the outputs are 'significantly closer' therefore sits without evidence that can be inspected or reproduced. If the full text adds those details it would change the picture, but on what is shown the evaluation protocol is unspecified. This is not a minor omission; it is the load-bearing part of the result. The work is a narrow automation sketch rather than a tested advance. It might interest someone who wants a quick demo for generating short person summaries, but it does not give researchers a method they can extend or a result they can cite with confidence. I would not bring it to a reading group and would not cite it. It does not look ready for peer review because there is no technical substance or verifiable evaluation to referee.

Referee Report

1 major / 1 minor

Summary. The paper proposes BioGen, an automatic biography generation framework that produces a short collection of biographical sentences clustered into multiple life events. It claims that evaluation results show the generated biographies are significantly closer to manually written Wikipedia biographies, and provides a demo URL.

Significance. If the evaluation protocol were properly specified and sound, the work would address delays in manual biography curation and offer a practical contribution to automated content generation in digital libraries and encyclopedias. The current manuscript, however, provides no basis for assessing whether this contribution is realized.

major comments (1)

Abstract: the central claim that 'biographies generated by BioGen are significantly closer to manually written biographies in Wikipedia' is unsupported because the manuscript provides no description of the similarity metric, baselines, test set construction, held-out data, or statistical testing. Without these elements the empirical result cannot be verified or reproduced.
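
As an illustration of the kind of statistical testing this comment asks for, the sketch below runs a paired bootstrap over per-biography scores from two systems. It is one common option, not a procedure described in the manuscript; the `paired_bootstrap` helper and the score lists are hypothetical placeholders, not results from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A's total score does not beat B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per test biography")
    rng = random.Random(seed)
    n = len(scores_a)
    # Per-item differences keep the pairing between the two systems.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Placeholder numbers for illustration only.
system_scores = [0.5, 0.6, 0.7]
baseline_scores = [0.4, 0.5, 0.6]
p_value = paired_bootstrap(system_scores, baseline_scores)
p_tied = paired_bootstrap(system_scores, system_scores)
```

A small returned value suggests the advantage of the first system is unlikely to be an artifact of the particular test biographies sampled; reporting it alongside the named metric and baselines would make a "significantly closer" claim checkable.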

minor comments (1)

Abstract: 'manual curation efforts, even though efficient, suffers from significant delays' contains a subject-verb agreement error.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. We address the major comment below and will revise the manuscript accordingly.

Referee: Abstract: the central claim that 'biographies generated by BioGen are significantly closer to manually written biographies in Wikipedia' is unsupported because the manuscript provides no description of the similarity metric, baselines, test set construction, held-out data, or statistical testing. Without these elements the empirical result cannot be verified or reproduced.

Authors: We agree that the abstract (and evaluation section) lacks explicit details on the similarity metric, baselines, test set construction, held-out data, and statistical testing, which prevents verification and reproduction of the central claim. The manuscript describes an evaluation but does not specify these elements sufficiently. We will revise the manuscript to add a complete description of the evaluation protocol, including the similarity metric, baselines, test set details, held-out data usage, and statistical tests, and will update the abstract to summarize these elements.

revision: yes

Circularity Check

0 steps flagged

No circularity: paper describes an NLP system with no derivations or predictions to inspect.

The manuscript presents BioGen as an automatic biography generation framework whose central claim is an empirical evaluation result (biographies are 'significantly closer' to Wikipedia). No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or the described structure. The patterns enumerated for circularity (self-definitional claims, fitted inputs called predictions, uniqueness theorems, ansatzes smuggled via citation, etc.) have no matching instances because the work contains no mathematical derivation chain at all. The evaluation claim may be under-specified, but that is a correctness/verifiability issue, not circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 3 internal anchors

[1]

Massih R. Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS’09). Curran Associates Inc., USA, 28–36. http://dl.acm.org/citation.cfm?id=2984093.2984097

work page arXiv 2009
[2]

Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2001. Sentence Ordering in Multidocument Summarization. In Proceedings of the First International Conference on Human Language Technology Research (HLT ’01). Association for Computational Linguistics, Stroudsburg, PA, USA, 1–7. DOI: http://dx.doi.org/10.3115/1072133.1072217

work page doi:10.3115/1072133.1072217 2001
[3]

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using wikipedia. Proceedings of ACL-08: HLT (2008), 807–815

work page 2008
[4]

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python (1st ed.). O’Reilly Media, Inc

work page 2009
[5]

Elena Filatova and John Prager. 2005. Tell Me What You Do and I’ll Tell You What You Are: Learning Occupation-related Activities for Biographies. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05). Association for Computational Linguistics, Stroudsburg, PA, USA, 113–120. DOI: http:...

work page doi:10.3115/1220575.1220590 2005
[6]

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing

work page 2004
[9]

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en

work page 2010
[10]

Liang Zhou, Miruna Ticrea, and Eduard Hovy. 2005. Multi-document biography summarization. arXiv preprint cs/0501078 (2005).

work page internal anchor Pith review Pith/arXiv arXiv 2005
