Automatic Generation of Titles for Research Papers Using Language Models
Pith reviewed 2026-06-28 06:42 UTC · model grok-4.3
The pith
Fine-tuned PEGASUS-large generates research paper titles from abstracts that score higher than those of fine-tuned LLaMA-3-8B or zero-shot GPT-3.5-turbo on most standard metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that fine-tuned PEGASUS-large outperforms fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo on most of the reported metrics (ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore) when generating titles from abstracts on the CSPubSum, LREC-COLING-2024, and new SpringerSSAT datasets, and that the resulting titles are generally appropriate and reliable.
What carries the argument
Fine-tuned PEGASUS-large applied to abstract-to-title generation, evaluated by overlap and embedding-based metrics.
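To make "overlap metric" concrete, here is a minimal sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is a simplification, not the paper's evaluation code: the function name `rouge1_f1`, the lowercase whitespace tokenization, and the two example titles are assumptions made for illustration, and standard ROUGE packages add steps such as stemming that are omitted here.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference title and a generated title."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often
    # as it appears in both titles.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical titles, not taken from any of the paper's datasets.
author_title = "Automatic generation of titles for research papers"
model_title = "Generating titles for research papers automatically"
print(round(rouge1_f1(author_title, model_title), 3))
```

A score of this kind rewards shared words only, which is why the review treats embedding-based metrics such as BERTScore as a separate family.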
If this is right
- Authors in computer science and social sciences can use the fine-tuned PEGASUS model as a practical assistant when drafting titles.
- The SpringerSSAT dataset provides additional training material for future title-generation work in the social sciences.
- Zero-shot prompting of GPT-3.5-turbo is shown to be less competitive than fine-tuned smaller models on the chosen metrics.
- ChatGPT can be prompted to produce creative title alternatives that differ from the more literal outputs of the fine-tuned models.
Where Pith is reading between the lines
- The same fine-tuning approach could be tested on generating titles from other sections such as conclusions or introductions.
- If automatic metrics correlate poorly with human preference, future work would need to collect direct human ratings to guide model selection.
- The finding that fine-tuned open models beat zero-shot large models suggests similar patterns may hold for other short-text generation tasks in academic writing.
Load-bearing premise
That automatic metrics such as ROUGE and BERTScore serve as reliable stand-ins for whether human readers or domain experts would judge the generated titles as appropriate, clear, and appealing.
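One way to probe this premise is to compute a rank correlation between automatic scores and human ratings for the same titles. The sketch below implements Spearman's rank correlation with average ranks for ties, using only the standard library. The helper names and the idea of pairing one metric score with one human rating per title are assumptions made here; the paper reports no such ratings.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

A value near 1 would support using the metric as a stand-in for human judgment on this task; a value near 0 would undercut the premise.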
What would settle it
A human evaluation study in which domain experts rate the titles produced by each model and the model ranked highest by automatic metrics receives lower average scores for appropriateness or appeal than a lower-ranked model.
Original abstract
The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes generating research paper titles from abstracts using open-weight pre-trained models (fine-tuned PEGASUS-large and LLaMA-3-8B) and zero-shot GPT-3.5-turbo. It introduces the SpringerSSAT dataset curated from Springer social science journals, alongside CSPubSum and LREC-COLING-2024. Performance is measured with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore; the central claim is that fine-tuned PEGASUS-large outperforms the other models across most metrics. The work also notes that ChatGPT produces creative titles and concludes that AI-generated titles are generally appropriate and reliable.
Significance. If the automatic-metric results are shown to track human judgments of title quality, the work would provide a practical demonstration that fine-tuned sequence-to-sequence models can assist title selection, together with a new public dataset (SpringerSSAT) that expands coverage to the social sciences. The use of open-weight models and explicit comparison to a strong zero-shot baseline are positive features that support reproducibility.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments section: the headline claim that fine-tuned PEGASUS-large 'outperforms other models across most metrics' is presented without any description of training hyperparameters, data splits, random seeds, or statistical significance tests. This absence makes it impossible to determine whether the reported gains are robust or could be artifacts of a single run.
- [Evaluation] Evaluation section: the paper relies exclusively on ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore as evidence of title quality. No human evaluation or correlation study is reported to show that higher scores on these metrics correspond to titles judged clearer, more specific, or more appealing by domain experts or readers—the very properties the abstract states titles must convey.
minor comments (2)
- [Abstract] The abstract states that 'AI-generated titles are generally appropriate and reliable' without quantifying what fraction of outputs were inspected or by what criteria appropriateness was judged.
- [Datasets] Dataset construction details for SpringerSSAT (e.g., filtering criteria, number of papers per journal, train/dev/test splits) are referenced but not fully specified in the provided text.
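The requests above for documented splits and random seeds can be illustrated with a small seeded partition of abstract-title pairs. This is a hedged sketch only: the function name `split_pairs`, the 10% dev and test fractions, and the seed value are hypothetical choices, not the splits used for CSPubSum, LREC-COLING-2024, or SpringerSSAT.

```python
import random


def split_pairs(pairs, dev_fraction=0.1, test_fraction=0.1, seed=13):
    """Shuffle with a fixed seed and cut into train, dev, and test lists."""
    if dev_fraction < 0 or test_fraction < 0 or dev_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Reporting the seed and fractions alongside the results lets a reader regenerate the same partitions and rerun the comparison.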
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the major points below and will revise the manuscript to enhance reproducibility and transparency.
Point-by-point responses
- Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim that fine-tuned PEGASUS-large 'outperforms other models across most metrics' is presented without any description of training hyperparameters, data splits, random seeds, or statistical significance tests. This absence makes it impossible to determine whether the reported gains are robust or could be artifacts of a single run.
  Authors: We agree that these experimental details are necessary for assessing robustness. In the revised manuscript we will expand the Experiments section with a full description of training hyperparameters, data splits (including how the train/validation/test partitions were created for each dataset), random seeds, and statistical significance testing (e.g., bootstrap or paired tests) between model outputs.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the paper relies exclusively on ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore as evidence of title quality. No human evaluation or correlation study is reported to show that higher scores on these metrics correspond to titles judged clearer, more specific, or more appealing by domain experts or readers, which are the very properties the abstract states titles must convey.
  Authors: We recognize that automatic metrics alone do not fully capture title quality. We will add an explicit limitations paragraph that discusses the reliance on automatic metrics, cites prior work on their correlation with human judgments in summarization and title-generation settings, and includes additional qualitative examples comparing titles produced by each model. A dedicated human evaluation study lies outside the scope of the present work and is noted as future research.
  Revision: partial
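The rebuttal mentions "bootstrap or paired tests" without further detail. As one possible reading, the sketch below shows paired bootstrap resampling over per-example metric scores for two systems. The function name, the default of 10,000 resamples, and the seed are assumptions made here, and nothing in the review indicates that this is the procedure the authors will adopt.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which system A does NOT beat system B.

    scores_a[i] and scores_b[i] must be scores for the same test example.
    A small return value suggests A's advantage is unlikely to be an
    artifact of which test examples happened to be sampled.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        total_diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if total_diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

This addresses variation across test examples only; variation across training runs would additionally require repeating fine-tuning with several random seeds.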
Circularity Check
No circularity in derivation chain
Full rationale
The paper evaluates fine-tuned models and zero-shot GPT-3.5-turbo on title generation using standard automatic metrics (ROUGE, METEOR, MoverScore, BERTScore, SciBERTScore) against reference titles from external datasets (CSPubSum, LREC-COLING-2024, and the newly introduced SpringerSSAT). No load-bearing steps reduce any prediction or result to the paper's own inputs by construction, self-definition, or self-citation chains. The outperformance claim is an empirical comparison, not a tautology, and the derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Pre-trained language models can be fine-tuned on abstract-title pairs to improve title generation performance.