
arxiv: 2606.05085 · v1 · pith:5SBTPGD3new · submitted 2026-06-03 · 💻 cs.CL · cs.AI

Automatic Generation of Titles for Research Papers Using Language Models

Pith reviewed 2026-06-28 06:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords title generation · language models · PEGASUS · abstracts · research papers · fine-tuning · automatic evaluation metrics · SpringerSSAT dataset

The pith

Fine-tuned PEGASUS-large scores higher than fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo on most standard automatic metrics when generating research paper titles from abstracts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be applied to create titles directly from paper abstracts, with a new dataset added to existing ones for training and testing. It compares several models and finds that fine-tuning PEGASUS-large yields the strongest results on automatic scores while zero-shot GPT-3.5-turbo lags behind. The work shows that such generated titles are generally appropriate and that ChatGPT can also produce creative variants. A sympathetic reader would care because selecting a clear title is a recurring difficulty for authors, and reliable automation could reduce that effort.

Core claim

The central claim is that fine-tuned PEGASUS-large outperforms fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo on most of the reported metrics (ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore) when generating titles from abstracts on the CSPubSum, LREC-COLING-2024, and new SpringerSSAT datasets, and that the resulting titles are generally appropriate and reliable.

What carries the argument

Fine-tuned PEGASUS-large applied to abstract-to-title generation, evaluated by overlap and embedding-based metrics.
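To make "overlap metrics" concrete, here is a minimal sketch of a ROUGE-1-style F1 score between a generated title and the author-written reference. It is an illustration only: the function name `rouge1_f1` is hypothetical, tokenization is plain lowercased whitespace splitting, and there is no stemming, so its numbers will not match the official ROUGE package or the paper's evaluation setup.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate title and a reference title.

    Assumption: lowercased whitespace tokens, no stemming or stopword handling.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "Automatic title generation with language models"
    reference = "Automatic generation of titles using language models"
    print(round(rouge1_f1(generated, reference), 3))
```

Embedding-based metrics such as BERTScore replace the exact token match with similarity between contextual embeddings, which is why the two families can rank the same titles differently.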

If this is right

  • Authors in computer science and social sciences can use the fine-tuned PEGASUS model as a practical assistant when drafting titles.
  • The SpringerSSAT dataset provides additional training material for future title-generation work in the social sciences.
  • Zero-shot prompting of GPT-3.5-turbo is shown to be less competitive than fine-tuned smaller models on the chosen metrics.
  • ChatGPT can be prompted to produce creative title alternatives that differ from the more literal outputs of the fine-tuned models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning approach could be tested on generating titles from other sections such as conclusions or introductions.
  • If automatic metrics correlate poorly with human preference, future work would need to collect direct human ratings to guide model selection.
  • The finding that fine-tuned open models beat zero-shot large models suggests similar patterns may hold for other short-text generation tasks in academic writing.

Load-bearing premise

That automatic metrics such as ROUGE and BERTScore serve as reliable stand-ins for whether human readers or domain experts would judge the generated titles as appropriate, clear, and appealing.
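One way to probe this premise is to compare metric scores with human ratings for the same titles using a rank correlation. The sketch below computes Spearman's rho with average ranks for ties, using only the standard library. The helper names and any input data are hypothetical; the paper reports no such ratings.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_ratings)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 would support using the metric as a proxy for the human judgment in question; a value near 0 would suggest the metric ranking says little about what readers prefer.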

What would settle it

A human evaluation study in which domain experts rate the titles produced by each model and the model ranked highest by automatic metrics receives lower average scores for appropriateness or appeal than a lower-ranked model.

Original abstract

The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes generating research paper titles from abstracts using open-weight pre-trained models (fine-tuned PEGASUS-large and LLaMA-3-8B) and zero-shot GPT-3.5-turbo. It introduces the SpringerSSAT dataset curated from Springer social science journals, alongside CSPubSum and LREC-COLING-2024. Performance is measured with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore; the central claim is that fine-tuned PEGASUS-large outperforms the other models across most metrics. The work also notes that ChatGPT produces creative titles and concludes that AI-generated titles are generally appropriate and reliable.

Significance. If the automatic-metric results are shown to track human judgments of title quality, the work would provide a practical demonstration that fine-tuned sequence-to-sequence models can assist title selection, together with a new public dataset (SpringerSSAT) that expands coverage to the social sciences. The use of open-weight models and explicit comparison to a strong zero-shot baseline are positive features that support reproducibility.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline claim that fine-tuned PEGASUS-large 'outperforms other models across most metrics' is presented without any description of training hyperparameters, data splits, random seeds, or statistical significance tests. This absence makes it impossible to determine whether the reported gains are robust or could be artifacts of a single run.
  2. [Evaluation] Evaluation section: the paper relies exclusively on ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore as evidence of title quality. No human evaluation or correlation study is reported to show that higher scores on these metrics correspond to titles judged clearer, more specific, or more appealing by domain experts or readers—the very properties the abstract states titles must convey.
minor comments (2)
  1. [Abstract] The abstract states that 'AI-generated titles are generally appropriate and reliable' without quantifying what fraction of outputs were inspected or by what criteria appropriateness was judged.
  2. [Datasets] Dataset construction details for SpringerSSAT (e.g., filtering criteria, number of papers per journal, train/dev/test splits) are referenced but not fully specified in the provided text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major points below and will revise the manuscript to enhance reproducibility and transparency.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim that fine-tuned PEGASUS-large 'outperforms other models across most metrics' is presented without any description of training hyperparameters, data splits, random seeds, or statistical significance tests. This absence makes it impossible to determine whether the reported gains are robust or could be artifacts of a single run.

    Authors: We agree that these experimental details are necessary for assessing robustness. In the revised manuscript we will expand the Experiments section with a full description of training hyperparameters, data splits (including how the train/validation/test partitions were created for each dataset), random seeds, and statistical significance testing (e.g., bootstrap or paired tests) between model outputs. revision: yes

  2. Referee: [Evaluation] Evaluation section: the paper relies exclusively on ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore as evidence of title quality. No human evaluation or correlation study is reported to show that higher scores on these metrics correspond to titles judged clearer, more specific, or more appealing by domain experts or readers—the very properties the abstract states titles must convey.

    Authors: We recognize that automatic metrics alone do not fully capture title quality. We will add an explicit limitations paragraph that discusses the reliance on automatic metrics, cites prior work on their correlation with human judgments in summarization and title-generation settings, and includes additional qualitative examples comparing titles produced by each model. A dedicated human evaluation study lies outside the scope of the present work and is noted as future research. revision: partial
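The bootstrap testing mentioned in the first response could look roughly like the following paired bootstrap over per-example metric scores. This is a sketch under stated assumptions, not the authors' procedure: the function name `paired_bootstrap`, the resample count, and the seed are all hypothetical choices, and the inputs are assumed to be scores for the same test examples in the same order.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's total beats system B's.

    scores_a[i] and scores_b[i] must be scores for the same test example.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping A/B pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples
```

If A beats B in, say, 95% or more of the resamples, the gap is unlikely to be an artifact of which test examples happened to be drawn. This does not cover variance across training runs, which would need repeated fine-tuning with different seeds.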

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper evaluates fine-tuned models and zero-shot GPT-3.5-turbo on title generation using standard automatic metrics (ROUGE, METEOR, MoverScore, BERTScore, SciBERTScore) against reference titles from external datasets (CSPubSum, LREC-COLING-2024, and the newly introduced SpringerSSAT). No load-bearing steps reduce any prediction or result to the paper's own inputs by construction, self-definition, or self-citation chains. The outperformance claim is an empirical comparison, not a tautology, and the derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that fine-tuning on abstract-title pairs transfers to good title generation and that automatic metrics correlate with human notions of title quality; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Pre-trained language models can be fine-tuned on abstract-title pairs to improve title generation performance.
    This premise justifies the use of fine-tuned PEGASUS and LLaMA models in the experiments.

pith-pipeline@v0.9.1-grok · 5717 in / 1167 out tokens · 48417 ms · 2026-06-28T06:42:02.684049+00:00 · methodology

