NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
Pith reviewed 2026-05-23 02:48 UTC · model grok-4.3
The pith
A dataset pairing scientific talk recordings with their abstracts enables training of models to generate summaries from speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NUTSHELL is a multimodal dataset of *ACL conference talks paired with their corresponding abstracts. Training speech-to-abstract generation models on NUTSHELL produces measurable gains in output quality over approaches that lack such paired data, as measured by automatic metrics and human judgments, and the dataset also reveals ongoing challenges in the task.
What carries the argument
The NUTSHELL dataset of aligned talk recordings and abstracts, which supplies training examples for speech-to-abstract generation models.
If this is right
- Models trained on the paired data outperform those trained without access to such alignments.
- Both automatic metrics and human evaluations can be used to assess the quality of generated abstracts from talks.
- The open release of the dataset supports development of improved models and evaluation methods for the task.
- The work identifies specific difficulties that remain in generating abstracts directly from spoken scientific presentations.
Where Pith is reading between the lines
- The same pairing approach could be applied to talks from other scientific conferences or disciplines to expand available training data.
- Models trained on this data might later be adapted to generate summaries from other forms of spoken scientific content such as lectures or seminars.
- Wider use of such generated abstracts could reduce the time researchers spend deciding which conference talks to watch in full.
Load-bearing premise
The collected talks and abstracts can be reliably paired at sufficient scale and quality to serve as effective training data for speech-to-abstract generation models.
What would settle it
A test in which models trained on NUTSHELL produce abstracts that are no better, by automatic metrics or human judgment, than models trained on unrelated text data when both are evaluated on the same set of held-out talks.
Original abstract
Scientific communication is receiving increasing attention in natural language processing, especially to help researchers access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NUTSHELL, a multimodal dataset of *ACL conference talks paired with their corresponding abstracts, to support the Speech-to-Abstract Generation (SAG) task. It establishes baseline models for SAG, evaluates generated abstracts using automatic metrics and human judgments, highlights the challenges of SAG, and claims that training on NUTSHELL yields measurable benefits. The dataset is released under the CC-BY 4.0 license.
Significance. If the dataset consists of reliably paired, large-scale, high-quality examples, the release could provide a useful resource for training and benchmarking models in scientific communication and multimodal summarization. The combination of automatic metrics and human evaluation is a methodological strength, and the open license supports reproducibility and follow-on work.
major comments (1)
- [NUTSHELL dataset construction] The dataset construction (described in the section introducing NUTSHELL) provides no details on sourcing of talks and abstracts, the pairing/matching procedure (e.g., title/author overlap, manual alignment, temporal synchronization), filtering criteria, or resulting statistics such as number of pairs, average talk length, or abstract fidelity. This information is load-bearing for the central claim that training on NUTSHELL produces measurable benefits over prior approaches.
minor comments (1)
- [Abstract] The abstract states that baselines are established and benefits are demonstrated but reports no quantitative results, dataset size, or key statistics; adding these would improve immediate readability without altering the core contribution.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting the need for greater transparency in the NUTSHELL dataset construction. We agree that these details are essential to support the central claims and will expand the relevant section in the revised manuscript.
-
Referee: [NUTSHELL dataset construction] The dataset construction (described in the section introducing NUTSHELL) provides no details on sourcing of talks and abstracts, the pairing/matching procedure (e.g., title/author overlap, manual alignment, temporal synchronization), filtering criteria, or resulting statistics such as number of pairs, average talk length, or abstract fidelity. This information is load-bearing for the central claim that training on NUTSHELL produces measurable benefits over prior approaches.
Authors: We acknowledge that the current manuscript provides insufficient detail on these aspects of dataset construction. In the revised version we will expand the NUTSHELL introduction section to describe: (1) sourcing of video recordings and abstracts from ACL conferences via official archives and the ACL Anthology; (2) the pairing procedure, which relies on title/author overlap followed by manual verification of temporal alignment between talk segments and abstract content; (3) filtering criteria including minimum talk duration, abstract length, and exclusion of non-English or low-quality recordings; and (4) key statistics such as the total number of pairs, average talk length in minutes, and quantitative measures of abstract fidelity (e.g., ROUGE overlap with talk transcripts). These additions will directly substantiate the reported training benefits.
Revision: yes
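The title-overlap step named in the simulated rebuttal can be illustrated with a minimal sketch. Nothing here is the authors' actual pipeline: the record keys (`id`, `title`), the 0.9 threshold, and the function names are hypothetical, and author overlap and manual verification are left out.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase the title and replace punctuation with single spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def pair_by_title(talks, papers, threshold=0.9):
    """Pair each talk with the paper whose normalized title is most similar.

    `talks` and `papers` are lists of dicts with hypothetical keys
    'id' and 'title'. Talks whose best match falls below `threshold`
    are left unpaired (candidates for manual review).
    """
    pairs = []
    for talk in talks:
        best_id, best_score = None, 0.0
        for paper in papers:
            score = SequenceMatcher(
                None, normalize(talk["title"]), normalize(paper["title"])
            ).ratio()
            if score > best_score:
                best_id, best_score = paper["id"], score
        if best_score >= threshold:
            pairs.append((talk["id"], best_id, best_score))
    return pairs
```

A high threshold favors precision over recall, which seems the safer default when unmatched talks can still be checked by hand.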
Circularity Check
No circularity: dataset release with independent baselines
The paper introduces the NUTSHELL dataset and reports baseline SAG results using automatic metrics and human judgments. No mathematical derivations, parameter fittings, predictions, or uniqueness theorems appear anywhere in the text. The contribution is a data release plus empirical evaluation; claims about benefits of training on NUTSHELL rest on the reported baselines rather than any reduction to self-citations, fitted inputs, or definitional equivalences. Dataset construction details are described as load-bearing but do not constitute circularity under the specified patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
International Speech Communication Association. Publisher Copyright: Copyright © 2021 ISCA; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021; Conference date: 30-08-2021 through 03-09-2021. Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin...
-
[2]
Qwen2-audio technical report. Preprint, arXiv:2407.10759. Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computationa...
-
[3]
A supervised approach to extractive summarisation of scientific papers. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 195–205, Vancouver, Canada. Association for Computational Linguistics. Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An open-source library for using and developing summariza...
-
[4]
Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828, Brussels, Belgium. Association for Computational Linguistics. Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge G...
-
[5]
Robust Speech Recognition via Large-Scale Weak Supervision
Generating and validating abstracts of meeting conversations: a user study. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervis...
-
[6]
SLUE phase-2: A benchmark suite of diverse spoken language understanding tasks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8906–8937, Toronto, Canada. Association for Computational Linguistics. Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, and Simone Ponzetto. 2024...
-
[7]
Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2024. Judging llm-as-a-judge with mt-bench and...
-
[8]
We segment the audio into one-minute chunks, encode each chunk using the encoder, and then concatenate the encoded representations before passing them through the adapter and LLM backbone.
-
[9]
We use a batch size of 1 for fine-tuning with NUTSHELL. Despite these adjustments, we encountered memory limitations for audio files exceeding 35 minutes. In such cases, we truncate the audio to 35 minutes, which affects one example in the test set. The training of the models was conducted on four NVIDIA A100-SXM4-40GB GPUs. The contrastive pretrainin...
-
[10]
and BERTScore (Zhang et al., 2020).

| Model | RougeL F1 ↑ | BERTScore F1 ↑ | Llama3.1-7B-Instruct Score with Expl. ↑ | Llama3.1-7B-Instruct Plain Score ↑ | Avg. Rank ↓ |
|---|---|---|---|---|---|
| Whisper + LLama31-Instruct | 23.26 | 86.81 | 77.75 | 84.30 | 1.23 |
| Qwen2-Audio | 16.26 | 84.94 | 48.42 | 39.50 | 3.47 |
| End2End Finetuned | 24.47 | 86.71 | 70.67 | 75.73 | 1.83 |

Table 6: Baseline Results, the finetuned model is a HuBERT + Qformer + LLama31Instru...
-
[11]
as the judge, we obtain the same ranking as with Llama. The results with Qwen-as-a-judge can be found in Table 4.
E Human Evaluation for Model Outputs
We evaluate the models using ROUGE (Lin, 2004), BERTScore (Zhang et al., 2020), and LLM-as-a-judge. However, it is known that automatic evaluati...
Footnote 12: https://huggingface.co/spaces/evaluate-metric/bertscore
-
[12]
**Relevance**: Does the predicted abstract capture the main points of the gold abstract?
-
[13]
**Coherence**: Is the predicted abstract logically organized and easy to follow?
-
[14]
**Conciseness**: Is the predicted abstract free from unnecessary details?
-
[15]
**Factual Accuracy**: Are the claims in the predicted abstract consistent with the gold abstract?
For each criterion:
- Assign a **score** between 1 and 10 (1 = very poor, 10 = excellent).
- Provide a **brief explanation** for the assigned score.
Your output must be in the following JSON format:
{"relevance ...
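The judge prompt excerpted above asks for a 1 to 10 score per criterion, returned as JSON. A minimal sketch of reading such a reply follows. Because the excerpt truncates the JSON format after "relevance", the key names, the nested `{"score": ..., "explanation": ...}` shape, and the function name are all assumptions rather than the paper's schema.

```python
import json

# Criterion names follow the prompt; the exact JSON keys are an assumption,
# since the source truncates the expected format.
CRITERIA = ("relevance", "coherence", "conciseness", "factual_accuracy")


def parse_judge_output(raw: str) -> dict:
    """Extract per-criterion scores from a judge reply and add their mean.

    Raises ValueError when no JSON object is found, the JSON is invalid,
    a criterion is missing, or a score falls outside 1..10.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

    scores = {}
    for name in CRITERIA:
        entry = data.get(name)
        # Accept either {"score": 8, ...} or a bare number (both hypothetical).
        value = entry.get("score") if isinstance(entry, dict) else entry
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric score for {name!r}")
        if not 1 <= value <= 10:
            raise ValueError(f"score for {name!r} is outside 1..10")
        scores[name] = float(value)

    scores["mean"] = sum(scores[name] for name in CRITERIA) / len(CRITERIA)
    return scores
```

Slicing from the first `{` to the last `}` tolerates judges that wrap the JSON in extra prose, at the cost of failing loudly when the reply contains stray braces.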