NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Beatrice Savoldi; Jan Niehues; Luisa Bentivogli; Maike Z\"ufle; Marco Gaido; Sara Papi

arxiv: 2502.16942 · v2 · submitted 2025-02-24 · 💻 cs.CL

NUTSHELL: A Dataset for Abstract Generation from Scientific Talks

Maike Z\"ufle , Sara Papi , Beatrice Savoldi , Marco Gaido , Luisa Bentivogli , Jan Niehues This is my paper

Pith reviewed 2026-05-23 02:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords speech-to-abstract generationmultimodal datasetscientific talksconference abstractssummarizationACL conferencesdataset release

0 comments

The pith

A dataset pairing scientific talk recordings with their abstracts enables training of models to generate summaries from speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NUTSHELL as a collection of recorded *ACL conference talks aligned with the written abstracts that describe those talks. It trains baseline models on this data for the task of turning spoken presentations into abstracts and measures performance with both standard automatic scores and human ratings. The results indicate that access to this paired data produces better generated abstracts than training without it while also exposing remaining difficulties in the generation task. The release of the data under an open license is intended to support further work on helping researchers process conference content more quickly.

Core claim

NUTSHELL is a multimodal dataset of *ACL conference talks paired with their corresponding abstracts. Training speech-to-abstract generation models on NUTSHELL produces measurable gains in output quality over approaches that lack such paired data, as measured by automatic metrics and human judgments, and the dataset also reveals ongoing challenges in the task.

What carries the argument

The NUTSHELL dataset of aligned talk recordings and abstracts, which supplies training examples for speech-to-abstract generation models.

If this is right

Models trained on the paired data outperform those trained without access to such alignments.
Both automatic metrics and human evaluations can be used to assess the quality of generated abstracts from talks.
The open release of the dataset supports development of improved models and evaluation methods for the task.
The work identifies specific difficulties that remain in generating abstracts directly from spoken scientific presentations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairing approach could be applied to talks from other scientific conferences or disciplines to expand available training data.
Models trained on this data might later be adapted to generate summaries from other forms of spoken scientific content such as lectures or seminars.
Wider use of such generated abstracts could reduce the time researchers spend deciding which conference talks to watch in full.

Load-bearing premise

The collected talks and abstracts can be reliably paired at sufficient scale and quality to serve as effective training data for speech-to-abstract generation models.

What would settle it

A test in which models trained on NUTSHELL produce abstracts that are no better, by automatic metrics or human judgment, than models trained on unrelated text data when both are evaluated on the same set of held-out talks.

Figures

Figures reproduced from arXiv: 2502.16942 by Beatrice Savoldi, Jan Niehues, Luisa Bentivogli, Maike Z\"ufle, Marco Gaido, Sara Papi.

**Figure 2.** Figure 2: Prompts for LLM as a judge. We use the same prompt for both, Qwen2-7bInstruct and Llama 3.1 8B [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Instructions for annotators to evaluate whether the paper abstracts are good and informative abstracts for [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Instructions for human annotators for ranking model outputs. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Scientific communication is receiving increasing attention in natural language processing, especially to help researches access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NUTSHELL is a new dataset pairing *ACL talks with abstracts for speech-to-abstract generation, but the value depends on whether the pairing and scale details hold up.

read the letter

The paper's main contribution is the release of NUTSHELL, a multimodal dataset of recorded conference talks matched to their abstracts, aimed at the SAG task. This fills a gap where no such resource existed in the cited literature, and they run baselines with both automatic metrics and human judgments while releasing it under CC-BY 4.0. That part is straightforward and useful for anyone building models in this area. They also note the challenges of the task, which aligns with what one would expect from speech-to-text summarization work. The construction and evaluation steps are the parts that matter most here. The abstract leaves out any numbers on dataset size, talk lengths, matching method, or actual metric scores, which makes it hard to judge if the claimed benefits from training on it are real or just from basic setup. The stress-test concern about reliable pairing is fair based on what's shown; if the full paper has clear sourcing, alignment steps, filtering, and statistics, that would address it directly. If those details are thin or the scale is small, the baselines won't demonstrate much. This paper is for researchers in NLP who need speech-based summarization data or are working on scientific communication tools. A reader looking for new resources to train or benchmark models would find the dataset itself worth checking, even if the experiments stay at baseline level. It deserves peer review because data papers like this can stand on the resource quality rather than novel methods, provided the construction is documented well enough to reproduce or extend.

Referee Report

1 major / 1 minor

Summary. The paper introduces NUTSHELL, a multimodal dataset of *ACL conference talks paired with their corresponding abstracts, to support the Speech-to-Abstract Generation (SAG) task. It establishes baseline models for SAG, evaluates generated abstracts using automatic metrics and human judgments, highlights the challenges of SAG, and claims that training on NUTSHELL yields measurable benefits. The dataset is released under the CC-BY 4.0 license.

Significance. If the dataset consists of reliably paired, large-scale, high-quality examples, the release could provide a useful resource for training and benchmarking models in scientific communication and multimodal summarization. The combination of automatic metrics and human evaluation is a methodological strength, and the open license supports reproducibility and follow-on work.

major comments (1)

[NUTSHELL dataset construction] The dataset construction (described in the section introducing NUTSHELL) provides no details on sourcing of talks and abstracts, the pairing/matching procedure (e.g., title/author overlap, manual alignment, temporal synchronization), filtering criteria, or resulting statistics such as number of pairs, average talk length, or abstract fidelity. This information is load-bearing for the central claim that training on NUTSHELL produces measurable benefits over prior approaches.

minor comments (1)

[Abstract] The abstract states that baselines are established and benefits are demonstrated but reports no quantitative results, dataset size, or key statistics; adding these would improve immediate readability without altering the core contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting the need for greater transparency in the NUTSHELL dataset construction. We agree that these details are essential to support the central claims and will expand the relevant section in the revised manuscript.

read point-by-point responses

Referee: [NUTSHELL dataset construction] The dataset construction (described in the section introducing NUTSHELL) provides no details on sourcing of talks and abstracts, the pairing/matching procedure (e.g., title/author overlap, manual alignment, temporal synchronization), filtering criteria, or resulting statistics such as number of pairs, average talk length, or abstract fidelity. This information is load-bearing for the central claim that training on NUTSHELL produces measurable benefits over prior approaches.

Authors: We acknowledge that the current manuscript provides insufficient detail on these aspects of dataset construction. In the revised version we will expand the NUTSHELL introduction section to describe: (1) sourcing of video recordings and abstracts from ACL conferences via official archives and the ACL Anthology; (2) the pairing procedure, which relies on title/author overlap followed by manual verification of temporal alignment between talk segments and abstract content; (3) filtering criteria including minimum talk duration, abstract length, and exclusion of non-English or low-quality recordings; and (4) key statistics such as the total number of pairs, average talk length in minutes, and quantitative measures of abstract fidelity (e.g., ROUGE overlap with talk transcripts). These additions will directly substantiate the reported training benefits. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with independent baselines

full rationale

The paper introduces the NUTSHELL dataset and reports baseline SAG results using automatic metrics and human judgments. No mathematical derivations, parameter fittings, predictions, or uniqueness theorems appear anywhere in the text. The contribution is a data release plus empirical evaluation; claims about benefits of training on NUTSHELL rest on the reported baselines rather than any reduction to self-citations, fitted inputs, or definitional equivalences. Dataset construction details are described as load-bearing but do not constitute circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard NLP dataset-construction assumptions such as the existence of aligned speech-text pairs from conferences and the utility of conventional automatic metrics plus human judgments for evaluation; no free parameters, new axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5693 in / 1065 out tokens · 39554 ms · 2026-05-23T02:48:13.611771+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1]

International Speech Communication Associa- tion. Publisher Copyright: Copyright © 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021. Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin...

work page 2021
[2]

Qwen2-Audio Technical Report

Qwen2-audio technical report.Preprint, arXiv:2407.10759. Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computa- tiona...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

The Llama 3 Herd of Models

A supervised approach to extractive sum- marisation of scientific papers. InProceedings of the 21st Conference on Computational Natural Lan- guage Learning (CoNLL 2017), pages 195–205, Van- couver, Canada. Association for Computational Lin- guistics. Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An open-source library for using and developing sum- mariza...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

InProceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing, pages 1818–1828, Brussels, Belgium

Content selection in deep learning models of summarization. InProceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing, pages 1818–1828, Brussels, Belgium. Association for Computational Linguistics. Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge G...

work page arXiv 2018
[5]

Robust Speech Recognition via Large-Scale Weak Supervision

Generating and validating abstracts of meeting conversations: a user study. InProceedings of the 6th International Natural Language Generation Con- ference. Association for Computational Linguistics. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak su- pervis...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges.arXiv preprint arXiv:2406.12624, 2024

SLUE phase-2: A benchmark suite of diverse spoken language understanding tasks. InProceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8906–8937, Toronto, Canada. Association for Computational Linguistics. Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, and Simone Ponzetto. 2024...

work page arXiv 2024
[7]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P

Benchmarking large language models for news summarization.Transactions of the Association for Computational Linguistics, 12:39–57. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2024. Judging llm-as-a-judge with mt-bench and...

work page arXiv 2024
[8]

We segment the audio into one-minute chunks, encode each chunk using the encoder and then concatenate the encoded representations be- fore passing them through the adapter and LLM backbone

work page
[9]

Despite these adjustments, we encountered mem- ory limitations for audio files exceeding 35 minutes

We use a batch size of 1 for fine-tuning with NUTSHELL. Despite these adjustments, we encountered mem- ory limitations for audio files exceeding 35 minutes. In such cases, we truncate the audio to 35 minutes, which affects one example in the test set. The training of the models was conducted on four NVIDIA A100-SXM4-40GB GPUs. The con- trastive pretrainin...

work page
[10]

11 10 Model RougeL BERTScore Llama3.1-7B-Instruct F1↑F1↑Score with Expl.↑Plain Score↑Avg

and BERTScore (Zhang et al., 2020). 11 10 Model RougeL BERTScore Llama3.1-7B-Instruct F1↑F1↑Score with Expl.↑Plain Score↑Avg. Rank↓ Whisper + LLama31-Instruct 23.26 86.8177.75 84.30 1.23 Qwen2-Audio 16.26 84.94 48.42 39.50 3.47 End2End Finetuned24.47 86.7170.67 75.73 1.83 Table 6: Baseline Results, the finetuned model is a HuBERT + Qformer + LLama31Instru...

work page 2020
[11]

The results with Qwen-as-a-judge can be found in Table 4

as the judge, we obtain the same ranking as with Llama. The results with Qwen-as-a-judge can be found in Table 4. E Human Evaluation for Model Outputs We evaluate the models using ROUGE (Lin, 2004), BERTScore (Zhang et al., 2020), and LLM-as-a- 12https://huggingface.co/spaces/ evaluate-metric/bertscore judge. However, it is known that automatic evalu- ati...

work page 2004
[12]

** Relevance **: Does the predicted abstract capture the main points of the gold abstract ?\ n

work page
[13]

** Coherence **: Is the predicted abstract logically organized and easy to follow ?\ n

work page
[14]

** Conciseness **: Is the predicted abstract free from unnecessary details ?\ n

work page
[15]

- Provide a ** brief explanation ** for the assigned score .\ n \ n

** Factual Accuracy **: Are the claims in the predicted abstract consistent with the gold abstract ?\ n \ n For each criterion :\ n - Assign a ** score ** between 1 and 10 (1 = very poor , 10 = excellent ) .\ n " - Provide a ** brief explanation ** for the assigned score .\ n \ n " Your output must be in the following JSON format :\ n \ n " {\" relevance ...

work page

[1] [1]

International Speech Communication Associa- tion. Publisher Copyright: Copyright © 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021. Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin...

work page 2021

[2] [2]

Qwen2-Audio Technical Report

Qwen2-audio technical report.Preprint, arXiv:2407.10759. Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computa- tiona...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

The Llama 3 Herd of Models

A supervised approach to extractive sum- marisation of scientific papers. InProceedings of the 21st Conference on Computational Natural Lan- guage Learning (CoNLL 2017), pages 195–205, Van- couver, Canada. Association for Computational Lin- guistics. Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An open-source library for using and developing sum- mariza...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

InProceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing, pages 1818–1828, Brussels, Belgium

Content selection in deep learning models of summarization. InProceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing, pages 1818–1828, Brussels, Belgium. Association for Computational Linguistics. Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge G...

work page arXiv 2018

[5] [5]

Robust Speech Recognition via Large-Scale Weak Supervision

Generating and validating abstracts of meeting conversations: a user study. InProceedings of the 6th International Natural Language Generation Con- ference. Association for Computational Linguistics. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak su- pervis...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Judging the judges: Evaluating alignment and vulner- abilities in llms-as-judges.arXiv preprint arXiv:2406.12624, 2024

SLUE phase-2: A benchmark suite of diverse spoken language understanding tasks. InProceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8906–8937, Toronto, Canada. Association for Computational Linguistics. Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, and Simone Ponzetto. 2024...

work page arXiv 2024

[7] [7]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P

Benchmarking large language models for news summarization.Transactions of the Association for Computational Linguistics, 12:39–57. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2024. Judging llm-as-a-judge with mt-bench and...

work page arXiv 2024

[8] [8]

We segment the audio into one-minute chunks, encode each chunk using the encoder and then concatenate the encoded representations be- fore passing them through the adapter and LLM backbone

work page

[9] [9]

Despite these adjustments, we encountered mem- ory limitations for audio files exceeding 35 minutes

We use a batch size of 1 for fine-tuning with NUTSHELL. Despite these adjustments, we encountered mem- ory limitations for audio files exceeding 35 minutes. In such cases, we truncate the audio to 35 minutes, which affects one example in the test set. The training of the models was conducted on four NVIDIA A100-SXM4-40GB GPUs. The con- trastive pretrainin...

work page

[10] [10]

11 10 Model RougeL BERTScore Llama3.1-7B-Instruct F1↑F1↑Score with Expl.↑Plain Score↑Avg

and BERTScore (Zhang et al., 2020). 11 10 Model RougeL BERTScore Llama3.1-7B-Instruct F1↑F1↑Score with Expl.↑Plain Score↑Avg. Rank↓ Whisper + LLama31-Instruct 23.26 86.8177.75 84.30 1.23 Qwen2-Audio 16.26 84.94 48.42 39.50 3.47 End2End Finetuned24.47 86.7170.67 75.73 1.83 Table 6: Baseline Results, the finetuned model is a HuBERT + Qformer + LLama31Instru...

work page 2020

[11] [11]

The results with Qwen-as-a-judge can be found in Table 4

as the judge, we obtain the same ranking as with Llama. The results with Qwen-as-a-judge can be found in Table 4. E Human Evaluation for Model Outputs We evaluate the models using ROUGE (Lin, 2004), BERTScore (Zhang et al., 2020), and LLM-as-a- 12https://huggingface.co/spaces/ evaluate-metric/bertscore judge. However, it is known that automatic evalu- ati...

work page 2004

[12] [12]

** Relevance **: Does the predicted abstract capture the main points of the gold abstract ?\ n

work page

[13] [13]

** Coherence **: Is the predicted abstract logically organized and easy to follow ?\ n

work page

[14] [14]

** Conciseness **: Is the predicted abstract free from unnecessary details ?\ n

work page

[15] [15]

- Provide a ** brief explanation ** for the assigned score .\ n \ n

** Factual Accuracy **: Are the claims in the predicted abstract consistent with the gold abstract ?\ n \ n For each criterion :\ n - Assign a ** score ** between 1 and 10 (1 = very poor , 10 = excellent ) .\ n " - Provide a ** brief explanation ** for the assigned score .\ n \ n " Your output must be in the following JSON format :\ n \ n " {\" relevance ...

work page