Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline
Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3
The pith
A reusable cross-domain pipeline for evaluating AI meeting summaries finds no significant accuracy differences among models but highlights retention advantages for gpt-5.1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a fixed evaluation protocol with claim-grounded scoring, accuracy differences among gpt-4.1-mini, gpt-5-mini, and gpt-5.1 are not statistically significant under Holm correction on 114 meetings across three domains, even though gpt-4.1-mini has the highest mean accuracy (0.583). The significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). The pipeline also supports cross-domain reuse and online monitoring.
What carries the argument
The reusable evaluation pipeline that combines structured ground-truth claim construction, fixed candidate generation, claim-grounded scoring, and persisted reporting across domains.
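As a rough illustration of what claim-grounded scoring could look like, the sketch below scores one summary against a list of ground-truth claims. The paper's actual metric definitions and judge prompts are not given in this review, so the names `ClaimScores`, `score_summary`, and `exact_match_judge`, as well as the precision-like and recall-like formulas, are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ClaimScores:
    supported_fraction: float  # share of summary claims backed by ground truth
    retained_fraction: float   # share of ground-truth claims found in the summary


def exact_match_judge(claim: str, references: Sequence[str]) -> bool:
    """Hypothetical stand-in for a judge: case-insensitive exact match."""
    normalized = claim.strip().lower()
    return any(normalized == ref.strip().lower() for ref in references)


def score_summary(
    summary_claims: Sequence[str],
    gt_claims: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool] = exact_match_judge,
) -> ClaimScores:
    """Score a summary in both directions against ground-truth claims."""
    supported = sum(judge(c, gt_claims) for c in summary_claims)
    retained = sum(judge(c, summary_claims) for c in gt_claims)
    return ClaimScores(
        supported_fraction=supported / len(summary_claims) if summary_claims else 0.0,
        retained_fraction=retained / len(gt_claims) if gt_claims else 0.0,
    )
```

In a real pipeline the `judge` callable would presumably be a model-based or human judgment rather than string matching; keeping it as a parameter is one way to hold the metric code fixed while the judge changes.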
If this is right
- Accuracy remains comparable across the tested models under this protocol, suggesting interchangeable use for basic summary tasks.
- White House press briefings (the whitehouse_press_briefings slice) emerge as an accuracy-hard regime that may need targeted model improvements.
- Retention metrics provide clearer separation than accuracy, favoring gpt-5.1 for summaries that capture more complete content.
- The same evaluation stack supports focused reruns with additional models such as gpt-4.1 and gpt-5.4 without altering judges or metrics.
- Privacy-bounded online interfaces enable active monitoring and regime detection without exposing customer data.
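The idea of a typed slice flagging a hard regime can be sketched as a simple group-by over per-pair scores. This is a minimal sketch under assumed inputs: the row layout, the function names `mean_by_slice` and `hardest_slice`, and every number in `example_rows` are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(rows):
    """Group (domain, model, accuracy) rows and return mean accuracy per domain."""
    buckets = defaultdict(list)
    for domain, _model, accuracy in rows:
        buckets[domain].append(accuracy)
    return {domain: mean(values) for domain, values in buckets.items()}


def hardest_slice(rows):
    """Return the domain with the lowest mean accuracy."""
    means = mean_by_slice(rows)
    return min(means, key=means.get)


# Made-up scores, for illustration only; not results from the paper.
example_rows = [
    ("city_council", "model_a", 0.70),
    ("city_council", "model_b", 0.60),
    ("private_data", "model_a", 0.65),
    ("private_data", "model_b", 0.55),
    ("whitehouse_press_briefings", "model_a", 0.40),
    ("whitehouse_press_briefings", "model_b", 0.30),
]
```

Because only per-domain aggregates leave the function, the same shape of computation would fit a privacy-bounded export where individual meetings are never exposed.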
Where Pith is reading between the lines
- Similar claim-grounded pipelines could extend to summarization tasks in legal or medical domains where factual retention is critical.
- Organizations might integrate the monitoring interface to track directional performance trends and nominate models for deployment.
- Domain-specific slices suggest that accuracy-hard regimes like press briefings could benefit from model fine-tuning or prompt adjustments.
- Automating more of the judge process while preserving claim grounding might scale evaluations to thousands of meetings.
Load-bearing premise
Claim-grounded scoring by human or automated judges reliably measures summary quality without systematic bias, and structured ground-truth construction stays consistent across distinct meeting domains.
What would settle it
Replicating the full protocol on the same or expanded meeting set and finding statistically significant accuracy differences after Holm correction, or a reversal in which model leads on completeness and coverage.
Original abstract
Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis.
Model naming note: Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a reusable cross-domain evaluation pipeline for AI meeting summaries, combining structured ground-truth construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded monitoring interface. It benchmarks gpt-4.1-mini, gpt-5-mini, and gpt-5.1 on 114 meetings across the city_council, private_data, and whitehouse_press_briefings domains (340 meeting-model pairs, 680 judge runs). It reports non-significant accuracy differences under Holm correction (gpt-4.1-mini has the highest mean accuracy, 0.583) but significant retention advantages for gpt-5.1 (completeness 0.886, coverage 0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime; a later rerun reuses the stack with gpt-4.1, gpt-5-mini, and gpt-5.4 and adds a DeepEval contrastive analysis.
Significance. If the claim-grounded scoring and cross-domain ground-truth construction hold, the work supplies a practical, privacy-aware framework for industrial teams to perform ongoing model evaluation and hard-regime detection without exposing customer data. The scale of the empirical evaluation (340 pairs, 680 runs) and the reusable pipeline components offer concrete guidance for model selection and monitoring, with the domain-specific findings providing falsifiable predictions for future deployments.
major comments (2)
- [Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.
- [Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.
minor comments (2)
- [Section 4.3] The model-naming convention note (canonical names in text vs. compact labels in tables) is helpful, but embedding the full mapping from Table A directly in the main text rather than referencing an appendix would reduce reader friction.
- [Statistical reporting] The abstract states Holm-corrected p-values of 0.053-0.448 for accuracy; naming the test used (e.g., paired t-test or Wilcoxon), reporting its test statistics, and confirming that all pairwise comparisons were included would clarify the non-significance conclusion without altering the core result.
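To make the Holm step concrete, here is a minimal sketch of the step-down adjustment for a family of pairwise comparisons (three comparisons when three models are compared). The function name `holm_adjust` and the raw p-values in the usage note are illustrative assumptions, not code or values from the paper.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For example, `holm_adjust([0.01, 0.04, 0.03])` gives approximately `[0.03, 0.06, 0.06]`: the smallest raw value is multiplied by 3, the next by 2, and the largest inherits 0.06 because adjusted values are kept monotone. A comparison counts as significant at level alpha when its adjusted p-value is at most alpha.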
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, indicating where we will revise the manuscript to strengthen the evaluation protocol and ground-truth validation while preserving the core contributions of the reusable pipeline.
- Referee: [Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.
  Authors: We agree that the absence of inter-rater agreement statistics and human calibration is a limitation that should be addressed. The pipeline is intentionally designed around fixed automated judges to ensure reproducibility and privacy compliance across deployments. In the revised manuscript we will add a dedicated subsection (Section 4.4) reporting Cohen's kappa on a 10% random sample of claims where two human annotators independently scored against the automated judge outputs. We will also include a limited ablation comparing human vs. automated scoring on 20 meetings to quantify any systematic bias. These additions directly support the retention separations and hard-regime identification without altering the primary automated results or the reported p-values.
  Revision: yes
- Referee: [Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.
  Authors: We acknowledge that the manuscript does not explicitly document consistency checks or inter-domain controls for the structured ground truth. The GT construction follows a fixed claim-extraction protocol described in Section 3, but validation metrics were omitted. In the revision we will expand Section 3.2 with (i) inter-annotator agreement (Fleiss' kappa) computed on a 15% overlap subset of meetings, (ii) cross-domain claim-overlap statistics showing that whitehouse_press_briefings claims remain distinct in type distribution, and (iii) an artifact-control table confirming that retention differences persist after normalizing for domain-specific claim density. These additions will substantiate the cross-domain reuse claims and the typed-slice isolation of the accuracy-hard regime.
  Revision: yes
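The agreement statistic promised in the first response can be sketched briefly. The following is a minimal, dependency-free version of Cohen's kappa for two raters (for instance one human annotator and the automated judge) labelling the same claims. The function name `cohens_kappa` and the label strings in the usage note are assumptions for illustration; the multi-annotator Fleiss' kappa mentioned in the second response is a different statistic and is not shown here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With `human = ["supported", "supported", "unsupported", "unsupported"]` and `judge = ["supported", "unsupported", "unsupported", "unsupported"]`, observed agreement is 0.75, chance agreement is 0.5, and `cohens_kappa(human, judge)` returns 0.5.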
Circularity Check
No significant circularity; empirical results from independent GT comparison
The paper presents an empirical evaluation pipeline with structured ground-truth construction, fixed model candidate generation, and claim-grounded scoring across 114 meetings in three domains. Central claims (non-significant accuracy differences under Holm correction; gpt-5.1 retention leads) are direct statistical comparisons against this externally constructed GT, not reductions of predictions to fitted inputs or self-definitions. No equations, ansatzes, or uniqueness theorems are invoked that collapse to prior self-citations or renamings. The protocol is reusable and benchmarked against held-out meetings, satisfying self-containment against external data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Claim-grounded scoring by judges accurately captures summary quality without bias
- domain assumption: Structured ground-truth construction is consistent and representative across domains
Reference graph
Works this paper leans on
- [1] Philip Zhong, Kent Chen, and Don Wang. 2025. Evaluating Embedding Models and Pipeline Optimization for AI Search Quality. arXiv preprint arXiv:2511.22240. https://doi.org/10.48550/arXiv.2511.22240
- [2] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume 1, pages 364-367. https://doi.org/10.1109...
- [3] Iain McCowan, Jean Carletta, Wessel Kraaij, S. Ashby, Sandrine Bourban, Mike Flynn, Mathieu Guillemot, Thomas Hain, Jan Kadlec, Vasilis Karaiskos, Michael Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnieszka Lisowska, Will Post, Dennis Reidsma, and Pete Wellner. 2005. The AMI Meeting Corpus. In Proceedings of the 5th International Conference on Methods ...
- [4] Ming Zhong, Da Yin, Tao Yu, Ahmed Hassan Awadallah, Xipeng Qiu, and Jiawei Han. 2021. QMSum: A New Benchmark for Query-Based Multi-Domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://aclanthology.org/2021.naacl-main.472/
- [5] Yue Hu, Tzviya Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, and Fei Liu. 2023. MeetingBank: A Benchmark Dataset for Meeting Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2023.acl-long.906
- [6] Soomin Kim, Seongyun Weon, Jinhwi Kim, and Hyunjoong Ko. 2023. ExplainMeetSum: An Explainable Meeting Summarization Benchmark. In Findings of the Association for Computational Linguistics: EMNLP
- [7] https://aclanthology.org/2023.findings-emnlp.573/
- [8] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919. Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.173/
- [9] Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization. Transactions of the Association for Computational Linguistics, 10. https://doi.org/10.1162/tacl_a_00453
- [10] Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Julian Michael, Niloofar Mireshghallah, Khyathi Chandu, Eric Wallace, Emily Dinan, Ashish Sabharwal, and Adina Williams. 2021. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics...
- [11] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150-158. https://aclanthology.org/2024.eacl-demo.16/
- [12] RAGAS Documentation. 2026. Metrics Overview. https://docs.ragas.io/en/stable/concepts/metrics/overview/. Accessed April 17, 2026
- [13] TruLens Documentation. 2026. Documentation Index. https://www.trulens.org/docs/. Accessed April 17, 2026
- [14] Confident AI Documentation. 2026. LLM Evaluation Documentation. https://www.confident-ai.com/docs. Accessed April 17, 2026