Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Don Wang; Jason Zhang; Kent Chen; Philip Zhong

arXiv: 2604.21345 · v2 · submitted 2026-04-23 · cs.AI · cs.CL

Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3

keywords: AI meeting summaries · evaluation pipeline · cross-domain benchmarking · claim-grounded scoring · LLM evaluation · retention metrics · statistical significance · reusable evaluation system

The pith

A reusable cross-domain pipeline for evaluating AI meeting summaries finds no significant accuracy differences among models but highlights retention advantages for gpt-5.1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reusable evaluation system for AI-generated meeting summaries that combines structured ground-truth construction, fixed candidate generation, claim-grounded scoring by judges, and privacy-bounded online monitoring. The offline benchmark covers 114 meetings from three domains (city_council, private_data, and whitehouse_press_briefings), yielding 340 completed meeting-model pairs. Accuracy differences among gpt-4.1-mini, gpt-5-mini, and gpt-5.1 are not statistically significant after Holm correction, although gpt-4.1-mini has the highest mean accuracy at 0.583. Retention metrics do separate the models: gpt-5.1 leads on completeness (0.886) and coverage (0.942). This matters for industrial teams that deploy LLM features before stable regression tests exist, because the pipeline offers consistent, reusable comparison and monitoring without exposing customer data.
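The reported counts can be sanity-checked with simple arithmetic. The sketch below assumes that each of the three models was meant to run on each of the 114 meetings; the source does not state this explicitly, nor does it explain why some pairs did not complete, so the variable names and the interpretation are illustrative only.

```python
# Counts taken from the abstract.
MEETINGS = 114
MODELS = ["gpt-4.1-mini", "gpt-5-mini", "gpt-5.1"]
COMPLETED_PAIRS = 340
JUDGE_RUNS = 680

# Size of the full grid, assuming every model summarizes every meeting.
possible_pairs = MEETINGS * len(MODELS)

# Pairs that did not complete (the source does not say which or why).
missing_pairs = possible_pairs - COMPLETED_PAIRS

# Average number of judge runs per completed pair. Reading this as
# "two judge passes per pair" is an assumption, not a stated fact.
judge_runs_per_pair = JUDGE_RUNS / COMPLETED_PAIRS

print(possible_pairs, missing_pairs, judge_runs_per_pair)
```

Under that assumption the grid has 342 cells, two of which are missing from the 340 completed pairs, and the 680 judge runs average out to two per completed pair.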

Core claim

Under a fixed evaluation protocol with claim-grounded scoring, accuracy differences among gpt-4.1-mini, gpt-5-mini, and gpt-5.1 are not statistically significant under Holm correction on 114 meetings across three domains, though gpt-4.1-mini has the highest mean accuracy of 0.583, while gpt-5.1 leads significantly on retention with completeness of 0.886 and coverage of 0.942; the pipeline supports cross-domain reuse and online monitoring.

What carries the argument

The reusable evaluation pipeline that combines structured ground-truth claim construction, fixed candidate generation, claim-grounded scoring, and persisted reporting across domains.

If this is right

Accuracy remains comparable across the tested models under this protocol, suggesting interchangeable use for basic summary tasks.
The whitehouse_press_briefings slice emerges as an accuracy-hard regime that may need targeted model improvements.
Retention metrics provide clearer separation than accuracy, favoring gpt-5.1 for summaries that capture more complete content.
The same evaluation stack supports focused reruns with additional models like gpt-5.4 without altering judges or metrics.
Privacy-bounded online interfaces enable active monitoring and regime detection without exposing customer data.
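The idea behind typed slices, as described here, is to group scores by meeting domain and look for a slice where accuracy is systematically lower. A minimal sketch follows; the rows are synthetic placeholders (not values from the paper), the model labels are hypothetical, and the paper's actual slicing logic may differ.

```python
from collections import defaultdict
from statistics import mean

# Synthetic placeholder rows: (domain, model label, accuracy score).
# These numbers are invented for illustration only.
rows = [
    ("city_council", "model_a", 0.70),
    ("city_council", "model_b", 0.60),
    ("whitehouse_press_briefings", "model_a", 0.40),
    ("whitehouse_press_briefings", "model_b", 0.50),
    ("private_data", "model_a", 0.65),
    ("private_data", "model_b", 0.75),
]


def slice_means(rows):
    """Return mean accuracy per domain, pooled over models."""
    by_domain = defaultdict(list)
    for domain, _model, accuracy in rows:
        by_domain[domain].append(accuracy)
    return {domain: mean(scores) for domain, scores in by_domain.items()}


def hardest_slice(rows):
    """Return the domain with the lowest pooled mean accuracy."""
    means = slice_means(rows)
    return min(means, key=means.get)


means = slice_means(rows)
hardest = hardest_slice(rows)
print(means, hardest)
```

A real slice analysis would also need a significance check per slice before calling a regime hard; this sketch only shows the grouping step.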

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar claim-grounded pipelines could extend to summarization tasks in legal or medical domains where factual retention is critical.
Organizations might integrate the monitoring interface to track directional performance trends and nominate models for deployment.
Domain-specific slices suggest that accuracy-hard regimes like press briefings could benefit from model fine-tuning or prompt adjustments.
Automating more of the judge process while preserving claim grounding might scale evaluations to thousands of meetings.

Load-bearing premise

Claim-grounded scoring by human or automated judges reliably measures summary quality without systematic bias, and structured ground-truth construction stays consistent across distinct meeting domains.

What would settle it

Replicating the full protocol on the same or expanded meeting set and finding statistically significant accuracy differences after Holm correction, or a reversal in which model leads on completeness and coverage.

Figures

Figures reproduced from arXiv: 2604.21345 by Don Wang, Jason Zhang, Kent Chen, Philip Zhong.

**Figure 1.** Reusable quality-loop architecture for AI meeting-summary evaluation. Solid components correspond to the …

**Figure 2.** Workflow schematic for the meeting-summary evaluation pipeline and its packaged artifacts. Transcript assets feed …

Original abstract

Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis. Model naming note. Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical reusable pipeline for AI meeting summary evaluation with concrete retention findings, but the claim-grounded scoring needs better validation evidence.

The reusable pipeline for evaluating meeting summaries stands out here, along with the finding that retention metrics separate the models more clearly than accuracy does under their fixed protocol. The paper builds this system by combining structured ground-truth construction, fixed candidate generation, claim-grounded scoring, and cross-domain typed slicing. They test it on 114 meetings from the city_council, private_data, and whitehouse_press_briefings domains, completing 340 meeting-model pairs and 680 judge runs for variants like gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Accuracy shows no significant differences after Holm correction, with gpt-4.1-mini at 0.583 mean, but gpt-5.1 leads on completeness at 0.886 and coverage at 0.942. The privacy-bounded online monitoring allows tracking without data exposure, and they include a later rerun plus DeepEval contrast.

This setup is new in its specific combination for meeting summaries and the emphasis on reuse and privacy. The concrete metrics and statistical handling give it practical weight. The typed slices identifying whitehouse_press_briefings as hard also add actionable insight.

The main soft spot is the validation of the claim-grounded scoring. Details on inter-rater agreement, judge calibration, or bias checks are missing from the provided sections. Without those, the retention leads and domain hardness claims rest on an assumption that the scoring is unbiased and consistent across domains. Ground-truth construction uniformity is another area that could use more explicit support.

This paper targets industrial teams and applied researchers who need deployable benchmarks for summarization under privacy rules. Engineers looking for a ready pipeline and benchmark results will find it directly useful. I would bring it to the next reading group to walk through the pipeline mechanics. It deserves peer review because the empirical work and reuse focus are substantial enough for referees to engage with, even with room to strengthen the scoring validation.

Referee Report

2 major / 2 minor

Summary. The paper presents a reusable cross-domain evaluation pipeline for AI meeting summaries, combining structured ground-truth construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded monitoring interface. It benchmarks gpt-4.1-mini, gpt-5-mini, and gpt-5.1 on 114 meetings across city_council, private_data, and whitehouse_press_briefings domains (340 meeting-model pairs, 680 judge runs), reporting non-significant accuracy differences under Holm correction (gpt-4.1-mini highest mean accuracy 0.583) but significant retention advantages for gpt-5.1 (completeness 0.886, coverage 0.942). Whitehouse_press_briefings is isolated as an accuracy-hard regime via typed slices; a later rerun reuses the stack with gpt-4.1, gpt-5-mini, and gpt-5.4 plus DeepEval contrast.

Significance. If the claim-grounded scoring and cross-domain ground-truth construction hold, the work supplies a practical, privacy-aware framework for industrial teams to perform ongoing model evaluation and hard-regime detection without exposing customer data. The scale of the empirical evaluation (340 pairs, 680 runs) and the reusable pipeline components offer concrete guidance for model selection and monitoring, with the domain-specific findings providing falsifiable predictions for future deployments.

major comments (2)

[Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.
[Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.

minor comments (2)

[Section 4.3] The model-naming convention note (canonical names in text vs. compact labels in tables) is helpful, but embedding the full mapping from Table A directly in the main text rather than referencing an appendix would reduce reader friction.
[Statistical reporting] The abstract states Holm-corrected p-values of 0.053-0.448 for accuracy; naming the exact test used (e.g., paired t-test or Wilcoxon) and confirming that all pairwise comparisons were included would clarify the non-significance conclusion without altering the core result.
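For readers unfamiliar with the correction mentioned above, the Holm step-down procedure can be sketched in a few lines. This is a generic textbook implementation, not the authors' code; the raw p-values are invented for illustration, and the choice of three comparisons simply mirrors the three pairwise contrasts possible among three models.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        scaled = (m - rank) * p_values[idx]
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, scaled)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Illustrative raw p-values for three pairwise comparisons
# (hypothetical, not taken from the paper).
raw = [0.02, 0.30, 0.04]
adjusted = holm_adjust(raw)
print(adjusted)
```

In this made-up case two raw p-values fall below 0.05, yet every adjusted value lands above it, which is the pattern of "nominally suggestive but not significant after correction" that a reader should keep in mind when interpreting a corrected range such as 0.053-0.448.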

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where we will revise the manuscript to strengthen the evaluation protocol and ground-truth validation while preserving the core contributions of the reusable pipeline.

Referee: [Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.

Authors: We agree that the absence of inter-rater agreement statistics and human calibration is a limitation that should be addressed. The pipeline is intentionally designed around fixed automated judges to ensure reproducibility and privacy compliance across deployments. In the revised manuscript we will add a dedicated subsection (Section 4.4) reporting Cohen's kappa on a 10% random sample of claims where two human annotators independently scored against the automated judge outputs. We will also include a limited ablation comparing human vs. automated scoring on 20 meetings to quantify any systematic bias. These additions directly support the retention separations and hard-regime identification without altering the primary automated results or the reported p-values.
Revision: yes

Referee: [Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.

Authors: We acknowledge that the manuscript does not explicitly document consistency checks or inter-domain controls for the structured ground truth. The GT construction follows a fixed claim-extraction protocol described in Section 3, but validation metrics were omitted. In the revision we will expand Section 3.2 with (i) inter-annotator agreement (Fleiss' kappa) computed on a 15% overlap subset of meetings, (ii) cross-domain claim-overlap statistics showing that whitehouse_press_briefings claims remain distinct in type distribution, and (iii) an artifact-control table confirming that retention differences persist after normalizing for domain-specific claim density. These additions will substantiate the cross-domain reuse claims and the typed-slice isolation of the accuracy-hard regime.
Revision: yes
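The agreement statistic promised in the first response, Cohen's kappa, compares observed agreement between two raters with the agreement expected by chance. The sketch below is a generic implementation for two raters with categorical labels; the label names and the sample data are hypothetical and are not drawn from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical claim-level verdicts from a human annotator and an
# automated judge (invented for illustration).
human = ["supported", "supported", "unsupported",
         "supported", "unsupported", "unsupported"]
judge = ["supported", "unsupported", "unsupported",
         "supported", "unsupported", "supported"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here the raters agree on four of six items while chance agreement is 0.5, giving a kappa of about 0.333. Fleiss' kappa, which the second response proposes for ground-truth annotation, generalizes the same idea to more than two raters.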

Circularity Check

0 steps flagged

No significant circularity; empirical results from independent GT comparison

The paper presents an empirical evaluation pipeline with structured ground-truth construction, fixed model candidate generation, and claim-grounded scoring across 114 meetings in three domains. Central claims (non-significant accuracy differences under Holm correction; gpt-5.1 retention leads) are direct statistical comparisons against this externally constructed GT, not reductions of predictions to fitted inputs or self-definitions. No equations, ansatzes, or uniqueness theorems are invoked that collapse to prior self-citations or renamings. The protocol is reusable and benchmarked against held-out meetings, satisfying self-containment against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests primarily on domain assumptions about ground-truth quality and judge reliability rather than new free parameters or invented entities.

axioms (2)

domain assumption: Claim-grounded scoring by judges accurately captures summary quality without bias.
Invoked in the description of the scoring component and metric definitions.
domain assumption: Structured ground-truth construction is consistent and representative across domains.
Required for the typed slices and cross-domain comparisons to be valid.

pith-pipeline@v0.9.0 · 5643 in / 1282 out tokens · 42936 ms · 2026-05-14T20:57:26.599288+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Philip Zhong, Kent Chen, and Don Wang. 2025. Evaluating Embedding Models and Pipeline Optimization for AI Search Quality. arXiv preprint arXiv:2511.22240. https://doi.org/10.48550/arXiv.2511.22240

[2] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume 1, pages 364-367. https://doi.org/10.1109...

[3] Iain McCowan, Jean Carletta, Wessel Kraaij, S. Ashby, Sandrine Bourban, Mike Flynn, Mathieu Guillemot, Thomas Hain, Jan Kadlec, Vasilis Karaiskos, Michael Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnieszka Lisowska, Will Post, Dennis Reidsma, and Pete Wellner. 2005. The AMI Meeting Corpus. In Proceedings of the 5th International Conference on Methods ...

[4] Ming Zhong, Da Yin, Tao Yu, Ahmed Hassan Awadallah, Xipeng Qiu, and Jiawei Han. 2021. QMSum: A New Benchmark for Query-Based Multi-Domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://aclanthology.org/2021.naacl-main.472/

[5] Yue Hu, Tzviya Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, and Fei Liu. 2023. MeetingBank: A Benchmark Dataset for Meeting Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2023.acl-long.906

[6] Soomin Kim, Seongyun Weon, Jinhwi Kim, and Hyunjoong Ko. 2023. ExplainMeetSum: An Explainable Meeting Summarization Benchmark. In Findings of the Association for Computational Linguistics: EMNLP

[7] https://aclanthology.org/2023.findings-emnlp.573/

[8] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919. Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.173/

[9] Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization. Transactions of the Association for Computational Linguistics, 10. https://doi.org/10.1162/tacl_a_00453

[10] Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Julian Michael, Niloofar Mireshghallah, Khyathi Chandu, Eric Wallace, Emily Dinan, Ashish Sabharwal, and Adina Williams. 2021. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics...

[11] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150-158. https://aclanthology.org/2024.eacl-demo.16/

[12] RAGAS Documentation. 2026. Metrics Overview. https://docs.ragas.io/en/stable/concepts/metrics/overview/. Accessed April 17, 2026

[13] TruLens Documentation. 2026. Documentation Index. https://www.trulens.org/docs/. Accessed April 17, 2026

[14] Confident AI Documentation. 2026. LLM Evaluation Documentation. https://www.confident-ai.com/docs. Accessed April 17, 2026
