arxiv: 2604.21345 · v2 · submitted 2026-04-23 · 💻 cs.AI · cs.CL

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords AI meeting summaries · evaluation pipeline · cross-domain benchmarking · claim-grounded scoring · LLM evaluation · retention metrics · statistical significance · reusable evaluation system

The pith

A reusable cross-domain pipeline for evaluating AI meeting summaries finds no significant accuracy differences among models but highlights retention advantages for gpt-5.1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reusable evaluation system for AI-generated meeting summaries that combines structured ground-truth construction, fixed candidate generation, claim-grounded scoring by judges, persisted reporting, and privacy-bounded online monitoring. The offline benchmark covers 114 meetings from three domains (city_council, private_data, and whitehouse_press_briefings), yielding 340 completed meeting-model pairs. Accuracy differences among gpt-4.1-mini, gpt-5-mini, and gpt-5.1 are not statistically significant after Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy at 0.583. Retention metrics do separate the models: gpt-5.1 leads on completeness (0.886) and coverage (0.942). This matters for industrial teams that deploy LLM features before stable regression tests exist, because the pipeline supports consistent, repeatable comparison and monitoring without exposing customer data.

Core claim

Under a fixed evaluation protocol with claim-grounded scoring, accuracy differences among gpt-4.1-mini, gpt-5-mini, and gpt-5.1 are not statistically significant under Holm correction on 114 meetings across three domains, although gpt-4.1-mini has the highest mean accuracy (0.583). The significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). The pipeline also supports cross-domain reuse and online monitoring.
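
For readers unfamiliar with the Holm correction named in the core claim, the following is a minimal sketch of the standard Holm step-down adjustment. The function name and the example p-values are hypothetical and are not taken from the paper; the paper's own corrected values are the 0.053-0.448 range quoted in its abstract.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Visit the raw p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for three pairwise comparisons (illustration only).
raw = {"A vs B": 0.01, "A vs C": 0.04, "B vs C": 0.03}
for name, p_adj in zip(raw, holm_adjust(list(raw.values()))):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{name}: adjusted p = {p_adj:.3f} ({verdict})")
```

With three comparisons, the smallest raw p-value is tripled, the middle one doubled, and the largest left unchanged, subject to the running maximum. A raw p-value just under 0.05 can therefore end up non-significant after adjustment.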

What carries the argument

The reusable evaluation pipeline that combines structured ground-truth claim construction, fixed candidate generation, claim-grounded scoring, and persisted reporting across domains.
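
One way to picture claim-grounded scoring is as per-claim verdicts that are aggregated into ratios. The sketch below assumes that reading. The `ClaimVerdict` type, the verdict labels, and both formulas are illustrative guesses, because this page does not reproduce the paper's metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimVerdict:
    """A judge's verdict on one ground-truth claim (hypothetical schema)."""
    claim_id: str
    verdict: str  # assumed labels: "full", "partial", or "missing"


def retention_scores(verdicts):
    """Aggregate per-claim verdicts into two retention-style ratios.

    Assumed definitions, not the paper's:
      completeness = share of ground-truth claims fully captured
      coverage     = share of ground-truth claims at least partially captured
    """
    if not verdicts:
        raise ValueError("at least one ground-truth claim is required")
    total = len(verdicts)
    full = sum(v.verdict == "full" for v in verdicts)
    touched = sum(v.verdict in ("full", "partial") for v in verdicts)
    return {"completeness": full / total, "coverage": touched / total}


# Made-up verdicts for a single meeting summary.
example = [
    ClaimVerdict("c1", "full"),
    ClaimVerdict("c2", "full"),
    ClaimVerdict("c3", "partial"),
    ClaimVerdict("c4", "missing"),
]
print(retention_scores(example))  # {'completeness': 0.5, 'coverage': 0.75}
```

Under these assumed definitions coverage can never be lower than completeness, which matches the ordering of the reported figures (0.942 versus 0.886) but does not confirm that the paper defines the metrics this way.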

If this is right

  • Accuracy differences among the tested models are not distinguishable under this protocol, which is consistent with (though not proof of) interchangeable use for basic summary tasks.
  • White House press briefings (the whitehouse_press_briefings slice) emerge as an accuracy-hard regime that may need targeted model improvements.
  • Retention metrics provide clearer separation than accuracy, favoring gpt-5.1 for summaries that capture more complete content.
  • The same evaluation stack supports focused reruns with additional models like gpt-5.4 without altering judges or metrics.
  • Privacy-bounded online interfaces enable active monitoring and regime detection without exposing customer data.
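
The typed slices and privacy-bounded aggregates mentioned in the list above can be sketched as a grouped mean with small-group suppression. This is only an illustration: the record layout, the `slice_means` name, and the minimum-group-size rule are assumptions, and the paper's actual export format is not described on this page.

```python
from collections import defaultdict
from statistics import mean


def slice_means(records, min_group_size=5):
    """Mean score per (domain, model) slice, dropping small groups.

    records: iterable of (domain, model, score) tuples (hypothetical layout).
    Slices with fewer than min_group_size rows are suppressed, so that an
    exported aggregate never describes just a handful of meetings.
    """
    groups = defaultdict(list)
    for domain, model, score in records:
        groups[(domain, model)].append(score)
    return {
        key: {"n": len(scores), "mean": mean(scores)}
        for key, scores in groups.items()
        if len(scores) >= min_group_size
    }


# Made-up scores; only the domain names come from the paper.
rows = [
    ("city_council", "model_a", 0.4),
    ("city_council", "model_a", 0.6),
    ("whitehouse_press_briefings", "model_a", 0.3),
]
for key, stats in slice_means(rows, min_group_size=2).items():
    print(key, stats)
```

Only aggregates leave the function, which mirrors the idea of reporting directional movement without exposing individual meetings. A hard regime would show up as a slice whose mean sits well below the others.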

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar claim-grounded pipelines could extend to summarization tasks in legal or medical domains where factual retention is critical.
  • Organizations might integrate the monitoring interface to track directional performance trends and nominate models for deployment.
  • Domain-specific slices suggest that accuracy-hard regimes like press briefings could benefit from model fine-tuning or prompt adjustments.
  • Automating more of the judge process while preserving claim grounding might scale evaluations to thousands of meetings.

Load-bearing premise

Claim-grounded scoring by human or automated judges reliably measures summary quality without systematic bias, and structured ground-truth construction stays consistent across distinct meeting domains.

What would settle it

Replicating the full protocol on the same or expanded meeting set and finding statistically significant accuracy differences after Holm correction, or a reversal in which model leads on completeness and coverage.

Figures

Figures reproduced from arXiv: 2604.21345 by Don Wang, Jason Zhang, Kent Chen, Philip Zhong.

Figure 1: Reusable quality-loop architecture for AI meeting-summary evaluation. Solid components correspond to the ...
Figure 2: Workflow schematic for the meeting-summary evaluation pipeline and its packaged artifacts. Transcript assets feed ...

Original abstract

Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis. Model naming note. Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.
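
The counts in the abstract can be cross-checked with simple arithmetic. The sketch below assumes that every meeting was meant to be run through all three models, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
meetings = 114
models = ["gpt-4.1-mini", "gpt-5-mini", "gpt-5.1"]
completed_pairs = 340
judge_runs = 680

# Assumption: each meeting is paired with each model exactly once.
possible_pairs = meetings * len(models)
missing_pairs = possible_pairs - completed_pairs
runs_per_pair = judge_runs / completed_pairs

print(possible_pairs)  # 342
print(missing_pairs)   # 2
print(runs_per_pair)   # 2.0
```

Under that assumption, two of the 342 possible meeting-model pairs did not complete, and the 680 judge runs work out to two per completed pair.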

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a reusable cross-domain evaluation pipeline for AI meeting summaries, combining structured ground-truth construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded monitoring interface. It benchmarks gpt-4.1-mini, gpt-5-mini, and gpt-5.1 on 114 meetings across the city_council, private_data, and whitehouse_press_briefings domains (340 meeting-model pairs, 680 judge runs), reporting non-significant accuracy differences under Holm correction (gpt-4.1-mini has the highest mean accuracy, 0.583) but significant retention advantages for gpt-5.1 (completeness 0.886, coverage 0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime; a later rerun reuses the stack with gpt-4.1, gpt-5-mini, and gpt-5.4, and the extended preprint adds a typed DeepEval contrastive analysis.

Significance. If the claim-grounded scoring and cross-domain ground-truth construction hold, the work supplies a practical, privacy-aware framework for industrial teams to perform ongoing model evaluation and hard-regime detection without exposing customer data. The scale of the empirical evaluation (340 pairs, 680 runs) and the reusable pipeline components offer concrete guidance for model selection and monitoring, with the domain-specific findings providing falsifiable predictions for future deployments.

major comments (2)
  1. [Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.
  2. [Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.
minor comments (2)
  1. [Section 4.3] The model-naming convention note (canonical names in text vs. compact labels in tables) is helpful, but embedding the full mapping from Table A directly in the main text rather than referencing an appendix would reduce reader friction.
  2. [Statistical reporting] The abstract states Holm-corrected p-values of 0.053-0.448 for accuracy; naming the exact test used (e.g., paired t-test or Wilcoxon signed-rank), reporting its test statistic, and confirming that all pairwise comparisons were included would clarify the non-significance conclusion without altering the core result.
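
To make the statistical-reporting point concrete, here is a minimal sketch of one paired test that could produce the raw p-values fed into a Holm correction. It is an exact sign-flip permutation test, chosen only because it needs nothing beyond the standard library. It is not the test the paper used (which is not named here), and it is not one of the two tests the referee mentions. The function name and the scores are hypothetical.

```python
from itertools import product


def paired_sign_flip_p(scores_a, scores_b, max_pairs=20):
    """Two-sided exact sign-flip permutation p-value for paired scores.

    Under the null hypothesis each per-meeting difference is equally likely
    to be positive or negative, so every sign assignment is enumerated.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    if len(scores_a) > max_pairs:
        raise ValueError("too many pairs for exhaustive enumeration")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point noise in ties.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Made-up per-meeting accuracy scores for two models on the same meetings.
model_x = [0.62, 0.55, 0.71, 0.48, 0.66]
model_y = [0.60, 0.57, 0.65, 0.50, 0.61]
print(paired_sign_flip_p(model_x, model_y))
```

Exhaustive enumeration doubles in cost with every added pair, so a benchmark of 114 meetings would need random sampling of sign assignments or an analytic paired test instead.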

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where we will revise the manuscript to strengthen the evaluation protocol and ground-truth validation while preserving the core contributions of the reusable pipeline.

  1. Referee: [Abstract and evaluation protocol] The central claims of non-significant accuracy differences and significant retention separation (gpt-5.1 leading on completeness 0.886 and coverage 0.942) rest on claim-grounded scoring, yet the manuscript reports no inter-rater agreement statistics, human calibration results, or ablation of judge type (human vs. automated). This omission is load-bearing because systematic bias in judges (e.g., favoring same-family models) would directly undermine the reported metric separations and the typed-slice identification of whitehouse_press_briefings as a hard regime.

    Authors: We agree that the absence of inter-rater agreement statistics and human calibration is a limitation that should be addressed. The pipeline is intentionally designed around fixed automated judges to ensure reproducibility and privacy compliance across deployments. In the revised manuscript we will add a dedicated subsection (Section 4.4) reporting Cohen's kappa on a 10% random sample of claims, each scored independently by two human annotators and compared against the automated judge outputs. We will also include a limited ablation comparing human and automated scoring on 20 meetings to quantify any systematic bias. These additions directly support the retention separations and hard-regime identification without altering the primary automated results or the reported p-values. revision: yes

  2. Referee: [Ground-truth construction and cross-domain analysis] No explicit consistency checks, inter-domain artifact controls, or validation of claim extraction are described for the structured GT across city_council, private_data, and whitehouse_press_briefings. This is load-bearing for the cross-domain reuse claims and the isolation of whitehouse_press_briefings as an accuracy-hard regime, as domain-specific biases in GT could artifactually produce the observed retention differences.

    Authors: We acknowledge that the manuscript does not explicitly document consistency checks or inter-domain controls for the structured ground truth. The GT construction follows a fixed claim-extraction protocol described in Section 3, but validation metrics were omitted. In the revision we will expand Section 3.2 with (i) inter-annotator agreement (Fleiss' kappa) computed on a 15% overlap subset of meetings, (ii) cross-domain claim-overlap statistics showing that whitehouse_press_briefings claims remain distinct in type distribution, and (iii) an artifact-control table confirming that retention differences persist after normalizing for domain-specific claim density. These additions will substantiate the cross-domain reuse claims and the typed-slice isolation of the accuracy-hard regime. revision: yes
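
Both responses promise agreement statistics. The sketch below implements the textbook formulas for Cohen's kappa (two raters) and Fleiss' kappa (a fixed number of raters per item). The function names, label values, and sample data are hypothetical, and returning 1.0 when chance agreement is already total is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    for row in counts:
        if len(row) != n_categories or sum(row) != n_raters:
            raise ValueError("every item needs the same categories and rater count")
    # Overall share of ratings that fall into each category.
    share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    expected = sum(p * p for p in share)
    if expected == 1.0:
        return 1.0  # every rating fell into a single category
    return (observed - expected) / (1.0 - expected)


# Made-up labels: a human annotator versus an automated judge on four claims.
human = ["supported", "supported", "unsupported", "unsupported"]
judge = ["supported", "unsupported", "unsupported", "unsupported"]
print(cohens_kappa(human, judge))

# Made-up counts: three annotators, two categories, two items.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

Cohen's kappa fits the human-versus-judge comparison proposed for the new Section 4.4, while Fleiss' kappa fits the multi-annotator overlap subset proposed for Section 3.2.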

Circularity Check

0 steps flagged

No significant circularity; empirical results from independent GT comparison

The paper presents an empirical evaluation pipeline with structured ground-truth construction, fixed model candidate generation, and claim-grounded scoring across 114 meetings in three domains. Central claims (non-significant accuracy differences under Holm correction; gpt-5.1 retention leads) are direct statistical comparisons against this externally constructed GT, not reductions of predictions to fitted inputs or self-definitions. No equations, ansatzes, or uniqueness theorems are invoked that collapse to prior self-citations or renamings. The protocol is reusable and benchmarked against held-out meetings, satisfying self-containment against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests primarily on domain assumptions about ground-truth quality and judge reliability rather than new free parameters or invented entities.

axioms (2)
  • domain assumption: Claim-grounded scoring by judges accurately captures summary quality without bias.
    Invoked in the description of the scoring component and metric definitions.
  • domain assumption: Structured ground-truth construction is consistent and representative across domains.
    Required for the typed slices and cross-domain comparisons to be valid.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Philip Zhong, Kent Chen, and Don Wang. 2025. Evaluating Embedding Models and Pipeline Optimization for AI Search Quality. arXiv preprint arXiv:2511.22240. https://doi.org/10.48550/arXiv.2511.22240

  2. [2]

    Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume 1, pages 364-367. https://doi.org/10.1109...

  3. [3]

    Iain McCowan, Jean Carletta, Wessel Kraaij, S. Ashby, Sandrine Bourban, Mike Flynn, Mathieu Guillemot, Thomas Hain, Jan Kadlec, Vasilis Karaiskos, Michael Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnieszka Lisowska, Will Post, Dennis Reidsma, and Pete Wellner. 2005. The AMI Meeting Corpus. In Proceedings of the 5th International Conference on Methods ...

  4. [4]

    Ming Zhong, Da Yin, Tao Yu, Ahmed Hassan Awadallah, Xipeng Qiu, and Jiawei Han. 2021. QMSum: A New Benchmark for Query-Based Multi-Domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://aclanthology.org/2021.naacl-main.472/

  5. [5]

    Yue Hu, Tzviya Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, and Fei Liu. 2023. MeetingBank: A Benchmark Dataset for Meeting Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2023.acl-long.906

  6. [6]

    Soomin Kim, Seongyun Weon, Jinhwi Kim, and Hyunjoong Ko. 2023. ExplainMeetSum: An Explainable Meeting Summarization Benchmark. In Findings of the Association for Computational Linguistics: EMNLP

  7. [7]

    https://aclanthology.org/2023.findings-emnlp.573/

  8. [8]

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919. Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.173/

  9. [9]

    Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization. Transactions of the Association for Computational Linguistics, 10. https://doi.org/10.1162/tacl_a_00453

  10. [10]

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Julian Michael, Niloofar Mireshghallah, Khyathi Chandu, Eric Wallace, Emily Dinan, Ashish Sabharwal, and Adina Williams. 2021. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics...

  11. [11]

    Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150-158. https://aclanthology.org/2024.eacl-demo.16/

  12. [12]

    RAGAS Documentation. 2026. Metrics Overview. https://docs.ragas.io/en/stable/concepts/metrics/overview/. Accessed April 17, 2026

  13. [13]

    TruLens Documentation. 2026. Documentation Index. https://www.trulens.org/docs/. Accessed April 17, 2026

  14. [14]

    Confident AI Documentation. 2026. LLM Evaluation Documentation. https://www.confident-ai.com/docs. Accessed April 17, 2026