pith. machine review for the scientific record.

arXiv: 2509.26184 · v5 · submitted 2025-09-30 · cs.IR · cs.AI · cs.CL

Recognition: unknown

Auto-ARGUE: LLM-Based Report Generation Evaluation

classification: cs.IR, cs.AI, cs.CL
keywords: generation, auto-argue, report, evaluation, analysis, judgments, llm-based, tasks

Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.
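To make the phrase "system-level correlations with human judgments" concrete, here is a minimal sketch of one common way such a comparison is done: each system gets one aggregate score from human assessors and one from the automatic judge, and a rank correlation is computed over the systems. The abstract does not say which statistic the authors use; Kendall's tau is an assumption here, and the system names and scores are hypothetical values made up for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    # Compare every pair of systems: do both score lists order them the same way?
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means a tie in at least one list; it counts for neither side
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system aggregate scores (illustrative only, not from the paper).
human = {"sys_a": 0.71, "sys_b": 0.64, "sys_c": 0.52, "sys_d": 0.40}
auto = {"sys_a": 0.68, "sys_b": 0.70, "sys_c": 0.49, "sys_d": 0.35}

systems = sorted(human)
tau = kendall_tau([human[s] for s in systems], [auto[s] for s in systems])
print(f"Kendall's tau = {tau:.3f}")  # prints "Kendall's tau = 0.667"
```

In this toy data the two score sets disagree only on the order of `sys_a` and `sys_b`, so five of the six system pairs are concordant and one is discordant, giving (5 - 1) / 6 ≈ 0.667.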

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    DoGMaTiQ automates QA-nugget creation via document-grounded generation, paraphrase clustering, and quality-based subselection, yielding strong rank correlations with human judgments on cross-lingual TREC tasks.

  2. Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization

    cs.DC · 2026-04 · unverdicted · novelty 5.0

    BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwid...