arxiv: 2604.13583 · v2 · submitted 2026-04-15 · 💻 cs.CL · cs.AI

BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Pith reviewed 2026-05-10 13:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords benger · legal · annotation · benchmark · collaborative · creation · end-to-end · evaluation

The pith

BenGER integrates task design, annotation, LLM runs, and multi-metric evaluation into a single collaborative web platform for German legal benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BenGER as a solution to the fragmented nature of evaluating large language models for legal reasoning in German. Task design, expert annotation, model execution, and evaluation are typically handled separately, which hinders transparency and limits involvement from legal professionals who lack technical expertise. BenGER combines these steps in one open-source web platform that includes collaborative features, secure multi-organization support, and multiple types of evaluation metrics including lexical, semantic, factual, and judge-based ones. It also offers optional feedback to annotators based on references. The platform is demonstrated live to show its end-to-end capabilities for benchmark creation and analysis.

Core claim

We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators.
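To make the "lexical" metric family concrete, the following is a minimal sketch of a token-overlap F1 score between a model answer and a reference answer. The abstract does not say which lexical metrics BenGER implements, so the function name `token_f1`, the whitespace tokenization, and the lowercasing are illustrative assumptions, not the platform's actual code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference.

    Hypothetical helper for illustration; not taken from BenGER.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; exactly one empty string does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("der Vertrag ist wirksam", "der Vertrag ist nichtig"))
```

A score like this rewards surface overlap only: in the usage line above, the two answers reach opposite legal conclusions yet share three of four tokens. That limitation is presumably why the platform pairs lexical metrics with semantic, factual, and judge-based ones.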

What carries the argument

The BenGER web platform that unifies task creation, collaborative annotation, configurable LLM runs, and multi-metric evaluation with tenant isolation and role-based access.
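As a rough illustration of how tenant isolation and role-based access control can combine, the sketch below checks the organization boundary first and the role second. Every name here (`Role`, `PERMISSIONS`, `User`, `Project`, `can_perform`, and the two example actions) is a hypothetical stand-in; the text reviewed here does not describe BenGER's actual access model.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ANNOTATOR = "annotator"
    MAINTAINER = "maintainer"


# Hypothetical permission table: which roles may perform which actions.
PERMISSIONS = {
    "annotate": {Role.ANNOTATOR, Role.MAINTAINER},
    "run_models": {Role.MAINTAINER},
}


@dataclass(frozen=True)
class User:
    name: str
    org_id: str
    role: Role


@dataclass(frozen=True)
class Project:
    title: str
    org_id: str


def can_perform(user: User, project: Project, action: str) -> bool:
    """Allow an action only within the user's organization and role."""
    if user.org_id != project.org_id:
        # Tenant isolation: no access across organizations, whatever the role.
        return False
    # Role-based access control: unknown actions are denied by default.
    return user.role in PERMISSIONS.get(action, set())


if __name__ == "__main__":
    project = Project(title="Contract law cases", org_id="org-a")
    annotator = User(name="alice", org_id="org-a", role=Role.ANNOTATOR)
    outsider = User(name="bob", org_id="org-b", role=Role.MAINTAINER)
    print(can_perform(annotator, project, "annotate"))    # same org, permitted role
    print(can_perform(annotator, project, "run_models"))  # same org, role too low
    print(can_perform(outsider, project, "annotate"))     # different org
```

Checking the organization before the role means a high-privilege account in one tenant gains nothing in another, which is the property the phrase "tenant isolation" usually denotes.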

Load-bearing premise

Integrating all benchmarking steps into one web platform with security features will meaningfully improve transparency, reproducibility, and participation by non-technical legal experts over existing fragmented approaches.

What would settle it

A comparison study in which BenGER users show no difference, relative to users of separate tools, in the ease of creating reproducible benchmarks or in the number of participating legal experts.

Original abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents BenGER, an open-source web platform for end-to-end benchmarking of German legal tasks. It integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. The platform supports multi-organization projects with tenant isolation and role-based access control, and optionally provides formative, reference-grounded feedback to annotators. A live deployment demo is announced.

Significance. If the platform functions as described, it could meaningfully consolidate fragmented workflows in German legal NLP evaluation, improving transparency, reproducibility, and participation by non-technical experts through integrated tools and multi-tenant collaboration features. The open-source release and multi-metric evaluation support are notable strengths that could aid community adoption.

major comments (1)
  1. Abstract: The claim that BenGER integrates task creation, annotation, LLM execution, and multi-metric evaluation into a single platform with tenant isolation is presented without implementation details, an architecture overview, code references, or validation metrics. These are needed to verify the end-to-end functionality and the claimed workflow improvements.
minor comments (2)
  1. The manuscript would benefit from including at least one figure or table summarizing the platform's architecture or user roles to clarify the described features.
  2. No discussion of related benchmarking platforms (e.g., existing legal NLP tools or general annotation frameworks) is provided to situate the contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review of our manuscript on the BenGER platform. We address the single major comment below.

  1. Referee: Abstract: The claim that BenGER integrates task creation, annotation, LLM execution, and multi-metric evaluation into a single platform with tenant isolation is presented without implementation details, an architecture overview, code references, or validation metrics. These are needed to verify the end-to-end functionality and the claimed workflow improvements.

    Authors: We acknowledge that the abstract is intentionally high-level and concise, as is conventional, and therefore omits the requested specifics. The full manuscript supplies these details: Section 3 presents the system architecture and tenant-isolation design; Section 4 describes the implementation stack, configurable LLM execution pipeline, and links to the open-source repository; Section 5 covers the lexical, semantic, factual, and judge-based metrics together with validation via the live demo. To address the referee's concern directly, we will revise the abstract to include brief references to the architecture overview, the public code repository, and the multi-metric evaluation framework while preserving its length. This change will make the end-to-end claim more verifiable from the abstract alone.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript is a system-description paper whose central claim is the existence and feature set of an open-source web platform (BenGER) that integrates task creation, annotation, LLM execution, and multi-metric evaluation for German legal benchmarks. No equations, derivations, fitted parameters, predictions, or load-bearing self-citations appear in the provided text. The description enumerates implemented capabilities without any step that reduces by construction to its own inputs or relies on an unverified self-referential premise. This is the expected honest outcome for a non-mathematical engineering/systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are involved because this is a description of a software platform and workflow integration rather than a theoretical derivation or empirical scientific claim with fitted values or postulates.

pith-pipeline@v0.9.0 · 5415 in / 1269 out tokens · 70670 ms · 2026-05-10T13:45:09.899118+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. doi:10.48550/arXiv.2505.12864

  2. [2]

    Neel Guha, Daniel E. Ho, Julian Nyarko, and Christopher Ré. 2022. LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning. arXiv:2209.06120 [cs]

  3. [3]

    Urs Kramer, Michael Granitzer, and Johann Graf Lambsdorff. 2024. DeepWrite: Annotation and Extraction of Legal Texts. https://extract-annotations.deepwrite.pads.fim.uni-passau.de/

  4. [4]

    Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. 2018. doccano: Text Annotation Tool for Human. Software available from https://github.com/doccano/doccano

  5. [5]

    Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. 2020-2025. Label Studio: Data labeling software. Open source software available from https://github.com/HumanSignal/label-studio

  6. [6]

    Gijs van Dijck, Carlos Aguilera, Chris van der Lans, Shashank Chakravarthy, and Sander van Essel. 2022. Lawnotation: A Formal Language for Legal Rules. https://www.lawnotation.org