BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
Pith reviewed 2026-05-10 13:45 UTC · model grok-4.3
The pith
BenGER integrates task design, annotation, LLM runs, and multi-metric evaluation into a single collaborative web platform for German legal benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators.
What carries the argument
The BenGER web platform that unifies task creation, collaborative annotation, configurable LLM runs, and multi-metric evaluation with tenant isolation and role-based access.
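To make the combination of tenant isolation and role-based access concrete, here is a minimal sketch under stated assumptions. The source does not describe BenGER's actual data model, so every name below (`Role`, `User`, `Project`, `can_access`, the organization IDs) is hypothetical and only illustrates the general idea.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical role ladder; BenGER's real roles are not given in the source.
    ANNOTATOR = 1
    PROJECT_ADMIN = 2


@dataclass(frozen=True)
class User:
    name: str
    org_id: str  # the tenant (organization) the user belongs to
    role: Role


@dataclass(frozen=True)
class Project:
    title: str
    org_ids: frozenset  # organizations participating in this project


def can_access(user: User, project: Project, required: Role) -> bool:
    """Grant access only if the user's tenant is part of the project
    and the user's role is at least the required one."""
    same_tenant = user.org_id in project.org_ids
    return same_tenant and user.role.value >= required.value


# Illustrative data (assumed, not from the paper).
project = Project("contract-law-qa", frozenset({"uni-a", "firm-b"}))
alice = User("alice", "uni-a", Role.ANNOTATOR)
bob = User("bob", "firm-c", Role.PROJECT_ADMIN)
```

In this sketch the tenant check runs before the role check, so a high-privilege user from an organization outside the project (`bob`) is still denied.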
Load-bearing premise
Integrating all benchmarking steps into one web platform with security features will meaningfully improve transparency, reproducibility, and participation by non-technical legal experts over existing fragmented approaches.
What would settle it
A comparison study in which BenGER users show no difference in the ease of creating reproducible benchmarks, or in the number of participating legal experts, relative to users of separate tools.
Original abstract
Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
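As an illustration of what a lexical metric in such an evaluation stage might compute, the following sketch implements token-level F1 between a model answer and a reference answer. This is a common lexical measure chosen here as an assumption; the source does not state which lexical metrics BenGER implements, and the function name `token_f1` and the example strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty strings count as a perfect match; one empty string as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs (assumed, not taken from the paper).
score = token_f1("der Vertrag ist wirksam", "der Vertrag ist nichtig")
```

The example shows why a lexical score alone can mislead in legal tasks: the two answers share three of four tokens yet reach opposite legal conclusions, which is one motivation for pairing lexical metrics with semantic, factual, and judge-based ones.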
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BenGER, an open-source web platform for end-to-end benchmarking of German legal tasks. It integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. The platform supports multi-organization projects with tenant isolation and role-based access control, and optionally provides formative, reference-grounded feedback to annotators. A live deployment demo is announced.
Significance. If the platform functions as described, it could meaningfully consolidate fragmented workflows in German legal NLP evaluation, improving transparency, reproducibility, and participation by non-technical experts through integrated tools and multi-tenant collaboration features. The open-source release and multi-metric evaluation support are notable strengths that could aid community adoption.
major comments (1)
- Abstract: The claim that BenGER integrates task creation, annotation, LLM execution, and multi-metric evaluation into a single platform with tenant isolation is presented without any implementation details, architecture overview, code references, or validation metrics, which is load-bearing for verifying the end-to-end functionality and workflow improvements.
minor comments (2)
- The manuscript would benefit from including at least one figure or table summarizing the platform's architecture or user roles to clarify the described features.
- No discussion of related benchmarking platforms (e.g., existing legal NLP tools or general annotation frameworks) is provided to situate the contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive review of our manuscript on the BenGER platform. We address the single major comment below.
- Referee: Abstract: The claim that BenGER integrates task creation, annotation, LLM execution, and multi-metric evaluation into a single platform with tenant isolation is presented without any implementation details, architecture overview, code references, or validation metrics, which is load-bearing for verifying the end-to-end functionality and workflow improvements.
  Authors: We acknowledge that the abstract is intentionally high-level and concise, as is conventional, and therefore omits the requested specifics. The full manuscript supplies these details: Section 3 presents the system architecture and tenant-isolation design; Section 4 describes the implementation stack, configurable LLM execution pipeline, and links to the open-source repository; Section 5 covers the lexical, semantic, factual, and judge-based metrics together with validation via the live demo. To address the referee's concern directly, we will revise the abstract to include brief references to the architecture overview, the public code repository, and the multi-metric evaluation framework while preserving its length. This change will make the end-to-end claim more verifiable from the abstract alone.
  Revision: yes
Circularity Check
No significant circularity detected
The manuscript is a system-description paper whose central claim is the existence and feature set of an open-source web platform (BenGER) that integrates task creation, annotation, LLM execution, and multi-metric evaluation for German legal benchmarks. No equations, derivations, fitted parameters, predictions, or load-bearing self-citations appear in the provided text. The description enumerates implemented capabilities without any step that reduces by construction to its own inputs or relies on an unverified self-referential premise. This is the expected honest outcome for a non-mathematical engineering/systems contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. doi:10.48550/arXiv.2505.12864
- [2] Neel Guha, Daniel E. Ho, Julian Nyarko, and Christopher Ré. 2022. LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning. arXiv:2209.06120 [cs]
- [3] Urs Kramer, Michael Granitzer, and Johann Graf Lambsdorff. 2024. DeepWrite: Annotation and Extraction of Legal Texts. https://extract-annotations.deepwrite.pads.fim.uni-passau.de/
- [4] Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. 2018. doccano: Text Annotation Tool for Human. Software available from https://github.com/doccano/doccano
- [5]
- [6] Gijs van Dijck, Carlos Aguilera, Chris van der Lans, Shashank Chakravarthy, and Sander van Essel. 2022. Lawnotation: A Formal Language for Legal Rules. https://www.lawnotation.org