{"paper":{"title":"BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"BenGER integrates task design, annotation, LLM runs, and multi-metric evaluation into a single collaborative web platform for German legal benchmarks.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Matthias Grabmair, Sebastian Nagl","submitted_at":"2026-04-15T07:43:01Z","abstract_excerpt":"Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization pr"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. \nBenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That integrating task design, annotation, LLM execution, and multi-metric evaluation into a single web platform with tenant isolation and role-based access will meaningfully improve transparency, reproducibility, and participation by non-technical legal experts compared to existing split workflows.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BenGER is a collaborative web platform that integrates end-to-end workflows for creating, annotating, running, and evaluating benchmarks on German legal tasks with large language models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"BenGER integrates task design, annotation, LLM runs, and multi-metric evaluation into a single collaborative web platform for German legal benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8a1085bcd9e26c9a5be4b5c6e525e701894fa8b0830ff55ef7db2fa7235986f8"},"source":{"id":"2604.13583","kind":"arxiv","version":3},"verdict":{"id":"93fa6909-6939-45d0-ae55-516d5164ae3b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-10T13:45:09.899118Z","strongest_claim":"We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. \nBenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators.","one_line_summary":"BenGER is a collaborative web platform that integrates end-to-end workflows for creating, annotating, running, and evaluating benchmarks on German legal tasks with large language models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That integrating task design, annotation, LLM execution, and multi-metric evaluation into a single web platform with tenant isolation and role-based access will meaningfully improve transparency, reproducibility, and participation by non-technical legal experts compared to existing split workflows.","pith_extraction_headline":"BenGER integrates task design, annotation, LLM runs, and multi-metric evaluation into a single collaborative web platform for German legal benchmarks."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.13583/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}