{"work":{"id":"7255b223-8380-468c-9951-e1617432eb73","openalex_id":null,"doi":null,"arxiv_id":"2410.12784","raw_key":null,"title":"JudgeBench: A Benchmark for Evaluating LLM-based Judges","authors":null,"authors_text":"Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang","year":2024,"venue":"cs.AI","abstract":"LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. \nData and code are available at https://github.com/ScalerLab/JudgeBench.","external_url":"https://arxiv.org/abs/2410.12784","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-23T17:35:44.147496+00:00","pith_arxiv_id":"2410.12784","created_at":"2026-05-10T07:32:00.280840+00:00","updated_at":"2026-05-23T17:35:44.147496+00:00","title_quality_ok":true,"display_title":"JudgeBench: A Benchmark for Evaluating LLM-based Judges","render_title":"JudgeBench: A Benchmark for Evaluating LLM-based Judges"},"hub":{"state":{"work_id":"7255b223-8380-468c-9951-e1617432eb73","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":19,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2024-11-23T16:03:35+00:00","last_pith_cited_at":"2026-05-19T00:47:02+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-03T02:55:32.468947+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":5},{"context_role":"dataset","n":3}],"polarity_counts":[{"context_polarity":"background","n":6},{"context_polarity":"use_dataset","n":2}],"runs":{},"summary":{},"graph":{},"authors":[]}}