Title resolution pending

Assistant A is significantly better: [[A>>B]]

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

cs.DC · 2025-11-11 · unverdicted · novelty 6.0

Local LLMs answer 88.7% of 1M real-world queries, with intelligence per watt (IPW) improving 5.3x from 2023 to 2025, indicating that local inference can handle most queries efficiently on power-constrained devices.

JudgeBench: A Benchmark for Evaluating LLM-based Judges

cs.AI · 2024-10-16 · unverdicted · novelty 6.0

JudgeBench converts difficult datasets into response pairs with objectively verifiable correctness to evaluate LLM judges, showing that even strong models such as GPT-4o perform close to random guessing.

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

BenchBuilder automates the extraction of 500 challenging prompts from crowdsourced data into Arena-Hard-Auto, delivering 3x model separation versus MT-Bench and 98.6% correlation with human preference at a cost of $20.

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

cs.AI · 2024-03-07 · unverdicted · novelty 6.0

Chatbot Arena is an open platform that ranks LLMs using crowdsourced pairwise human votes, with statistical methods and validation showing diverse questions and agreement with expert raters.

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

cs.LG · 2025-12-29 · unverdicted · novelty 5.0

DIR applies an information bottleneck to reward model training to mitigate complex inductive biases such as length, sycophancy, and format, with claimed improvements in RLHF generalization.

citing papers explorer

Showing 5 of 5 citing papers.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
cs.DC · 2025-11-11 · unverdicted · none · ref 12

JudgeBench: A Benchmark for Evaluating LLM-based Judges
cs.AI · 2024-10-16 · unverdicted · none · ref 6

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
cs.LG · 2024-06-17 · unverdicted · none · ref 10

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
cs.AI · 2024-03-07 · unverdicted · none · ref 37

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
cs.LG · 2025-12-29 · unverdicted · none · ref 2
