5 Pith papers cite this work.
Citing papers
- Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
  Local LLMs answer 88.7% of 1M real-world queries, and intelligence per watt (IPW) improved 5.3x from 2023 to 2025, indicating that local inference can handle most queries efficiently on power-constrained devices.
- JudgeBench: A Benchmark for Evaluating LLM-based Judges
  JudgeBench converts difficult datasets into response pairs labeled for objective correctness in order to evaluate LLM judges, showing that even strong models such as GPT-4o perform near random guessing.
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
  BenchBuilder automates the extraction of 500 challenging prompts from crowdsourced data into Arena-Hard-Auto, delivering 3x model separation versus MT-Bench and 98.6% correlation with human preference at a cost of $20.
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
  Chatbot Arena is an open platform that ranks LLMs using crowdsourced pairwise human votes; statistical methods and validation show that its questions are diverse and that its votes agree with expert raters.
- Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
  DIR applies an information bottleneck to reward model training to mitigate complex inductive biases such as length, sycophancy, and format, with claimed improvements in RLHF generalization.