tinyBenchmarks: evaluating LLMs with fewer examples · 2024 · arXiv 2402.14992

27 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

cs.LG · 2026-06-10 · conditional · novelty 7.0

Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

cs.CL · 2026-07-02 · unverdicted · novelty 6.0

EduArt is a new benchmark of 871 educational questions that reveals multimodal LLMs perform near ceiling on multiple-choice art history items but drop sharply on open completion and error identification tasks.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

cs.DC · 2026-07-02 · unverdicted · novelty 6.0

OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

AGC-Bench: Measuring Artificial General Creativity

cs.CL · 2026-07-01 · unverdicted · novelty 6.0 · 2 refs

AGC-Bench introduces a multi-domain creativity benchmark for LLMs, recovers a general 'c' factor explaining 81.5% of variance, and finds humans still outperform top models on matched tasks.

MMGist: A Comprehensive Multimodal Benchmark for 2027

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

MMGist filters 23,250 items from 18 benchmarks down to 7,262 using a three-stage pipeline, preserving model rankings (Spearman ρ=0.98) while cutting items by 69% and raising discrimination by 78%.

Validity Threats for Foundation Model Research

cs.LG · 2026-06-03 · accept · novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.

ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

ProjQ constrains post-training quantization noise to a low-rank manifold through orthogonal subspace projection, enabling better compensation by LoRA adapters and preserving greater model plasticity than standard PTQ.

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Models benchmarking as a principal-agent game, derives welfare loss from welfare alignment, improvability, and variance, and applies an audit framework to OLMES items.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

Efficient Evaluation of LLM Performance with Statistical Guarantees

stat.ML · 2026-01-28 · unverdicted · novelty 6.0

Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

cs.CL · 2025-07-03 · unverdicted · novelty 6.0

A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

cs.CL · 2024-07-17 · unverdicted · novelty 6.0

LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

cs.CR · 2026-05-26 · unverdicted · novelty 5.0

Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

cs.CL · 2025-09-29 · conditional · novelty 5.0

MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings than accuracy alone and revealing domain heterogeneity.

Small Language Models are the Future of Agentic AI

cs.AI · 2025-06-02 · unverdicted · novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

LLM-Safety Evaluations Lack Robustness

cs.CR · 2025-03-04 · unverdicted · novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

citing papers explorer

Showing 27 of 27 citing papers.

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results
cs.AI · 2026-06-12 · unverdicted · none · ref 79
Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
cs.LG · 2026-06-10 · conditional · none · ref 25
Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
cs.LG · 2026-05-29 · unverdicted · none · ref 21
IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
cs.CL · 2026-04-20 · unverdicted · none · ref 4
Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.

Activation Steering with a Feedback Controller
cs.LG · 2025-10-05 · unverdicted · none · ref 19
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Refusal in Language Models Is Mediated by a Single Direction
cs.LG · 2024-06-17 · accept · none · ref 173
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
cs.CL · 2026-07-02 · unverdicted · none · ref 36
EduArt is a new benchmark of 871 educational questions that reveals multimodal LLMs perform near ceiling on multiple-choice art history items but drop sharply on open completion and error identification tasks.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters
cs.DC · 2026-07-02 · unverdicted · none · ref 12
OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.

AGC-Bench: Measuring Artificial General Creativity
cs.CL · 2026-07-01 · unverdicted · none · ref 36 · 2 links
AGC-Bench introduces a multi-domain creativity benchmark for LLMs, recovers a general 'c' factor explaining 81.5% of variance, and finds humans still outperform top models on matched tasks.

MMGist: A Comprehensive Multimodal Benchmark for 2027
cs.CV · 2026-06-21 · unverdicted · none · ref 6
MMGist filters 23,250 items from 18 benchmarks down to 7,262 using a three-stage pipeline, preserving model rankings (Spearman ρ=0.98) while cutting items by 69% and raising discrimination by 78%.

Validity Threats for Foundation Model Research
cs.LG · 2026-06-03 · accept · none · ref 78
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
cs.CL · 2026-05-31 · unverdicted · none · ref 21
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.

ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression
cs.LG · 2026-05-30 · unverdicted · none · ref 70
ProjQ constrains post-training quantization noise to a low-rank manifold through orthogonal subspace projection, enabling better compensation by LoRA adapters and preserving greater model plasticity than standard PTQ.

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation
cs.LG · 2026-05-29 · unverdicted · none · ref 37
Models benchmarking as a principal-agent game, derives welfare loss from welfare alignment, improvability, and variance, and applies an audit framework to OLMES items.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
cs.AI · 2026-05-07 · unverdicted · none · ref 14
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

Minimizing Collateral Damage in Activation Steering
cs.LG · 2026-05-01 · unverdicted · none · ref 25
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
cs.AI · 2026-04-13 · unverdicted · none · ref 21
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

Efficient Evaluation of LLM Performance with Statistical Guarantees
stat.ML · 2026-01-28 · unverdicted · none · ref 15
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
cs.CL · 2025-07-03 · unverdicted · none · ref 26
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
cs.CL · 2024-07-17 · unverdicted · none · ref 7
LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models
cs.CR · 2026-05-26 · unverdicted · none · ref 32
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
cs.CL · 2025-09-29 · conditional · none · ref 21
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings than accuracy alone and revealing domain heterogeneity.

Small Language Models are the Future of Agentic AI
cs.AI · 2025-06-02 · unverdicted · none · ref 64
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

LLM-Safety Evaluations Lack Robustness
cs.CR · 2025-03-04 · unverdicted · none · ref 43
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
cs.CL · 2025-02-13 · unverdicted · none · ref 2
A hierarchical statistical model demonstrates that multiple LLM generations per prompt improve benchmark score accuracy, reduce variance, and enable prompt-level difficulty scoring via correct ratios.

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
cs.LG · 2026-04-25 · unreviewed · ref 59

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI · 2026-04-20 · unreviewed · ref 125
