Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.
Title resolution pending
27 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.
IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.
Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
EduArt is a new benchmark of 871 educational questions that reveals multimodal LLMs perform near ceiling on multiple-choice art history items but drop sharply on open completion and error identification tasks.
OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.
AGC-Bench introduces a multi-domain creativity benchmark for LLMs, recovers a general 'c' factor explaining 81.5% of variance, and finds humans still outperform top models on matched tasks.
MMGist filters 23,250 items from 18 benchmarks down to 7,262 using three-stage pipeline, preserving model rankings (Spearman ρ=0.98) while cutting items 69% and raising discrimination 78%.
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
ProjQ constrains post-training quantization noise to a low-rank manifold through orthogonal subspace projection, enabling better compensation by LoRA adapters and preserving greater model plasticity than standard PTQ.
Models benchmarking as principal-agent game, derives welfare loss from welfare alignment, improvability and variance, and applies an audit framework to OLMES items.
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
Bayesian deep learning method rankings are unreliable under data scarcity, reversing across datasets and sample sizes, and a hierarchical Bayesian framework with predictive detectability curves is needed to assess evaluation sufficiency.
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
citing papers explorer
-
Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.