Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
super hub Mixed citations
Measuring Massive Multitask Language Understanding
Mixed citation behavior. Most common role is background (45%).
abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models
authors
co-cited works
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
MSQA benchmark shows LLMs exhibit cultural degradation and a locality effect where competence tracks pre-training exposure more than reasoning, and common inference-time fixes do not resolve it.
FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.
LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.
Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.
Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
citing papers explorer
No citing papers match the current filters.