Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
Measuring Massive Multitask Language Understanding
Mixed citation behavior. Most common role is background (45%).
abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA can drop to below one-tenth of its original value in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
citing papers explorer
- Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
- Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Empirical comparison across model families finds SFT often sufficient for MCQA while CPT aids OEQA metrics in French medical LLM adaptation, with cross-lingual transfer observed.
- LLM Parameters for Math Across Languages: Shared or Separate?
Mechanistic analysis of LLMs finds partial overlap in math-associated parameters across languages, concentrated in middle layers, with systematic language-dependent differences.
- Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs
A prompt perturbation approach builds comparison graphs from LLM judgments, filters inconsistent cycles or ties, and aggregates more reliable rankings.
- ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning
ARMOR-MAD adds pre-debate routing, early stopping, and outlier detection to heterogeneous multi-agent debate, yielding accuracy gains on MATH, GSM8K, MMLU, and MMLU-Pro over fixed-round baselines.
- Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning
Introduces the first structured pulmonary knowledge graph LungKG and uses it to train Lung-R1, which reaches SOTA on EMR-based pulmonary diagnosis tasks.
- HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation
HERO converts environment observations after each turn into compact diagnoses to provide aligned feedback for self-distillation, improving success rates and reducing unnecessary actions on TauBench and WebShop compared to baselines.
- Data-Driven Automation
A dynamic model of data-driven automation with heterogeneous accumulating data and spillovers derives conditions for partial versus full automation and shows asymptotic power-law decay in the labor share, generic inefficiency, and, with endogenous capital, explosive growth but stagnant long-run wages.
- PriFT: Prior-Support Guided Supervised Fine-Tuning
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
- Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
SETA decomposes parameters into task-specific and shared sparse experts with adaptive anchoring and routing regularization to improve retention and backward transfer in LLM continual learning.
- Steering LLM Viewpoints through Fabricated Evidence Injection
Ghostwriter attack injects fabricated evidence to steer LLM viewpoints, with experiments showing high success on commercial models and partial mitigation on guarded ones.
- Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models
MLLMs show stochastic collapse with top-1 probabilities up to 97% and low randomness indices when choosing among equivalent options.
- SCOPE: Real-Time Natural Language Camera Agent at the Edge
SCOPE introduces an edge-deployable natural-language PTZ camera agent, a simulation benchmark, and evaluations showing that stronger small language models reduce hallucinations while perception remains the main bottleneck.
- Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.
- MESA: Improving MoE Safety Alignment via Decentralized Expertise
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
- GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.
- MixFP4: Enhancing NVFP4 with Adaptive FP4/INT4 Block Representations
MixFP4 extends NVFP4 by adaptively selecting between two FP4 micro-formats per block using repurposed scale sign bits and a unified E2M2 compute path, claiming better accuracy than standard NVFP4 at 3.1% area and 1.5% power overhead.
- Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
Toxic prompt perturbations reduce LLM factual accuracy on three benchmarks and selectively amplify perturbation-sensitive nodes in attribution graphs.
- SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning
SLAT applies segment-level adaptive trimming in RL to reduce CoT reasoning length by 50% while maintaining competitive accuracy on benchmarks.
- Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.
- Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
Empirical tests on four new frontier LLMs show cooperative equilibria favored in most balanced conditions, with provider identity correlating more strongly with outcomes than model generation.
- Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
JTS trains reasoning models via supervised warm-up and missing-premise RL to make an explicit answerability commitment that triggers early termination on unanswerable inputs, raising Abstention@Detection near saturation.
- Learning to Assign Prediction Tasks to Agents with Capacity Constraints
Sequential explore-exploit algorithms for assigning tasks to capacity-constrained agents demonstrate performance gains over non-contextual baselines on tabular, image, and text tasks with both LLMs and humans.
- GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization
GeneralThinker uses likelihood of ground-truth answers for dense answer-conditioned optimization and token-level credit assignment to improve language model reasoning without domain-specific verifiers.
- Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM
Dynamic-dLLM achieves over 3x average inference speedup on dLLMs like LLaDA-8B via adaptive cache budgets and decoding thresholds while preserving benchmark performance.
- KARMA: Karma-Aligned Reward Model Adaptation
KARMA adapts reward models from Reddit karma data to align LLMs with conversational pragmatics, finding that context-only rewards outperform karma-predictive ones downstream while reducing factuality across conditions.
- Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
- Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
Self-generated replay from language models nearly eliminates catastrophic forgetting during finetuning except when models are pretrained close to saturation.
- ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy
ATOM uses a nucleus-electron hierarchy and task-driven RL to generate budget-controllable multi-agent collaboration graphs for LLMs, claiming SOTA performance with up to 30% better token efficiency on six benchmarks.
- ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
ReLoRA reduces time-to-readiness for LoRA adapters on updated LLMs by up to 8.9x through adaptive Bayesian initialization and scheduled regularization while improving accuracy by up to 4.6%.
- Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.
- Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
SymNoise applies symmetric noise to embeddings during instruction fine-tuning and reports 6.7% higher AlpacaEval scores than NEFTune on LLaMA-2-7B.
- ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
A rule-based controller selects among FP16, quantized, speculative, and hybrid modes for single-GPU LLM inference, delivering 2.1x latency speedup and 51.7% lower energy per token with near-baseline accuracy on Llama-3.1-8B.
- Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.
- SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules
SciCore-Mol augments LLMs with three integrated modules for molecular perception, latent diffusion generation, and reaction reasoning, claiming an 8B open model competes with or exceeds proprietary systems on chemical tasks.
- AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
AGPO adaptively sets trust-region size and exploration temperature from group reward dispersion, entropy, and KL drift, yielding higher scores than PPO and GRPO on nine math benchmarks under fixed token budget.
- Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
- Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection
Adapts the sequential probability ratio test (SPRT) as a compute governor for multi-agent LLM debates using Beta-modeled consensus scores from an LLM judge, yielding a 3.7x reduction in calls on GSM8K at a cost of 2 pp in accuracy versus fixed rounds.
- Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
- Mixture of Experts for Low-Resource LLMs
Pre-trained MoE models exhibit deep-layer routing collapse for low-resource languages like Hebrew, largely corrected by continual pre-training on balanced bilingual data, with consistent patterns observed in Japanese.
- D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning
D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reasoning with under 2K real samples.
- NGM: A Plug-and-Play Training-Free Memory Module for LLMs
NGM is a plug-and-play n-gram memory module that encodes n-grams from pretrained embeddings and gates their injection to improve LLM performance by 0.5-1.2 points on average across eight benchmarks.
- Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Tokenizer fertility varies 1.6x across models on Ukrainian legal text, Qwen uses 60% more tokens than Llama-family models, zero-shot outperforms few-shot by up to 26 points, and pre-war classifiers lose 27.9 points on invasion-era decisions.
- A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
- Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
- Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
- Efficient Pre-Training with Token Superposition
Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.
- HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices
HCInfer recovers up to 5.2% accuracy over compressed LLMs and delivers 10.4x speedup versus full-precision models by offloading compensation parameters to CPU with async execution on resource-limited hardware.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.