Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
super hub Mixed citations
Measuring Massive Multitask Language Understanding
Mixed citation behavior. Most common role is background (45%).
abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models
authors
co-cited works
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
citing papers explorer
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.
-
$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space
Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.
-
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
-
From Question Answering to Task Completion: A Survey on Agent System and Harness Design
Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.
-
Agent Skill Evaluation and Evolution: Frameworks and Benchmarks
The paper surveys skill evolution frameworks in agentic systems, grouping them into execution feedback, trajectory distillation, compression, and reinforcement learning paradigms while analyzing gaps across six benchmark categories.
-
Distilling Safe LLM Systems via Soft Prompts for On Device Settings
Soft prompt distillation with total variation and KL divergence transfers safety behaviors from guard models to on-device LLMs and outperforms LoRA adapters, steering vectors, and direct optimization in safety-usefulness trade-offs with minimal inference cost.
-
ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning
ConSteer-RL adds a confidence-aware reward derived from per-token probabilities to GRPO-based RLVR and reports 2.3-4% average gains over baselines across model scales.
-
Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models
HERALD selectively encrypts sensitive tokens via medical NER, POS policies, and deterministic ciphertext substitution to enable privacy-preserving clinical LLM use while recovering near-plaintext task performance.
-
The Shape of Wisdom: Decision Trajectories in Language Models
A 9,000-trajectory study across three LLMs finds correctness and stability differ, with the largest group unstable-correct and attention scalars aligning better than MLPs in stable cases.
-
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.
-
Mind Your Tone: Does Tone Alter LLM Performance?
Tonal variations in prompts cause systematic but model-dependent accuracy changes in LLMs on objective multiple-choice questions.
-
AI-Model Network: Concept, Current State and Future
The paper introduces the concept, vision, and hierarchical architecture of a worldwide AI-model network (AI-ModelNet) for model interconnection, sharing, and collaboration, validated via a prototype.
-
MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional
MDIA, a specialty-routed 7-node multi-agent system, reports 0.6272 accuracy on 525 HealthBench Professional cases using GPT-5.4, outperforming the ChatGPT for Clinicians baseline by 3.72 points and attributing the lift to architectural components.
-
Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play
Gemini-3.1-pro-preview won 20 of 32 Risk games through superior objective tracking and execution conversion, while a hybrid test with fixed execution showed near-equal planner performance across providers.
-
Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
QLoRA fine-tuning on tool-use data enables 4B-parameter models to perform structured planning without tool catalogs in prompts, outperforming informed baselines on AssetOpsBench while reducing input length by 82.6%.
-
SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
SEMA-RAG is a three-agent self-evolving RAG system that reports an average 6.46-point accuracy gain over the strongest baseline across five medical QA benchmarks and five LLM backbones.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
-
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.
-
Gyan: An Explainable Neuro-Symbolic Language Model
Gyan is a novel explainable non-transformer language model that achieves SOTA results on multiple datasets by mimicking human-like compositional context and world models.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Born-Qualified: An Autonomous Framework for Deploying Advanced Energy and Electronic Materials
A conceptual framework called born-qualified autonomous development embeds industrial viability constraints into the materials discovery process for energy and electronics.
-
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
-
MedGemma 1.5 Technical Report
MedGemma 1.5 4B reports absolute gains of 11% on 3D MRI classification, 3% on 3D CT, 47% macro F1 on pathology slides, 35% IoU on anatomical localization, and 5-22% on clinical QA tasks over MedGemma 1.
-
Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning
RAG-adapted LLaMA-3-8B outperforms both baseline and fine-tuned models on expert-rated accuracy (75.5%), relevance (90.8%), and overall preference (85.2%) for additive manufacturing questions.
-
Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.
-
Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
Gemma 3 Technical Report
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
A hierarchical statistical model demonstrates that multiple LLM generations per prompt improve benchmark score accuracy, reduce variance, and enable prompt-level difficulty scoring via correct ratios.
-
Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models
AI model evaluations for biological capabilities should prioritize high-consequence risks like pandemics, informed by life sciences dual-use experience, and occur prior to deployment to enable biosafety measures.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
The paper surveys LLM-based multi-agent systems, covering simulated domains, agent profiling and communication, mechanisms for capacity growth, and common benchmarks.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents
LuckyStar 111B adapts Cohere's Command A model with four scaling techniques to improve tool-use, math reasoning, and NL2SQL in Korean-English while preserving general instruction following.
-
Customized Generative AI Agent for Transportation Engineering Practice: A Development and Continued Pre-training Guideline
A framework is described for adapting six LLMs to transportation engineering via LoRA-based continued pretraining on domain documents, with two models showing strongest results on BLEU-4 and ROUGE metrics.
-
Token-Operations-Oriented Inference Optimization Techniques for Large Models
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
-
Mellum2 Technical Report
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.
-
Apertus LLM Family Expansion via Distillation and Quantization
Distillation and quantization expand the Apertus 8B LLM into a family of models up to 4B parameters with claimed strong accuracy and cost efficiency.
-
OpenCompass: A Universal Evaluation Platform for Large Language Models
OpenCompass is presented as a one-stop, scalable, high-concurrency LLM evaluation platform with modular architecture supporting multiple domains and evaluator types.
-
Phoenix-VL 1.5 Medium Technical Report
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
-
Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa
A domain-specific LLM for TB care in South Africa, created by fine-tuning BioMistral-7B with QLoRA and GraphRAG on local guidelines, shows improved contextual alignment over the base model.
-
When control meets large language models: From words to dynamics
The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
-
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
-
Survey on Evaluation of LLM-based Agents
A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.