Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. Most common role is background (44%).

508 Pith papers citing it

Background 44% of classified citations

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

citation-role summary

background 30 · dataset 30 · method 5 · baseline 3
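The page does not say how the 44% figure is derived. A minimal sketch, assuming the share is each role's count divided by the total of the four role counts listed above (the function name `role_shares` is hypothetical, not part of any Pith tooling):

```python
# Counts copied from the citation-role summary above.
role_counts = {"background": 30, "dataset": 30, "method": 5, "baseline": 3}


def role_shares(counts):
    """Return each role's share of the classified citations, as a whole percent.

    Assumption: the denominator is the sum of the listed role counts,
    not the full set of 508 citing papers.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(sum(role_counts.values()))  # number of classified citations
print(shares)
```

Under that assumption there are 68 classified citations and background comes out at 44%, matching the headline figure. The dataset role has the same count of 30, so it has the same share.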

citation-polarity summary

background 30 · use dataset 28 · use method 5 · baseline 3 · unclear 2

authors

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

CEO-Bench evaluates LLMs on CEO-level strategic resource reallocation via multi-role agent simulations, showing high structural validity but sharp divergence on strategic calibration across five frontier models on 13 scenarios.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.

citing papers explorer

Showing 50 of 508 citing papers.

Parcae: Scaling Laws For Stable Looped Language Models
cs.LG · 2026-04-14 · unverdicted · none · ref 32 · internal anchor
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
cs.CL · 2026-04-13 · unverdicted · none · ref 11 · internal anchor
RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game showing smaller instruction-tuned models can be more consistent than larger ones.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
cs.AI · 2026-04-13 · unverdicted · none · ref 45 · internal anchor
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
cs.LG · 2026-04-12 · unverdicted · none · ref 19 · internal anchor
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
cs.CL · 2026-04-09 · conditional · none · ref 32 · internal anchor
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

Sensitivity-Positional Co-Localization in GQA Transformers
cs.CL · 2026-04-09 · unverdicted · none · ref 17 · internal anchor
In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU, GPQA, HumanEval+, MATH, MGSM and ARC.

Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
cs.DC · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, improving accuracy by up to 46.46%.

In-Place Test-Time Training
cs.LG · 2026-04-07 · conditional · none · ref 28 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents
cs.CL · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
JailAgent red-teams LLM agents by hijacking reasoning trajectories and tightening constraints without prompt changes, claiming strong cross-model and cross-scenario performance.

Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting
cs.CL · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
CoT2Edit trains LLMs to reason over edited knowledge using agent-generated CoTs, SFT, GRPO, and RAG, achieving generalization across six editing scenarios on three models.

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
cs.IR · 2026-04-07 · unverdicted · none · ref 15 · internal anchor
STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
cs.LG · 2026-04-04 · conditional · none · ref 5 · internal anchor
Gradient-guided layer selection for LoRA yields 15-28% training speedup with matched downstream results on MMLU, GSM8K, and HumanEval across 14 models from 0.5B to 72B parameters.

Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
cs.LG · 2026-03-09 · unverdicted · none · ref 14 · internal anchor
CAMEL is a scaling law capturing nonlinear model-size and mixture interactions to extrapolate optimal data mixtures for large LLMs from small-model experiments, reducing optimization cost by 50% and improving benchmarks by up to 3%.

Data Agent: Learning to Select Data via End-to-End Dynamic Optimization
cs.LG · 2026-03-08 · unverdicted · none · ref 5 · internal anchor
Data Agent learns a co-evolving sample selection policy end-to-end that accelerates training by over 50% on ImageNet-1k and MMLU with no performance loss.

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
cs.LG · 2026-03-05 · conditional · none · ref 8 · internal anchor
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
cs.MA · 2026-03-04 · unverdicted · none · ref 20 · internal anchor
A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.

BarrierSteer: LLM Safety via Learning Barrier Steering
cs.LG · 2026-02-23 · unverdicted · none · ref 11 · internal anchor
BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
cs.LG · 2026-02-17 · unverdicted · none · ref 11 · internal anchor
CrispEdit edits LLMs via low-curvature projections using Bregman divergence and K-FAC approximations, achieving high edit success with under 1% average capability degradation.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
cs.CV · 2026-02-12 · unverdicted · none · ref 22 · internal anchor
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
cs.CL · 2026-02-10 · unverdicted · none · ref 4 · internal anchor
ATTNPO guides process-supervised RL with intrinsic attention signals to shorten reasoning traces while raising accuracy on nine benchmarks.

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
cs.CL · 2026-02-06 · unverdicted · none · ref 9 · internal anchor
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.

FBS: Modeling Native Parallel Reading inside a Transformer
cs.AI · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
FBS introduces a causal trainable loop via PAW, CH, and SG modules to model native parallel reading in Transformers, yielding better quality-efficiency on benchmarks with complementary ablations.

Collaborative Parameter Learning: Mitigating Forgetting via Parameter-Level Gradient Analysis
cs.LG · 2026-01-29 · conditional · none · ref 6 · internal anchor
Collaborative Parameter Learning freezes 50-75% of parameters whose updates cause forgetting and updates only the 25-50% that mitigate it, allowing LLMs to learn 20-48% more new questions with negligible forgetting and lower compute cost.

Sentipolis: Emotion-Aware Agents for Social Simulations
cs.AI · 2026-01-25 · unverdicted · none · ref 1 · internal anchor
Sentipolis equips LLM agents with continuous PAD emotional states, dual-speed dynamics, and memory coupling to improve emotional continuity and grounded behavior in social simulations.

Tracing Moral Foundations in Large Language Models
cs.CL · 2026-01-09 · unverdicted · none · ref 4 · 2 links · internal anchor
LLMs encode moral foundations in human-aligned, layered representations that arise from pretraining and can be steered via dense vectors or sparse SAE features.

ALL-FEM: Agentic Large Language models Fine-tuned for Finite Element Methods
cs.CE · 2026-01-08 · unverdicted · none · ref 48 · internal anchor
ALL-FEM fine-tunes LLMs on a corpus of verified FEniCS scripts and uses multi-agent workflows to automate finite element code generation, achieving 71.79% success on 39 benchmarks across elasticity, flow, and coupled problems.

mHC: Manifold-Constrained Hyper-Connections
cs.CL · 2025-12-31 · unverdicted · none · ref 37 · internal anchor
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
cs.CR · 2025-12-21 · unverdicted · none · ref 3 · internal anchor
A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.

LLaDA2.0: Scaling Up Diffusion Language Models to 100B
cs.LG · 2025-12-10 · conditional · none · ref 12 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
cs.AI · 2025-12-09 · unverdicted · none · ref 16 · internal anchor
State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.

TRINITY: An Evolved LLM Coordinator
cs.LG · 2025-12-04 · unverdicted · none · ref 7 · internal anchor
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL · 2025-11-25 · unverdicted · none · ref 1 · internal anchor
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

The Impact of Off-Policy Training Data on Probe Generalisation
cs.AI · 2025-11-21 · unverdicted · none · ref 16 · internal anchor
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.

Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL · 2025-10-30 · unverdicted · none · ref 37 · internal anchor
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Dr.LLM: Dynamic Layer Routing in LLMs
cs.CL · 2025-10-14 · unverdicted · none · ref 8 · internal anchor
Dr. LLM retrofits frozen LLMs with MCTS-supervised per-layer routers for skip/execute/repeat decisions, delivering up to +3.4% accuracy and 5-layer savings on reasoning tasks with strong out-of-domain generalization.

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
cs.CL · 2025-10-09 · unverdicted · none · ref 6 · internal anchor
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
cs.DC · 2025-10-07 · conditional · none · ref 23 · internal anchor
Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI · 2025-10-05 · unverdicted · none · ref 6 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
cs.LG · 2025-10-01 · conditional · none · ref 52 · internal anchor
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.

Artificial Phantasia: Emergent Mental Imagery in Large Language Models
cs.AI · 2025-09-27 · unverdicted · none · ref 23 · internal anchor
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

Multiplayer Nash Preference Optimization
cs.AI · 2025-09-27 · unverdicted · none · ref 12 · internal anchor
MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.

Dream 7B: Diffusion Large Language Models
cs.CL · 2025-08-21 · unverdicted · none · ref 10 · internal anchor
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.

InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
cs.CL · 2025-08-12 · unverdicted · none · ref 16 · internal anchor
InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI · 2025-07-30 · unverdicted · none · ref 20 · internal anchor
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
cs.CL · 2025-07-20 · unverdicted · none · ref 11 · internal anchor
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.

Lizard: An Efficient Linearization Framework for Large Language Models
cs.CL · 2025-07-11 · unverdicted · none · ref 10 · internal anchor
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
cs.CV · 2025-07-02 · unverdicted · none · ref 25 · internal anchor
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs
cs.LG · 2025-06-15 · unverdicted · none · ref 15 · internal anchor
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
cs.CL · 2025-06-13 · conditional · none · ref 13 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
cs.CV · 2025-05-26 · unverdicted · none · ref 28 · internal anchor
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
