super hub Mixed citations

Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. Most common role is background (45%).

450 Pith papers citing it

Background 45% of classified citations

open full Pith review browse 450 citing papers more from Andy Zou arXiv PDF

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 30 dataset 29 method 5 baseline 3

citation-polarity summary

background 30 use dataset 27 use method 5 baseline 3 unclear 2

claims ledger

abstract We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models

authors

Andy Zou Collin Burns Dan Hendrycks Dawn Song Mantas Mazeika Steven Basart

co-cited works

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing

cs.GT · 2026-05-29 · unverdicted · novelty 7.0

A bipartite factor graph with message-passing protocol and asymmetric damping aggregates multi-LLM predictions, cutting token use by 97% and API calls by 6X while outperforming baselines on MMLU, MMLU-Pro, GPQA, and MedMCQA.

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RHELM is a benchmark for LLM long-term memory with dynamic profiles, heterogeneous sources, and 27 memory characteristics that reveals weaknesses in existing models for multi-source aggregation and contextual reasoning.

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

ReactBench is a new benchmark with four cause-targeted tasks that uses adversarial images, hallucination-inducing queries, and Chain-of-Thought analysis to expose specific failure modes in current multimodal large language models.

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

ConMoE consolidates MoE experts into a smaller prototype pool via deterministic remapping based on contribution and replaceability, matching or beating pruning/merging baselines at 25-50% reduction on three models.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

citing papers explorer

Showing 50 of 120 citing papers after filters.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Crafting Reversible SFT Behaviors in Large Language Models cs.LG · 2026-05-07 · unverdicted · none · ref 37 · internal anchor
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
Toward Calibrated, Fair, and accurate Deepfake Detection cs.LG · 2026-06-03 · unverdicted · none · ref 58 · internal anchor
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance cs.LG · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.
ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling cs.LG · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation cs.LG · 2026-05-20 · conditional · none · ref 13 · internal anchor
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
Widening the Gap: Exploiting LLM Quantization via Outlier Injection cs.LG · 2026-05-14 · conditional · none · ref 36 · internal anchor
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training cs.LG · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
PODS is a plug-and-play oscillatory data-volume scheduler that alternates low-ratio regularization phases with high-ratio recovery phases to improve data selection efficiency across training tasks.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation cs.LG · 2026-05-14 · unverdicted · none · ref 17 · internal anchor
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs cs.LG · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
Task-Aware Calibration: Provably Optimal Decoding in LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models cs.LG · 2026-05-10 · unverdicted · none · ref 55 · internal anchor
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic cs.LG · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation cs.LG · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower attack success while preserving task performance.
LoopQ: Quantization for Recursive Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.
Dataset Watermarking for Closed LLMs with Provable Detection cs.LG · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving cs.LG · 2026-05-03 · unverdicted · none · ref 25 · 2 links · internal anchor
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders cs.LG · 2026-04-21 · unverdicted · none · ref 40 · internal anchor
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations cs.LG · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition cs.LG · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
MATU quantifies uncertainty in LLM multi-agent systems by turning reasoning trajectories into embedding matrices, stacking runs into a tensor, and decomposing it to separate sources of variability.
Path-Constrained Mixture-of-Experts cs.LG · 2026-03-18 · unverdicted · none · ref 6 · internal anchor
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 18 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
Norm Anchors Make Model Edits Last cs.LG · 2026-01-30 · conditional · none · ref 11 · internal anchor
Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.
MIDUS: Memory-Infused Depth Up-Scaling cs.LG · 2025-12-15 · unverdicted · none · ref 9 · internal anchor
MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training cs.LG · 2025-07-21 · unverdicted · none · ref 10 · internal anchor
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs cs.LG · 2024-10-09 · unverdicted · none · ref 23 · internal anchor
UQ4CT integrates functional-level uncertainty calibration into mixture-of-experts LoRA fine-tuning via a dedicated loss, cutting expected calibration error by over 25% on multiple-choice and generative QA tasks.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 136 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining cs.LG · 2026-06-29 · unverdicted · none · ref 11 · internal anchor
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection cs.LG · 2026-06-02 · unverdicted · none · ref 62 · 2 links · internal anchor
LiftQuant enables continuous bit-width LLM quantization via dimensional lifting and projection from a 1-bit lattice, allowing 2.4-bit compression of 70B models that outperforms fixed 2-bit baselines on identical hardware.
Enhancing LLM Metacognition via Cognitive Pairwise Training cs.LG · 2026-05-30 · unverdicted · none · ref 67 · internal anchor
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression cs.LG · 2026-05-30 · unverdicted · none · ref 74 · internal anchor
ProjQ constrains post-training quantization noise to a low-rank manifold through orthogonal subspace projection, enabling better compensation by LoRA adapters and preserving greater model plasticity than standard PTQ.
Contribution Weights: A Geometrical Analysis of Self-Attention Transformers cs.LG · 2026-05-29 · unverdicted · none · ref 39 · 2 links · internal anchor
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
On Effectiveness and Efficiency of Agentic Tool-calling and RL Training cs.LG · 2026-05-28 · unverdicted · none · ref 35 · internal anchor
Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.
PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
PEARL is a pedagogically aligned RL framework using a controllable student simulator, generative reward model, and stable multi-objective scheme to train Socratic tutors that outperform other open-source models on benchmarks.
Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules cs.LG · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
KOFF prunes LLMs to ~12% sparsity while adding LoRA and learned KV memories, preserving performance where plain pruning fails across 3B-8B Llama and Qwen models.
A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving cs.LG · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
MobileMoE: Scaling On-Device Mixture of Experts cs.LG · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training cs.LG · 2026-05-26 · unverdicted · none · ref 17 · internal anchor
MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 11 · internal anchor
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
PACE: Two-Timescale Self-Evolution for Small Language Model Agents cs.LG · 2026-05-21 · unverdicted · none · ref 21 · internal anchor
PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT? cs.LG · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
RL preserves a larger fraction of base model circuits than SFT during fine-tuning on scientific QA, per a new head-level differential circuit vulnerability metric, at the cost of slower adaptation.
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning cs.LG · 2026-05-20 · conditional · none · ref 35 · internal anchor
ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
Reading Calibrated Uncertainty from Language Model Trajectories cs.LG · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
The Evaluation Game: Beyond Static LLM Benchmarking cs.LG · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.
ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models cs.LG · 2026-05-16 · unverdicted · none · ref 7 · 3 links · internal anchor
ZeroUnlearn reformulates machine unlearning as knowledge re-mapping via model editing, using multiplicative updates with closed-form solutions for efficient few-shot removal of sensitive representations while preserving utility.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 6 · internal anchor
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy cs.LG · 2026-05-13 · unverdicted · none · ref 19 · 2 links · internal anchor
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion cs.LG · 2026-05-12 · unverdicted · none · ref 11 · 2 links · internal anchor
Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs cs.LG · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 122 · internal anchor
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Measuring Massive Multitask Language Understanding

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer