Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. The most common role is background (45% of classified citations).

380 Pith papers cite it.

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

citation-role summary

background: 30 · dataset: 29 · method: 5 · baseline: 3
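
The 45% background share quoted for this paper can be reproduced from the role counts above. The sketch below is only an illustration: it assumes the percentage is taken over the 67 classified citations (the sum of the four role counts) rather than over all 380 citing papers, and the names `role_counts` and `role_shares` are hypothetical, not part of any Pith tooling.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 30, "dataset": 29, "method": 5, "baseline": 3}


def role_shares(counts):
    """Return each role's share of classified citations as a whole percentage."""
    # Assumption: the denominator is the number of classified citations,
    # i.e. the sum of the role counts, not the total number of citing papers.
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
top_role = max(role_counts, key=role_counts.get)
print(top_role, shares[top_role])
```

Under that assumption the most common role is background, at 30 of 67 classified citations. Because each share is rounded separately, the four percentages need not sum to exactly 100.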

citation-polarity summary

background: 30 · use dataset: 27 · use method: 5 · baseline: 3 · unclear: 2

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing

cs.GT · 2026-05-29 · unverdicted · novelty 7.0

A bipartite factor graph with a message-passing protocol and asymmetric damping aggregates multi-LLM predictions, cutting token use by 97% and API calls by 6x while outperforming baselines on MMLU, MMLU-Pro, GPQA, and MedMCQA.

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RHELM is a benchmark for LLM long-term memory with dynamic profiles, heterogeneous sources, and 27 memory characteristics that reveals weaknesses in existing models for multi-source aggregation and contextual reasoning.

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

ReactBench is a new benchmark with four cause-targeted tasks that uses adversarial images, hallucination-inducing queries, and Chain-of-Thought analysis to expose specific failure modes in current multimodal large language models.

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

ConMoE consolidates MoE experts into a smaller prototype pool via deterministic remapping based on contribution and replaceability, matching or beating pruning/merging baselines at 25-50% reduction on three models.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

Task-Aware Calibration: Provably Optimal Decoding in LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

citing papers explorer

Showing 50 of 150 citing papers after filters.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 29 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 111 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory cs.CL · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
RHELM is a benchmark for LLM long-term memory with dynamic profiles, heterogeneous sources, and 27 memory characteristics that reveals weaknesses in existing models for multi-source aggregation and contextual reasoning.
Self-Policy Distillation via Capability-Selective Subspace Projection cs.CL · 2026-05-21 · unverdicted · none · ref 31 · internal anchor
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights cs.CL · 2026-05-13 · unverdicted · none · ref 51 · internal anchor
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
Inducing Artificial Uncertainty in Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 81 · internal anchor
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages cs.CL · 2026-05-05 · accept · none · ref 35 · internal anchor
Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 47 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them cs.CL · 2026-04-18 · unverdicted · none · ref 22 · internal anchor
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options cs.CL · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.
Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning cs.CL · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
GRIP integrates retrieval into autoregressive generation through self-triggered control tokens for dynamic query planning, outperforming RAG baselines on QA benchmarks with fewer parameters than GPT-4o.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives cs.CL · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
DeonticBench: A Benchmark for Reasoning over Rules cs.CL · 2026-04-06 · unverdicted · none · ref 15 · internal anchor
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models cs.CL · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context cs.CL · 2026-03-18 · conditional · none · ref 2 · internal anchor
KMMMU benchmark demonstrates that leading multimodal models achieve at most 52.42% accuracy on hard Korean exam questions, highlighting limitations in non-English multimodal understanding.
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses cs.CL · 2026-03-11 · unverdicted · none · ref 25 · internal anchor
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective cs.CL · 2026-02-03 · unverdicted · none · ref 27 · internal anchor
A learned transformation matrix minimizes CMI in teacher logits to degrade distillation performance while preserving task accuracy.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning cs.CL · 2025-03-06 · unverdicted · none · ref 8 · internal anchor
LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
Stay Focused: Problem Drift in Multi-Agent Debate cs.CL · 2025-02-26 · unverdicted · none · ref 4 · internal anchor
The paper defines and measures 'problem drift' in multi-agent LLM debates across tasks and proposes DRIFTJudge and DRIFTPolicy as baselines to detect and reduce it.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 37 · internal anchor
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine cs.CL · 2024-10-17 · unverdicted · none · ref 1 · internal anchor
Introduces RiTeK dataset for complex LLM reasoning over medical TKGs with expert-validated queries and shows existing retrievers struggle on the benchmark.
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing cs.CL · 2024-06-12 · unverdicted · none · ref 111 · internal anchor
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 99 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens cs.CL · 2024-02-21 · unverdicted · none · ref 5 · internal anchor
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
Self-Rewarding Language Models cs.CL · 2024-01-18 · conditional · none · ref 53 · internal anchor
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA cs.CL · 2023-11-28 · unverdicted · none · ref 32 · internal anchor
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
WizardLM: Empowering large pre-trained language models to follow complex instructions cs.CL · 2023-04-24 · conditional · none · ref 18 · internal anchor
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
Capabilities of GPT-4 on Medical Challenge Problems cs.CL · 2023-03-20 · unverdicted · none · ref 6 · internal anchor
GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
Fine-Tuning Improves Information Conveyance in Language Models cs.CL · 2026-05-29 · unverdicted · none · ref 12 · internal anchor
Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.
Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty cs.CL · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Examines uncertainty alignment with humans in LLM behavior and activations, its co-occurrence with calibration on multiple-choice and open-ended factual tasks, and effects of instruct fine-tuning.
Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Safety enforcement in aligned MoE LLMs is localized to specific experts and can be altered independently of the model's topic-driven routing patterns via a new red-teaming method called RASET.
ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains cs.CL · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
K2V extends RLVR to knowledge-intensive domains by synthesizing verifiable data and verifying reasoning processes, yielding improved domain reasoning with preserved general capabilities.
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models cs.CL · 2026-05-14 · unverdicted · none · ref 13 · 2 links · internal anchor
FINESSE-Bench is a new hierarchical benchmark suite combining certification-style exams, trading tasks, and a Russian olympiad set to evaluate LLMs on financial competencies at multiple difficulty levels.
Edit-Based Refinement for Parallel Masked Diffusion Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 12 · internal anchor
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
Decomposing and Steering Functional Metacognition in Large Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 53 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
Continuous Latent Diffusion Language Model cs.CL · 2026-05-07 · unverdicted · none · ref 33 · internal anchor
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine cs.CL · 2026-05-01 · unverdicted · none · ref 10 · 2 links · internal anchor
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning cs.CL · 2026-05-01 · unverdicted · none · ref 5 · internal anchor
TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level baselines on TOFU and WMDP.
Mixture of Heterogeneous Grouped Experts for Language Modeling cs.CL · 2026-04-25 · unverdicted · none · ref 8 · internal anchor
MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization cs.CL · 2026-04-21 · unverdicted · none · ref 33 · internal anchor
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
Are Large Language Models Economically Viable for Industry Deployment? cs.CL · 2026-04-21 · unverdicted · none · ref 50 · internal anchor
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL cs.CL · 2026-04-18 · unverdicted · none · ref 2 · internal anchor
A 3B model trained via clarification-aware RLVR improves abstention and post-refusal clarification on unanswerable queries while matching larger models like DeepSeek-R1 on benchmarks.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models cs.CL · 2026-04-18 · unverdicted · none · ref 11 · internal anchor
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance cs.CL · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
A training-free method improves epistemic faithfulness of LLM textual explanations by guiding generation with attribution-based attention interventions.
