super hub Baseline reference

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Hyung Won Chung, Mirac Suzgun, Nathan Scales, Sebastian Gehrmann, Yi Tay · 2022 · cs.CL · arXiv 2210.09261

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

112 Pith papers citing it

Baseline 55% of classified citations

open full Pith review browse 112 citing papers more from Hyung Won Chung arXiv PDF

abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the task for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 12 background 9 method 1

citation-polarity summary

use dataset 12 background 10

claims ledger

abstract BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we

authors

Hyung Won Chung Mirac Suzgun Nathanael Sch\"arli Nathan Scales Sebastian Gehrmann Yi Tay

co-cited works

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Logic-Regularized Verifier Elicits Reasoning from LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

Verifiable Counterfactual Supervision for Process Reward Models

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

Presents verifiable counterfactual process supervision that generates annotated trajectories via template-aware error injection on symbolic chains, improving Best-of-8 reranking on logical reasoning benchmarks with preliminary math transfer.

Large Language Models Exhibit Normative Conformity

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

MARS: Enabling Autoregressive Models Multi-Token Generation

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

cs.CL · 2026-03-11 · unverdicted · novelty 7.0

PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

cs.CL · 2023-09-15 · unverdicted · novelty 7.0

EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.

Large Language Models as Optimizers

cs.LG · 2023-09-07 · unverdicted · novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

citing papers explorer

Showing 50 of 51 citing papers after filters.

Will Scaling Improve Social Simulation with LLMs? cs.CL · 2026-07-02 · conditional · none · ref 36 · internal anchor
Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.
Meta-Benchmarks for Financial-Services LLM Evaluation cs.AI · 2026-07-02 · unverdicted · none · ref 23 · internal anchor
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents cs.AI · 2026-06-29 · unverdicted · none · ref 36 · internal anchor
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
Beyond Prediction: Tail-Aware Scheduling for LLM Inference cs.LG · 2026-06-16 · unverdicted · none · ref 46 · internal anchor
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
Large Language Model Selection with Limited Annotations cs.CL · 2026-05-24 · unverdicted · none · ref 104 · internal anchor
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts cs.AI · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability cs.CL · 2026-05-12 · unverdicted · none · ref 52 · internal anchor
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 34 · internal anchor
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Logic-Regularized Verifier Elicits Reasoning from LLMs cs.CL · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV · 2026-05-07 · conditional · none · ref 29 · internal anchor
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Verifiable Counterfactual Supervision for Process Reward Models cs.AI · 2026-05-04 · unverdicted · none · ref 11 · internal anchor
Presents verifiable counterfactual process supervision that generates annotated trajectories via template-aware error injection on symbolic chains, improving Best-of-8 reranking on logical reasoning benchmarks with preliminary math transfer.
Large Language Models Exhibit Normative Conformity cs.AI · 2026-04-21 · unverdicted · none · ref 23 · internal anchor
Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration cs.AI · 2026-04-17 · unverdicted · none · ref 54 · internal anchor
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
MARS: Enabling Autoregressive Models Multi-Token Generation cs.CL · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses cs.CL · 2026-03-11 · unverdicted · none · ref 23 · internal anchor
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs cs.LG · 2026-06-26 · unverdicted · none · ref 145 · internal anchor
Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.
SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference cs.LG · 2026-06-25 · unverdicted · none · ref 93 · internal anchor
SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.
Improved Large Language Diffusion Models cs.CL · 2026-06-24 · unverdicted · none · ref 39 · internal anchor
iLLaDA is an 8B masked diffusion LM trained from scratch with bidirectional attention, reporting gains of 14-21 points on BBH, ARC, MATH and HumanEval over prior diffusion models while remaining competitive with Qwen2.5-7B.
DRIFT: Refining Instruction Data via On-Policy Data Attribution cs.LG · 2026-06-16 · unverdicted · none · ref 20 · internal anchor
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
Redesign Mixture-of-Experts Routers with Manifold Power Iteration cs.LG · 2026-06-10 · unverdicted · none · ref 14 · internal anchor
Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.
The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI · 2026-06-04 · conditional · none · ref 35 · internal anchor
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching cs.AI · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices cs.LG · 2026-05-11 · unverdicted · none · ref 54 · 3 links · internal anchor
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization cs.LG · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 66 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 37 · 2 links · internal anchor
AgentPSO applies a particle-swarm-inspired update rule to evolve natural-language reasoning skills across multiple LLM agents, yielding gains over static and test-time multi-agent baselines with cross-benchmark transfer.
Sanity Checks for Long-Form Hallucination Detection cs.CL · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations cs.CL · 2026-05-08 · unverdicted · none · ref 30 · 2 links · internal anchor
GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
Rethinking LLM Ensembling from the Perspective of Mixture Models cs.LG · 2026-05-01 · unverdicted · none · ref 13 · 2 links · internal anchor
ME reinterprets LLM ensembling as token-level sampling from a mixture model, enabling single-model invocation per token with claimed mathematical equivalence to full ensembling and measured speedups of 1.78x-2.68x.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 41 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis cs.AI · 2026-04-20 · unverdicted · none · ref 6 · internal anchor
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
Agentic Frameworks for Reasoning Tasks: An Empirical Study cs.AI · 2026-04-17 · unverdicted · none · ref 10 · internal anchor
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge cs.DC · 2026-04-08 · unverdicted · none · ref 30 · internal anchor
ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, improving accuracy by up to 46.46%.
TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models cs.CL · 2026-03-13 · unverdicted · none · ref 52 · internal anchor
TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
When LLMs get significantly worse: A statistical approach to detect model degradations stat.ML · 2026-02-09 · conditional · none · ref 13 · internal anchor
A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety cs.CL · 2026-06-26 · unverdicted · none · ref 31 · internal anchor
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
DuplexOmni: Real-Time Listening, Seeing, Thinking, and Speaking for Full-Duplex Interaction cs.HC · 2026-06-08 · unverdicted · none · ref 18 · internal anchor
DuplexOmni achieves real-time full-duplex multimodal interaction by separating an interaction layer from a pluggable thinking layer, supported by a Writer-Director pipeline for continuous-interaction training data.
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling cs.LG · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
Dense2MoE unifies pruning of attention modules with upcycling of MLPs into MoE experts to produce on-device LLMs that improve the latency-accuracy Pareto frontier.
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training cs.LG · 2026-05-25 · unverdicted · none · ref 22 · internal anchor
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 42 · internal anchor
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability cs.AI · 2026-05-11 · unverdicted · none · ref 27 · internal anchor
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 58 · 2 links · internal anchor
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling cs.AI · 2026-05-02 · unverdicted · none · ref 11 · internal anchor
Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while self-consistency saturates.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference cs.LG · 2026-04-24 · unverdicted · none · ref 32 · internal anchor
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 40 · internal anchor
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs cs.LG · 2026-04-08 · unverdicted · none · ref 70 · internal anchor
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Attention Residuals cs.CL · 2026-03-16 · unverdicted · none · ref 49 · internal anchor
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
Mellum2 Technical Report cs.CL · 2026-05-29 · unverdicted · none · ref 70 · internal anchor
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer