WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi · 2021

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

17 Pith papers citing it

citation-role summary

dataset: 3 · background: 2

citation-polarity summary

use dataset: 3 · background: 2

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.

Understanding and Accelerating the Training of Masked Diffusion Language Models

cs.LG · 2026-05-13 · conditional · novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

cs.LG · 2025-10-27 · unverdicted · novelty 6.0

ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B parameters.

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

cs.LG · 2026-05-19 · unverdicted · novelty 5.0

FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.

A Composite Activation Function for Learning Stable Binary Representations

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

cs.CL · 2026-05-06 · unverdicted · novelty 5.0 · 2 refs

LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

citing papers explorer

Showing 17 of 17 citing papers.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts · cs.LG · 2026-05-13 · unverdicted · none · ref 55
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models · cs.LG · 2026-05-12 · accept · none · ref 32
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Large Language Diffusion Models · cs.CL · 2025-02-14 · unverdicted · none · ref 116
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Dynamic Chunking for Diffusion Language Models · cs.CL · 2026-05-15 · unverdicted · none · ref 36
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models · cs.LG · 2026-05-22 · unverdicted · none · ref 45
Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization · cs.CV · 2026-05-14 · unverdicted · none · ref 46
Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.

Understanding and Accelerating the Training of Masked Diffusion Language Models · cs.LG · 2026-05-13 · conditional · none · ref 60
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression · cs.LG · 2026-05-09 · unverdicted · none · ref 44
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts · cs.LG · 2026-05-07 · unverdicted · none · ref 41
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation · cs.AI · 2026-05-06 · unverdicted · none · ref 51
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling · cs.CL · 2026-04-27 · unverdicted · none · ref 42
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning · cs.LG · 2025-10-27 · unverdicted · none · ref 37
ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B parameters.

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network · cs.LG · 2025-06-02 · unverdicted · none · ref 50
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates · cs.LG · 2026-05-19 · unverdicted · none · ref 50
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.

A Composite Activation Function for Learning Stable Binary Representations · cs.LG · 2026-05-12 · unverdicted · none · ref 59
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.

Cubit: Token Mixer with Kernel Ridge Regression · cs.LG · 2026-05-07 · unverdicted · none · ref 67 · 2 links
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training · cs.CL · 2026-05-06 · unverdicted · none · ref 20 · 2 links
LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
