arxiv: 1910.01108 · v4 · submitted 2019-10-02 · 💻 cs.CL

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Pith reviewed 2026-05-11 04:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords DistilBERT · knowledge distillation · BERT · model compression · pre-training · language models · natural language processing · on-device computation

The pith

DistilBERT is a 40% smaller version of BERT that retains 97% of its language understanding while running 60% faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that knowledge distillation can be applied during the pre-training phase to create a compact general-purpose language model from BERT. By training the smaller student model with a triple loss that includes language modeling, distillation from the teacher, and cosine-distance terms, the authors transfer enough knowledge to preserve most capabilities. This matters because large pre-trained models are difficult to deploy under tight compute budgets on edge devices or in constrained environments. The resulting DistilBERT can later be fine-tuned on downstream tasks without needing the full original model size.

Core claim

We introduce DistilBERT, a smaller general-purpose language representation model pre-trained using knowledge distillation, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

What carries the argument

Triple loss combining language modeling, distillation, and cosine-distance losses during pre-training to transfer knowledge to the smaller student model.
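
As a rough illustration of how the three terms might combine for a single masked token, the sketch below uses plain Python. It assumes the common soft-target formulation of distillation (cross-entropy against the teacher's temperature-softened distribution), which this summary does not spell out. The function names, the temperature of 2.0, and the unit weights `w_distil`, `w_mlm`, and `w_cos` are illustrative placeholders, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student's softened distribution under the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def mlm_loss(student_logits, target_index):
    """Masked-language-modeling cross-entropy for one masked token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                temperature=2.0, w_distil=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms; weights and temperature are assumed."""
    return (w_distil * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


# Toy example: a 3-word vocabulary and 2-dimensional hidden states.
example = triple_loss(
    student_logits=[1.0, 0.5, -0.5],
    teacher_logits=[2.0, 1.0, 0.0],
    target_index=0,
    student_hidden=[0.9, 0.1],
    teacher_hidden=[1.0, 0.0],
)
```

In a real training loop these terms would be averaged over all masked positions in a batch and computed with a tensor library. The scalar version here only shows what each term measures.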

Load-bearing premise

The combination of language modeling, distillation, and cosine-distance losses transfers enough knowledge from the full BERT teacher to the smaller student without requiring the full model capacity or additional task-specific supervision.

What would settle it

If DistilBERT's fine-tuned performance on standard NLP benchmarks falls below 97% of BERT's scores or if measured inference speed gains are less than 60% in direct side-by-side tests, the central claims would not hold.
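
The falsification test above reduces to a few ratios, sketched here in Python. The helper names are hypothetical, and every number in the usage lines is a placeholder rather than a measurement, except the 66M parameter count given elsewhere on this page and the roughly 110M parameters commonly reported for BERT-base. Reading "60% faster" as `teacher_time / student_time - 1` is one plausible interpretation and is an assumption of this sketch.

```python
def relative_retention(student_score, teacher_score):
    """Fraction of the teacher's benchmark score that the student keeps."""
    return student_score / teacher_score


def relative_speedup(teacher_seconds, student_seconds):
    """Fractional speed gain on the same workload (0.60 reads as '60% faster')."""
    return teacher_seconds / student_seconds - 1.0


def size_reduction(student_params, teacher_params):
    """Fraction of the teacher's parameters removed in the student."""
    return 1.0 - student_params / teacher_params


def claims_hold(student_score, teacher_score, teacher_seconds, student_seconds,
                min_retention=0.97, min_speedup=0.60):
    """True if both headline thresholds are met by the supplied measurements."""
    return (relative_retention(student_score, teacher_score) >= min_retention
            and relative_speedup(teacher_seconds, student_seconds) >= min_speedup)


# Placeholder measurements, not results from the paper.
passes = claims_hold(student_score=78.0, teacher_score=80.0,
                     teacher_seconds=170.0, student_seconds=100.0)
fails = claims_hold(student_score=78.0, teacher_score=80.0,
                    teacher_seconds=170.0, student_seconds=120.0)

# 66M parameters against an assumed ~110M for BERT-base.
reduction = size_reduction(66e6, 110e6)
```

A side-by-side test would also need to fix hardware, batch size, and sequence length, since the speed ratio depends on all three.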

Original abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces DistilBERT, a 6-layer, 66M-parameter distilled version of BERT-base, pre-trained with a triple loss combining masked language modeling, knowledge distillation, and cosine-distance embedding losses. It claims a 40% size reduction while retaining 97% of BERT's performance on language understanding tasks, 60% faster inference, and suitability for on-device use. These claims are supported by evaluations on GLUE (average 97% relative score), SQuAD, and IMDB, by loss ablations, by a comparison against a 6-layer baseline trained from scratch, and by CPU/GPU speed measurements with reported batch sizes.

Significance. If the empirical results hold, the work offers a practical pre-training distillation method that enables smaller general-purpose language models without task-specific supervision, directly addressing deployment constraints. Strengths include the ablation evidence in §3.3 showing each loss term's contribution, the underperformance of the non-distilled baseline, and concrete inference timings, which together provide reproducible support for the central efficiency claims.

minor comments (2)
  1. [Abstract] Performance claims (97% retention, 60% speedup) are stated without reference to the specific downstream tasks or variance; while §3 and the tables provide these details, a one-sentence qualifier on evaluation scope would improve standalone readability.
  2. [§3.3] Ablation results demonstrate the value of each loss component, but the table does not report run-to-run variance or the number of seeds; adding this would make the claimed contribution of the cosine term more robust.
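
The reporting asked for in the second comment can be sketched in a few lines of Python. The function names, the scores, and the two-standard-deviation rule of thumb are all hypothetical; none of these numbers come from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, number of seeds)."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores), len(scores)


def ablation_gap_is_clear(full_scores, ablated_scores, k=2.0):
    """Crude check: the mean gap exceeds k times the larger per-condition std."""
    full_mean, full_std, _ = summarize_runs(full_scores)
    abl_mean, abl_std, _ = summarize_runs(ablated_scores)
    return (full_mean - abl_mean) > k * max(full_std, abl_std)


# Hypothetical per-seed scores for the full loss and for one ablated variant.
full = [80.0, 80.2, 79.8]
without_cosine = [79.0, 79.1, 78.9]
clear = ablation_gap_is_clear(full, without_cosine)
```

A proper significance test would be preferable with more seeds. This only shows the minimum an ablation table would need to report: mean, spread, and seed count per row.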

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the DistilBERT paper, recognition of its practical contributions to model compression, and recommendation for minor revision. We are pleased that the ablation evidence, baseline comparisons, and concrete speed measurements were noted as providing reproducible support for the claims.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central contribution is an empirical training procedure: a 6-layer student model is pre-trained on the same corpus as BERT-base using a composite loss (MLM + distillation + cosine embedding) and then evaluated on GLUE, SQuAD, and IMDB. All reported performance numbers (97% relative GLUE score, 60% speed-up, 40% size reduction) are obtained by direct measurement after training; no equation or prediction is shown to be mathematically identical to a fitted parameter or to a self-citation chain. Ablations in §3.3 and the from-scratch baseline comparison further demonstrate that the result is not forced by construction. The work therefore rests on externally verifiable experimental outcomes rather than on any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method implicitly assumes standard transformer architecture and knowledge-distillation transferability, which are treated as background rather than paper-specific inventions.

pith-pipeline@v0.9.0 · 5484 in / 1007 out tokens · 24625 ms · 2026-05-11T04:58:48.987359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses.

  • Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    DistilBERT (6 layers, 66M params) is compared to BERT-base on GLUE (avg. 97% relative score), SQuAD, and IMDB; ablations in §3.3 show each term of the triple loss (MLM + distillation + cosine) contributes

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Canonical Regularisation of Wide Feature-Learning Neural Networks

    stat.ML 2026-05 unverdicted novelty 8.0

    Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

  2. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  3. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  4. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    cs.CL 2023-05 conditional novelty 8.0

    Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

  5. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  6. Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

    hep-ex 2026-05 unverdicted novelty 7.0

    PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.

  7. Distribution-free root cause analysis

    stat.ME 2026-05 unverdicted novelty 7.0

    CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.

  8. AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

    cs.CV 2026-05 unverdicted novelty 7.0

    The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.

  9. Layer-wise Token Compression for Efficient Document Reranking

    cs.IR 2026-05 unverdicted novelty 7.0

    Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...

  10. Layer-wise Token Compression for Efficient Document Reranking

    cs.IR 2026-05 conditional novelty 7.0

    Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.

  11. TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

    cs.OS 2026-05 unverdicted novelty 7.0

    TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

  12. Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

    cs.CL 2026-05 unverdicted novelty 7.0

    RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain o...

  13. Differentially Private Motif-Preserving Multi-modal Hashing

    cs.IR 2026-05 unverdicted novelty 7.0

    DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25...

  14. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  15. When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

    cs.LG 2026-05 unverdicted novelty 7.0

    Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.

  16. Switchcraft: AI Model Router for Agentic Tool Calling

    cs.AI 2026-05 unverdicted novelty 7.0

    Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

  17. TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

    stat.ML 2026-05 unverdicted novelty 7.0

    TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

  18. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  19. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  20. VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

  21. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  22. AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

    cs.AI 2026-04 conditional novelty 7.0

    AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...

  23. Adaptive Head Budgeting for Efficient Multi-Head Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.

  24. RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

    cs.CL 2026-04 unverdicted novelty 7.0

    RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

  25. GuardPhish: Securing Open-Source LLMs from Phishing Abuse

    cs.CR 2026-04 unverdicted novelty 7.0

    Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

  26. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  27. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  28. Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

    cs.CL 2026-04 conditional novelty 7.0

    Synthetic data of 1M+ multi-label samples across 23 languages trains models that match or exceed English-only specialists on zero-shot benchmarks for emotion classification.

  29. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  30. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.

  31. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  32. A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

  33. Explainable Semantic Textual Similarity via Dissimilar Span Detection

    cs.CL 2026-03 unverdicted novelty 7.0

    Introduces the Dissimilar Span Detection task and Span Similarity Dataset to explain semantic textual similarity by identifying differing spans between text pairs.

  34. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  35. DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack

    cs.CR 2025-12 unverdicted novelty 7.0

    DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.

  36. Language-Conditioned Safe Trajectory Generation for Spacecraft Rendezvous

    cs.RO 2025-12 unverdicted novelty 7.0

    SAGES translates natural-language commands into constraint-respecting spacecraft trajectories, achieving over 90% semantic-behavioral consistency in proximity operations and robotic tests.

  37. SAM 3: Segment Anything with Concepts

    cs.CV 2025-11 unverdicted novelty 7.0

    SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

  38. Task complexity shapes internal representations and robustness in neural networks

    cs.LG 2025-08 unverdicted novelty 7.0

    Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-prec...

  39. A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

    cs.CV 2025-03 unverdicted novelty 7.0

    DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.

  40. Post-detection inference for sequential changepoint localization

    stat.ML 2025-02 unverdicted novelty 7.0

    Develops a general nonparametric framework for constructing non-asymptotically valid confidence sets for changepoint location using data up to an arbitrary detection stopping time.

  41. Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

    cs.LG 2025-02 unverdicted novelty 7.0

    Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation wit...

  42. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  43. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  44. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  45. Strong Teacher Not Needed? On Distillation in LLM Pretraining

    cs.LG 2026-05 unverdicted novelty 6.0

    Even small or undertrained teachers improve larger LLM students via distillation with tuned loss mixing, while stronger teachers can saturate or reverse gains and distillation aids generalization more than in-domain fit.

  46. Multimodal Distribution Matching for Vision-Language Dataset Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    MDM distills vision-language datasets via joint embedding clustering, weight-space model interpolation, and geometry-aware distribution matching on the unit hypersphere.

  47. Convex Optimization for Alignment and Preference Learning on a Single GPU

    cs.LG 2026-05 unverdicted novelty 6.0

    COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...

  48. Proxy-Based Approximation of Shapley and Banzhaf Interactions

    cs.LG 2026-05 unverdicted novelty 6.0

    ProxySHAP uses tree proxies plus residual correction to achieve state-of-the-art approximation of Shapley and Banzhaf interactions, with a polynomial-time exact method for tree ensembles.

  49. Proxy-Based Approximation of Shapley and Banzhaf Interactions

    cs.LG 2026-05 unverdicted novelty 6.0

    ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.

  50. Post-Trained MoE Can Skip Half Experts via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.

  51. DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    DP-SelFT improves the privacy-utility trade-off for LLM fine-tuning by selecting robust layer subsets via DP synthetic data and perturbation-matched evaluation.

  52. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...

  53. On the Burden of Achieving Fairness in Conformal Prediction

    stat.ML 2026-05 unverdicted novelty 6.0

    Pooled conformal calibration incurs irreducible group-wise coverage distortion set by cross-group quantile heterogeneity, and Equalized Coverage and Equalized Set Size are in fundamental tension.

  54. On the Burden of Achieving Fairness in Conformal Prediction

    stat.ML 2026-05 unverdicted novelty 6.0

    Pooled conformal calibration incurs irreducible group-wise coverage distortion scaled by cross-group quantile heterogeneity, with Equalized Coverage and Equalized Set Size in fundamental tension.

  55. Distribution Corrected Offline Data Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

  56. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  57. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  58. BoolXLLM: LLM-Assisted Explainability for Boolean Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.

  59. Unified Approach for Weakly Supervised Multicalibration

    stat.ML 2026-05 unverdicted novelty 6.0

    A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC po...

  60. Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

    cs.CL 2026-05 conditional novelty 6.0

    Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 162 Pith papers · 2 internal anchors

  1. [1]

    NIPS , year=

    Attention Is All You Need , author=. NIPS , year=

  2. [2]

    NAACL-HLT , year=

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author=. NAACL-HLT , year=

  3. [3]

    Language Models are Unsupervised Multitask Learners , author=

  4. [4]

    ArXiv , year=

    RoBERTa: A Robustly Optimized BERT Pretraining Approach , author=. ArXiv , year=

  5. [5]

    ArXiv , year=

    Distilling the Knowledge in a Neural Network , author=. ArXiv , year=

  6. [6]

    KDD , year=

    Model compression , author=. KDD , year=

  7. [7]

    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    Rethinking the Inception Architecture for Computer Vision , author=. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

  8. [8]

    International Conference on Learning Representations , year=

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks , author=. International Conference on Learning Representations , year=

  9. [9]

    2015 IEEE International Conference on Computer Vision (ICCV) , year=

    Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , author=. 2015 IEEE International Conference on Computer Vision (ICCV) , year=

  10. [10]

    ArXiv , year=

    SpanBERT: Improving Pre-training by Representing and Predicting Spans , author=. ArXiv , year=

  11. [11]

    Tenney and Yada Pruksachatkun and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R

    Alex Wang and Ian F. Tenney and Yada Pruksachatkun and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R. Thomas McCoy and Roma Patel and Yinghui Huang and Jason Phang and Edouard Grave and Najoung Kim and Phu Mon Htut and Thibault F'

  12. [12]

    ICLR , year=

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. ICLR , year=

  13. [13]

    and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke , title=

    Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke , title=. NAACL , year=

  14. [14]

    ACL , year=

    Learning Word Vectors for Sentiment Analysis , author=. ACL , year=

  15. [15]

    EMNLP , year=

    SQuAD: 100, 000+ Questions for Machine Comprehension of Text , author=. EMNLP , year=

  16. [16]

    ArXiv , year=

    Distilling Task-Specific Knowledge from BERT into Simple Neural Networks , author=. ArXiv , year=

  17. [17]

    ArXiv , year=

    Making Neural Machine Reading Comprehension Faster , author=. ArXiv , year=

  18. [18]

    ArXiv , year=

    Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation , author=. ArXiv , year=

  19. [19]

    ArXiv , year=

    Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System , author=. ArXiv , year=

  20. [20]

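    Small and Practical BERT Models for Sequence Labeling

    Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. Small and Practical BERT Models for Sequence Labeling. In EMNLP-IJCNLP, 2019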

  21. [21]

    BAM! Born-Again Multi-Task Networks for Natural Language Understanding

    BAM! Born-Again Multi-Task Networks for Natural Language Understanding. In ACL

  22. [22]

    Energy and Policy Considerations for Deep Learning in NLP

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. In ACL, 2019

  23. [23]

    Green AI

    Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. ArXiv, abs/1907.10597, 2019

  24. [24]

    Are Sixteen Heads Really Better than One?

    Paul Michel, Omer Levy, and Graham Neubig. Are Sixteen Heads Really Better than One? In NeurIPS, 2019

  25. [25]

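    Deep Learning with Limited Numerical Precision

    Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In ICML, 2015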

  26. [26]

    Q8BERT, a Quantized 8bit Version of BERT-Base

    Q8BERT, a Quantized 8bit Version of BERT-Base. intel.ai, 2019

  27. [27]

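    Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers: State-of-the-art Natural Language Processing, 2019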

  28. [28]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2018

  29. [29]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  30. [30]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar S. Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019

  31. [31]

    Green AI

    Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. ArXiv, abs/1907.10597, 2019

  32. [32]

    Energy and policy considerations for deep learning in NLP

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In ACL, 2019

  33. [33]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017

  34. [34]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers: State-of-the-art natural language processing, 2019

  35. [35]

    Model compression

    Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006

  36. [36]

    Distilling the Knowledge in a Neural Network

    Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531, 2015

  37. [37]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19-27, 2015

  38. [38]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2018

  39. [39]

    Deep contextualized word representations

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018

  40. [40]

    Alex Wang, Ian F. Tenney, Yada Pruksachatkun, Katherin Yu, Jan Hula, Patrick Xia, Raghu Pappagari, Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Jason Phang, Edouard Grave, Najoung Kim, Phu Mon Htut, Thibault Févry, Berlin Chen, Nikita Nangia, Haokun Liu, Anhad Mohananey, Shikha Bordia, Nicolas Patry, Ellie Pavlick, and Samuel R. Bowman. jia...

  41. [41]

    Learning word vectors for sentiment analysis

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In ACL, 2011

  42. [42]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016

  43. [43]

    Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

    Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. ArXiv, abs/1903.12136, 2019

  44. [44]

    Making neural machine reading comprehension faster

    Debajyoti Chatterjee. Making neural machine reading comprehension faster. ArXiv, abs/1904.00796, 2019

  45. [45]

    Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962

    Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. ArXiv, abs/1908.08962, 2019

  46. [46]

    Model compression with multi-task knowledge distillation for web-scale question answering system

    Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. Model compression with multi-task knowledge distillation for web-scale question answering system. ArXiv, abs/1904.09636, 2019

  47. [47]

    Small and practical BERT models for sequence labeling

    Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP, 2019

  48. [48]

    Are sixteen heads really better than one?

    Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, 2019

  49. [49]

    Deep learning with limited numerical precision

    Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015