DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Julien Chaumond, Lysandre Debut, Thomas Wolf, Victor Sanh · 2019 · cs.CL · arXiv 1910.01108

Mixed citation behavior. Most common role is background (62%).

171 Pith papers citing it

abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
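The triple loss named in the abstract can be illustrated with a minimal sketch. It assumes a single masked position, a cross-entropy form for the distillation term against temperature-softened teacher probabilities, and one minus cosine similarity for the hidden-state alignment term. The function names, the temperature value, and the equal weights are hypothetical choices for illustration, not values taken from the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student against the teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def mlm_loss(student_logits, target_index):
    """Language modeling term: negative log-likelihood of the true token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """One minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                temperature=2.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; temperature and weights are assumed values."""
    w_distil, w_mlm, w_cos = weights
    return (w_distil * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


if __name__ == "__main__":
    # Toy values over a 4-token vocabulary and 3-dimensional hidden states.
    loss = triple_loss(
        student_logits=[2.0, 0.5, 0.1, -1.0],
        teacher_logits=[3.0, 0.2, 0.0, -2.0],
        target_index=0,
        student_hidden=[0.9, 0.1, 0.3],
        teacher_hidden=[1.0, 0.0, 0.4],
    )
    print(f"combined loss: {loss:.4f}")
```

In practice each term would be averaged over all masked positions in a batch and computed with an autodiff framework; the plain-Python form here only shows how the three terms fit together.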

citation-role summary

background: 18 · method: 11

citation-polarity summary

background: 18 · use method: 11

representative citing papers

Canonical Regularisation of Wide Feature-Learning Neural Networks

stat.ML · 2026-05-18 · unverdicted · novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

hep-ex · 2026-05-20 · unverdicted · novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.

Distribution-free root cause analysis

stat.ME · 2026-05-20 · unverdicted · novelty 7.0

CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

Differentially Private Motif-Preserving Multi-modal Hashing

cs.IR · 2026-05-14 · unverdicted · novelty 7.0

DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25K and NUS-WIDE while keeping 92.5% of non-private performance.

When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.

Switchcraft: AI Model Router for Agentic Tool Calling

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

astro-ph.GA · 2026-04-28 · unverdicted · novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

cs.AI · 2026-04-27 · conditional · novelty 7.0

AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment predicts external adoption metrics.

RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

GuardPhish: Securing Open-Source LLMs from Phishing Abuse

cs.CR · 2026-04-19 · unverdicted · novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

Depth Adaptive Efficient Visual Autoregressive Modeling

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

SecureRouter: Encrypted Routing for Efficient Secure Inference

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

citing papers explorer

Showing 18 of 18 citing papers after filters.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
cs.CR · 2026-04-30 · unverdicted · none · ref 41 · internal anchor
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

Distribution Corrected Offline Data Distillation for Large Language Models
cs.CL · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

Unified Approach for Weakly Supervised Multicalibration
stat.ML · 2026-05-11 · unverdicted · none · ref 36 · internal anchor
A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC post-hoc recalibration algorithm.

ImproBR: Bug Report Improver Using LLMs
cs.SE · 2026-04-28 · unverdicted · none · ref 29 · internal anchor
ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.

A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
cs.SE · 2026-04-09 · unverdicted · none · ref 32 · internal anchor
QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
cs.LG · 2024-10-31 · unverdicted · none · ref 42 · internal anchor
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

Retrieval-Augmented Generation for Natural Language Processing: A Survey
cs.CL · 2024-07-18 · accept · none · ref 152 · internal anchor
The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

MiniLLM: On-Policy Distillation of Large Language Models
cs.CL · 2023-06-14 · conditional · none · ref 17 · internal anchor
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

PhishSigma++: Malicious Email Detection with Typed Entity Relations
cs.CR · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
PhishSigma++ reaches 0.9675 F1 on clean data and holds 0.9579 F1 under adversarial text padding by modeling typed entity relations in emails, outperforming text-only baselines that drop sharply.

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
cs.CL · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
cs.AI · 2026-05-04 · unverdicted · none · ref 38 · internal anchor
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.

eDySec: A Deep Learning-based Explainable Dynamic Analysis Framework for Detecting Malicious Packages in PyPI Ecosystem
cs.CR · 2026-04-29 · unverdicted · none · ref 54 · internal anchor
eDySec is a deep learning-based framework that detects malicious PyPI packages through dynamic analysis, halving feature dimensionality, reducing false positives by 82%, false negatives by 79%, and boosting accuracy by 3% with near-perfect stability.

A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
cs.CL · 2026-04-08 · unverdicted · none · ref 59 · internal anchor
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
cs.CL · 2026-05-05 · unverdicted · none · ref 20 · 2 links · internal anchor
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC · 2025-03-11 · unverdicted · none · ref 70 · internal anchor
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook
cs.LG · 2024-06-10 · unverdicted · none · ref 80 · internal anchor
A literature survey reviewing traditional diagnostics, AI-driven studies, and explainable AI models for mental disorder detection via online social media, including datasets, evaluation practices, issues, and future directions.

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
cs.LG · 2026-05-01 · unreviewed · ref 40 · internal anchor

Hierarchical Fault Detection and Diagnosis for Transformer Architectures
cs.SE · 2026-04-30 · unreviewed · ref 31 · internal anchor
