DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Julien Chaumond, Lysandre Debut, Thomas Wolf, Victor Sanh · 2019 · cs.CL · arXiv 1910.01108

Mixed citation behavior. Most common role is background (62%).

185 Pith papers citing it

Background 62% of classified citations

abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
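The abstract names the three components of the triple loss but not their exact form. The following is a minimal sketch in plain Python, assuming a temperature-softened cross-entropy for the distillation term, a standard cross-entropy for the language modeling term, and one minus cosine similarity for the cosine-distance term. The function names, the temperature, and the equal weights are illustrative assumptions, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def language_modeling_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction against the true token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_distance_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_lm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms; the weights here are assumptions."""
    return (
        w_distill * distillation_loss(student_logits, teacher_logits, temperature)
        + w_lm * language_modeling_loss(student_logits, target_index)
        + w_cos * cosine_distance_loss(student_hidden, teacher_hidden)
    )


# Toy single-token example with a three-word vocabulary.
loss = triple_loss(
    student_logits=[2.0, 0.5, 0.1],
    teacher_logits=[2.5, 0.3, 0.2],
    target_index=0,
    student_hidden=[0.2, 0.9, -0.1],
    teacher_hidden=[0.3, 0.8, 0.0],
)
print(f"triple loss: {loss:.4f}")
```

In a real training loop each term would be averaged over masked token positions and computed with a tensor library; the sketch only shows how the three signals combine into one objective.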

citation-role summary

background: 18 · method: 11
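The 62% background share quoted on this page is consistent with these counts, assuming it is computed as each role's count divided by the total number of classified citations. A small sketch of that arithmetic follows; the helper name `role_shares` is hypothetical.

```python
def role_shares(counts):
    """Return each role's share of classified citations as a whole percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares({"background": 18, "method": 11})
print(shares)  # {'background': 62, 'method': 38}
```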

citation-polarity summary

background: 18 · use method: 11

authors

Julien Chaumond, Lysandre Debut, Thomas Wolf, Victor Sanh

representative citing papers

Canonical Regularisation of Wide Feature-Learning Neural Networks

stat.ML · 2026-05-18 · unverdicted · novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

MATCHA: Matching Text via Contrastive Semantic Alignment

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

MATCHA introduces a dual-view contrastive metric measuring proximity to gold text and distance from adversarial contradictions, outperforming ROUGE and BERTScore by up to 20% on TruthfulQA and other NLP benchmarks.

Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

hep-ex · 2026-05-20 · unverdicted · novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.

Distribution-free root cause analysis

stat.ME · 2026-05-20 · unverdicted · novelty 7.0

CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

Differentially Private Motif-Preserving Multi-modal Hashing

cs.IR · 2026-05-14 · unverdicted · novelty 7.0

DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25K and NUS-WIDE while keeping 92.5% of non-private performance.

When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.

Switchcraft: AI Model Router for Agentic Tool Calling

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

astro-ph.GA · 2026-04-28 · unverdicted · novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

cs.AI · 2026-04-27 · conditional · novelty 7.0

AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment predicts external adoption metrics.

RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

GuardPhish: Securing Open-Source LLMs from Phishing Abuse

cs.CR · 2026-04-19 · unverdicted · novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

citing papers explorer

Showing 35 of 185 citing papers.

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement
cs.CV · 2026-05-28 · unverdicted · none · ref 10 · internal anchor
ViASNet applies a 3D U-Net architecture augmented with audio and semantic inputs to predict dynamic saliency in video ads and uses frame-wise entropy to diagnose low-engagement scenes on eye-tracked data from 151 ads.

Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT
cs.CL · 2026-05-25 · unverdicted · none · ref 2 · internal anchor
Continued pretraining on EFCAMDAT yields mixed AES results on FCE and IELTS; targeted CEFR-aligned subsets improve in-domain scoring more reliably than full-corpus pretraining but do not consistently aid cross-dataset transfer.

Tracing the ongoing emergence of human-like reasoning in Large Language Models
cs.CL · 2026-05-20 · unverdicted · none · ref 68 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.

Fortress: A Case Study in Stabilizing Search Recommendations via Temporal Data Augmentation and Feature Pruning
cs.IR · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
Fortress stabilizes query-to-app relevance models by pruning features that cause inconsistent predictions across time periods while retaining predictive power from engagement signals.

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
cs.SE · 2026-04-28 · unverdicted · none · ref 45 · internal anchor
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.

Knowledge Distillation Must Account for What It Loses
cs.LG · 2026-04-28 · unverdicted · none · ref 3 · 2 links · internal anchor
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Understanding Communication Backends in Cross-Silo Federated Learning
cs.DC · 2026-04-12 · unverdicted · none · ref 13 · internal anchor
Benchmarks of MPI, gRPC, and PyTorch RPC in cross-silo FL plus a new gRPC+S3 hybrid backend deliver up to 3.8x speedup for large-model transmission under realistic network conditions.

Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition
cs.AI · 2026-04-06 · unverdicted · none · ref 27 · internal anchor
A cosine-similarity metric on SHAP feature attributions is proposed to quantify explanation stability for same-label inputs under perturbations in transformer-based sentiment classifiers.

Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
cs.SE · 2026-04-03 · conditional · none · ref 84 · internal anchor
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.

Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
cs.CY · 2025-11-06 · unverdicted · none · ref 33 · 2 links · internal anchor
G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.

Framing Unionization on Facebook: Communication around Representation Elections in the United States
cs.CY · 2025-10-02 · unverdicted · none · ref 11 · internal anchor
Union Facebook posts predominantly use diagnostic and community frames; pre-election emphasis on diagnostic, prognostic, and community frames correlates with higher election success, with post-election frame usage diverging by outcome.

An Improved Quantum Software Challenges Classification Approach using Transfer Learning and Explainable AI
cs.SE · 2025-09-25 · conditional · none · ref 47 · internal anchor
Transfer learning with BERT models classifies quantum software engineering challenges from Stack Overflow posts into six categories at 95% average accuracy, outperforming traditional ML baselines by 6% and adding SHAP interpretability.

Towards the Anonymization of the Language Modeling
cs.CL · 2025-01-05 · unverdicted · none · ref 44 · internal anchor
Authors introduce MLM and CLM specialization methods that avoid memorizing identifiers in sensitive training data while aiming for a privacy-utility tradeoff on medical datasets.

Do Sentence Transformers Learn Quasi-Geospatial Concepts from General Text?
cs.CL · 2024-04-05 · unverdicted · none · ref 14 · internal anchor
Sentence transformers show partial zero-shot ability to link route descriptions with hiking queries, indicating some grasp of quasi-geospatial concepts like type and difficulty.

A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
cs.CL · 2024-01-02 · accept · none · ref 59 · internal anchor
A survey that compiles and taxonomizes more than 32 existing hallucination mitigation techniques for LLMs while analyzing their challenges and limitations.

Little Brains, Big Feats: Exploring Compact Language Models
cs.CL · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
Small language models can run RAG generation on-device without GPUs in reasonable time.

Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds
cs.LG · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
CROWDio enables memory-efficient ONNX inference of DistilBERT on Android handsets by partitioning across devices with JIT loading, affinity scheduling, compressed transport and streaming, keeping per-device memory at 43 MB and cutting latency 34%.

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models
cs.LG · 2026-05-14 · unverdicted · none · ref 41 · internal anchor
Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.

SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
cs.CL · 2026-04-29 · unverdicted · none · ref 20 · internal anchor
A multi-head RoBERTa model with overlapping chunking and max-pooling achieves Macro-F1 of 0.80 on 3-way clarity classification and 0.51 on 9-way evasion strategy detection, ranking 11th in both subtasks of SemEval-2026 Task 6.

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation
cs.LG · 2025-01-03 · unverdicted · none · ref 47 · internal anchor
CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.

SleepNet and DreamNet: Enriching and Reconstructing Representations for Consolidated Visual Classification
cs.LG · 2024-09-03 · unverdicted · none · ref 36 · internal anchor
SleepNet and DreamNet enrich visual features via supervised pre-trained encoders and reconstruct hidden states with encoder-decoder frameworks to outperform prior state-of-the-art classifiers.

A Survey on Knowledge Distillation of Large Language Models
cs.CL · 2024-02-20 · accept · none · ref 75 · internal anchor
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

DistilledGemma: Balanced Efficiency-Accuracy for Person-Place Relation Extraction from Multilingual Historical Articles
cs.CL · 2026-06-28 · unverdicted · none · ref 21 · internal anchor
DistilledGemma uses prompt engineering, QLoRA fine-tuning on a large teacher, and response-level distillation to a small student, ranking 3rd and 2nd in a 2026 historical relation extraction shared task while keeping the deployed model at ~2.3B parameters.

A Lightweight Hybrid Transformer-CRF Architecture for Multi-Type Bangla Medical Entity Recognition
cs.CL · 2026-05-25 · unverdicted · none · ref 14 · internal anchor
A distilled and quantized 4-layer BanglaBERT-CRF model delivers 8.6x CPU speedup and 48% less storage than the 12-layer teacher for Bangla medical entity recognition.

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
cs.CL · 2026-05-05 · unverdicted · none · ref 20 · 2 links · internal anchor
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.

Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models
cs.CL · 2026-04-30 · unverdicted · none · ref 5 · internal anchor
DistilBERT achieves 84.78% accuracy and 84.75% F1-score on binary sentiment classification of Indonesian student opinions about AI in higher education, outperforming SVM at 82.14%.

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
cs.DC · 2025-03-11 · unverdicted · none · ref 70 · internal anchor
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook
cs.LG · 2024-06-10 · unverdicted · none · ref 80 · internal anchor
A literature survey reviewing traditional diagnostics, AI-driven studies, and explainable AI models for mental disorder detection via online social media, including datasets, evaluation practices, issues, and future directions.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
cs.CV · 2026-05-20 · unreviewed · ref 86 · internal anchor

Post-Trained MoE Can Skip Half Experts via Self-Distillation
cs.LG · 2026-05-18 · unreviewed · ref 34 · internal anchor

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
cs.LG · 2026-05-01 · unreviewed · ref 40 · internal anchor

Hierarchical Fault Detection and Diagnosis for Transformer Architectures
cs.SE · 2026-04-30 · unreviewed · ref 31 · internal anchor

Adaptive Head Budgeting for Efficient Multi-Head Attention
cs.LG · 2026-04-24 · unreviewed · ref 5 · internal anchor

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
cs.CL · 2025-11-09 · unreviewed · ref 50 · internal anchor

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
cs.CL · 2025-09-26 · unreviewed · ref 14 · internal anchor
