Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
super hub Mixed citations
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Mixed citation behavior. Most common role is background (62%).
abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge di
authors
co-cited works
representative citing papers
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.
MATCHA introduces a dual-view contrastive metric measuring proximity to gold text and distance from adversarial contradictions, outperforming ROUGE and BERTScore by up to 20% on TruthfulQA and other NLP benchmarks.
PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25K and NUS-WIDE while keeping 92.5% of non-private performance.
Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.
Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.
TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment predicts external adoption metrics.
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
citing papers explorer
-
MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
-
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.
-
R3M: A Universal Visual Representation for Robot Manipulation
A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.
-
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
Denoising Student distills the multi-step denoising process of score-based and diffusion models into a single forward pass, matching GAN sampling speed while producing comparable sample quality on CIFAR-10, CelebA, and 256x256 LSUN.
-
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
ALFWorld aligns text-based and embodied visual environments so agents can learn abstract policies in TextWorld that transfer to better performance on ALFRED tasks than visual-only training.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
DOPD: Dual On-policy Distillation
DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
-
SurrogateShield: Beyond Redaction for High-Utility, Privacy-Preserving LLM Interactions
SurrogateShield replaces detected PII with device-local surrogates before LLM API calls and restores originals afterward, achieving 98.87% F1 detection and 13.26 pp higher BERTScore than placeholder redaction while blocking real PII transmission.
-
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
DPVR-LF routes saturated vision tokens into a one-layer side branch after layer 4, runs text-only processing through layers 5-17, and performs late fusion at the final layer to reduce visual computation while preserving multimodal performance.
-
Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral
IndicBERT-HPA with language-aware adapters and verification-guided deferral outperforms baselines on multilingual orthopedic note classification, reaching 0.8792 Macro-F1 overall and 84.4% selective accuracy at 72.3% coverage.
-
SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching
SemStruct models tables as heterogeneous graphs with GNNs on frozen PLM embeddings to incorporate row co-occurrences for schema matching and reports SOTA results on Valentine and SOTAB-SM benchmarks.
-
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices
BitTP applies weight-only 1.58-bit quantization to LLM trajectory predictors, claiming improved ADE/FDE over BF16 baseline with reduced resource demands on edge devices.
-
From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints
An LLM+BM25+graph pipeline tags learning resources to competencies with evidence spans, reaching 0.57 micro-F1 and 0.50 macro-F1 at fragment level on a 22-competency university dataset while outperforming baselines.
-
When Mean CE Fails: Median CE Can Better Track Language Model Quality
Median cross-entropy tracks language model task performance more reliably than mean cross-entropy during synthetic fact-learning SFT and top-K distillation.
-
Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers
Grammatically-Guided Sparse Attention uses POS tags to generate hard or soft masks that constrain self-attention, achieving 0.8200 and 0.8165 accuracy on SST-2 versus 0.8200 for full attention in a DistilBERT-like model.
-
FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection
FAME achieves F1 of 98.16 on BGL and 99.95 on Thunderbird for message-level log anomaly detection using at most K=100 labels per template, reducing annotation effort by 76x while detecting anomalies from unseen EventIDs.
-
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.
-
Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation
DMA-KWS achieves 97.85% AUC and 6.13% EER on LibriPhrase Hard via dual-stage CTC/QbyT matching, multi-modal enrollment, and lightweight continual adaptation with 187k parameters.
-
OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
OSCToM uses RL-guided generation with an extended DSL and surrogate models to create nested belief conflict tasks, raising FANToM accuracy from 0.2% to 76% while being 6x more efficient.
-
PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling
PromptRad reformulates multi-label radiology report classification as masked language modeling and enriches verbalizers with UMLS synonyms, outperforming baselines with only 32 training examples.
-
Traditional statistical representations outperform generative AI in identifying expert peer reviewers
TF-IDF identifies labeled experts in the top 25 recommendations 79.5% of the time versus 51.5% for GPT-4o mini on an astronomy observatory dataset.
-
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
-
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
-
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?
Fine-tuned LLM and explainable models predict vocabulary difficulty with correlations r > 0.91 and r > 0.77, showing spelling difficulty and test item construction as key influences in addition to word production difficulty.
-
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
-
PhishSigma++: Malicious Email Detection with Typed Entity Relations
PhishSigma++ reaches 0.9675 F1 on clean data and holds 0.9579 F1 under adversarial text padding by modeling typed entity relations in emails, outperforming text-only baselines that drop sharply.
-
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
-
Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
Graph representation learning plus iterative augmented Lagrangian optimization creates stronger, harder-to-detect model manipulation attacks on federated LLM fine-tuning, cutting global accuracy by up to 26%.
-
PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
PRIMED improves referring audio-visual segmentation by using a modality prior decoder and competition-aware fusion to adaptively suppress irrelevant modalities.
-
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
eDySec: A Deep Learning-based Explainable Dynamic Analysis Framework for Detecting Malicious Packages in PyPI Ecosystem
eDySec is a deep learning-based framework that detects malicious PyPI packages through dynamic analysis, halving feature dimensionality, reducing false positives by 82%, false negatives by 79%, and boosting accuracy by 3% with near-perfect stability.
-
G-Loss: Graph-Guided Fine-Tuning of Language Models
G-Loss builds a document-similarity graph and uses semi-supervised label propagation to guide fine-tuning of language models, yielding higher accuracy than standard losses on five classification benchmarks.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
-
GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
GIST extracts a semantically annotated 2D navigation topology from consumer mobile point clouds to improve spatial grounding for embodied AI in dense environments.
-
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.
-
ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification
ProtoSiTex introduces dual-phase prototype learning with hierarchical consistency loss for semi-interpretable multi-label text classification on a new subsentence-annotated hotel review dataset.
-
A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm
Benchmarks five compressed transformer models for multi-platform sentiment classification on 15-minute city discourse, reporting DistilRoBERTa highest F1 of 0.8292 and platform-specific performance differences.
-
Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code
Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.
-
Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
-
ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs
ALIEN trains a lightweight uncertainty head initialized to model entropy and refined via supervised regularization to improve detection of incorrect predictions and calibration on classification and NER tasks.
-
Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.
-
Efficient Model Repository for Entity Resolution: Construction, Search, and Integration
MoRER builds an ER model repository via feature distribution clustering of tasks, achieving competitive results with limited labels versus active learning, transfer learning, and self-supervised methods on three multi-source datasets.
-
Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA
A LoRA-based residual feature alignment method for efficient machine unlearning on pre-trained models by targeting zero residuals on retained data and shifted residuals on unlearned data.
-
Remember what you did so you know what to do next
GPT-J with full action history achieves 3.5x improvement over RL in ScienceWorld and matches a two-stage system using 29x larger models.
-
Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models
bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.
-
From Sentiment to Actionable Insights: A Data-Driven Public Sentiment Analysis of Advanced Air Mobility
Applies standard sentiment classifiers and topic modeling to a large AAM discussion corpus, identifies six clusters of public concern, and lists strategies to address them.
-
Rec-Distill: An Industrial Distillation Pipeline for Large-Scale Recommendation Models
Rec-Distill is an industrial distillation pipeline that transfers substantial performance from large-scale recommendation models to efficient students, reporting over 60% transferability and measurable business gains.