TinyBERT: Distilling BERT for Natural Language Understanding
8 Pith papers cite this work. Polarity classification is still indexing.
Citation summary. Verdicts: 8 unverdicted. Citation roles: background (2). Citation polarities: background (2). Representative citing papers are listed in the explorer below.
Citing papers explorer
- Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves the transfer of reasoning skills by modulating updates according to the relative local validity of next steps rather than enforcing full-trajectory imitation (a minimal weighting sketch appears after this list).
- Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
A ReLU-catalyzed abstraction method yields tighter bounds for transformer verification by converting dot-product constraints into ReLU forms that leverage standard convex relaxations (the standard ReLU relaxation involved is reproduced after this list).
- Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution (a reweighting sketch follows this list).
- LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering a 1.61x wall-clock speedup on MiniLM at a 0.95 confidence threshold, with 91.9% of inputs exiting by layer 7 (an early-exit sketch follows this list).
- Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding
Kernel Affine Hull Machines map lexical features to semantic embeddings via an RKHS and least-mean-squares fitting, outperforming adapters on reconstruction and retrieval metrics while cutting latency 8.5-fold on a legal benchmark (a kernel-regression sketch follows this list).
- LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Gopher, a 280 billion parameter language model, achieves state-of-the-art performance on the majority of 152 tasks with largest gains in reading comprehension, fact-checking, and toxic language detection.
- Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1, combined with response stabilization, improves the reliability and often the performance of compact models for cross-language code clone detection on language pairs such as Python-Java and Rust-Java.
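For the Validity-Calibrated Reasoning Distillation entry, here is a minimal sketch of what validity-weighted step imitation could look like, assuming a per-step validity scorer is available; the function name, tensor shapes, and normalization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def validity_weighted_distillation_loss(student_logits, teacher_steps, step_validity):
    """Weight each reasoning step's imitation loss by its relative local validity
    instead of imitating the full teacher trajectory uniformly.

    student_logits: list of (seq_len, vocab) tensors, one per teacher step
    teacher_steps:  list of (seq_len,) long tensors of teacher token ids
    step_validity:  (num_steps,) tensor of validity scores in [0, 1] (assumed scorer)
    """
    # Normalize so the weights reflect *relative* local validity across steps.
    weights = step_validity / (step_validity.sum() + 1e-8)
    per_step = torch.stack([
        F.cross_entropy(logits, tokens)
        for logits, tokens in zip(student_logits, teacher_steps)
    ])
    return (weights * per_step).sum()

# Illustrative call: 3 steps of length 5 over a 100-token vocabulary.
logits = [torch.randn(5, 100, requires_grad=True) for _ in range(3)]
steps = [torch.randint(0, 100, (5,)) for _ in range(3)]
loss = validity_weighted_distillation_loss(logits, steps, torch.tensor([1.0, 0.2, 0.8]))
```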
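For the ReLU-catalyzed verification entry, the "standard convex relaxations" it leverages are typically the triangle relaxation of a ReLU: for $y = \mathrm{ReLU}(x)$ with pre-activation bounds $l \le x \le u$ and $l < 0 < u$,

$$
y \ge 0, \qquad y \ge x, \qquad y \le \frac{u\,(x - l)}{u - l}.
$$

The paper's own conversion of dot-product constraints into ReLU forms is not reproduced here; the point is only that once a constraint is in ReLU form, these linear bounds apply directly.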
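For the distribution-corrected offline distillation entry, one common way to align offline teacher supervision with the student's own inference-time distribution is self-normalized importance weighting. The sketch below assumes per-example sequence log-probabilities from both models are available; it is a generic estimator, not the paper's exact correction.

```python
import torch

def correction_weights(student_logprob, teacher_logprob):
    """Self-normalized importance weights, proportional to
    p_student(y | x) / p_teacher(y | x), computed from sequence log-probabilities."""
    return torch.softmax(student_logprob - teacher_logprob, dim=0)

def corrected_distillation_loss(per_example_loss, student_logprob, teacher_logprob):
    """Re-weight offline teacher examples so supervision better matches the
    distribution the student itself produces at inference time."""
    w = correction_weights(student_logprob, teacher_logprob).detach()  # no grad through weights
    return (w * per_example_loss).sum()
```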
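For the LEAP entry, here is a sketch of the confidence-thresholded early-exit loop that the 0.95 threshold refers to; layers and exit_heads are assumed callables, and this illustrates the inference-time mechanism only, not the exit-aware pretraining objective.

```python
import torch

def early_exit_forward(layers, exit_heads, x, threshold=0.95):
    """Run layers in order; after each, an exit head produces class probabilities.
    If the head's top probability reaches the threshold, stop early.
    Returns the probabilities and the layer index at which inference exited."""
    probs, depth = None, 0
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)
        probs = torch.softmax(head(x), dim=-1)
        if probs.max().item() >= threshold:
            break  # confident enough: skip the remaining layers
    return probs, depth
```

Under this scheme, the reported 91.9% of inputs exiting by layer 7 means the break fires at depth 7 or earlier for most examples.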
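For the Kernel Affine Hull Machines entry, here is a sketch of a least-squares map in an RBF RKHS from lexical features to semantic embeddings, written as plain kernel ridge regression; the affine-hull construction itself is not reproduced, and gamma and reg are placeholder hyperparameters.

```python
import numpy as np

def fit_rkhs_map(X_lex, Y_sem, gamma=0.1, reg=1e-3):
    """Fit a least-squares map in an RBF RKHS from lexical feature vectors
    X_lex (n, d) to semantic embeddings Y_sem (n, k); returns dual coefficients."""
    sq = np.sum(X_lex ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X_lex @ X_lex.T))
    return np.linalg.solve(K + reg * np.eye(len(X_lex)), Y_sem)

def apply_rkhs_map(X_lex, X_query, alpha, gamma=0.1):
    """Map new lexical features X_query (m, d) into the semantic embedding space."""
    sq_x = np.sum(X_lex ** 2, axis=1)
    sq_q = np.sum(X_query ** 2, axis=1)
    K_q = np.exp(-gamma * (sq_q[:, None] + sq_x[None, :] - 2.0 * X_query @ X_lex.T))
    return K_q @ alpha
```

Because only a small kernel system is solved and applied on the query side, this kind of map is cheap at inference, which is consistent with the latency reduction the summary reports.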