Well-read students learn better: The impact of student initialization on knowledge distillation
8 Pith papers cite this work.
citing papers explorer
- LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
  LLM-ODE integrates large language models into genetic programming (GP) to guide the symbolic search for governing equations of dynamical systems. It outperforms classical GP in both efficiency and solution quality on 91 test cases.
- DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
  DiffuSeq adapts diffusion models to conditional sequence-to-sequence text generation. It reports performance matching or exceeding strong baselines, including pretrained language model systems, while generating more diverse outputs.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  ALBERT reduces BERT's parameter count via embedding factorization and cross-layer parameter sharing, adds an inter-sentence coherence pretraining objective, and reaches state-of-the-art results on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
- Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
  Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.
- Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers
  Spectrum-adaptive post-hoc generalization bounds for multi-layer Transformers are derived using layerwise Schatten quantities whose indices are chosen after training based on singular-value profiles.
- Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
  In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces weak-to-strong (W2S) generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.
- Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
  Denoising Student distills the multi-step denoising process of score-based and diffusion models into a single forward pass, matching GAN sampling speed while producing comparable sample quality on CIFAR-10, CelebA, and 256x256 LSUN.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  DistilBERT reduces the size of BERT by 40% via pre-training distillation with a triple loss, retaining 97% of its language understanding performance and running 60% faster.