{"paper":{"title":"FitNets: Hints for Thin Deep Nets","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A deeper but much thinner student network can outperform its larger teacher by using intermediate layer hints during training.","cross_cats":["cs.NE"],"primary_cat":"cs.LG","authors_text":"Adriana Romero, Antoine Chassang, Carlo Gatta, Nicolas Ballas, Samira Ebrahimi Kahou, Yoshua Bengio","submitted_at":"2014-12-19T22:40:51Z","abstract_excerpt":"While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the added mapping parameters can reliably transfer useful intermediate knowledge from teacher to a much smaller student layer without the extra capacity causing overfitting or unstable training.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FitNets trains deeper thinner students using teacher intermediate representations as hints plus mapping parameters, yielding a CIFAR-10 student with 10.4x fewer parameters that beats the teacher.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A deeper but much thinner student network can outperform its larger teacher by using intermediate layer hints during training.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"60684e41e36e5149e25cc689a176be470ddf8e46d03f9ec70cdbd45de6cbee81"},"source":{"id":"1412.6550","kind":"arxiv","version":4},"verdict":{"id":"51b0bf8b-5985-4814-b3a2-fdf7890eeb68","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T01:41:27.991454Z","strongest_claim":"For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.","one_line_summary":"FitNets trains deeper thinner students using teacher intermediate representations as hints plus mapping parameters, yielding a CIFAR-10 student with 10.4x fewer parameters that beats the teacher.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the added mapping parameters can reliably transfer useful intermediate knowledge from teacher to a much smaller student layer without the extra capacity causing overfitting or unstable training.","pith_extraction_headline":"A deeper but much thinner student network can outperform its larger teacher by using intermediate layer hints during training."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d9c0ff73509e7ea2479d29d13139e226c273d36054b129fa0e2ea1185e6392fc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}