TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.