CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.
InInternational Conference on Learning Representations
9 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or layer-wise baselines.
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
CCE- is a Triton kernel implementation of cross-entropy loss with negative sampling that reduces memory by more than 10x and accelerates training by up to 2x for large-catalog sequential recommenders.
LongSpec achieves up to 3.26x speedup over Flash Attention baselines on long-context datasets via memory-efficient drafting and verification techniques.
Genome-Factory is an open-source Python library that integrates data pipelines, model tuning, inference, benchmarks, and biological interpretation for genomic foundation models.
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
citing papers explorer
-
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.
-
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
-
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or layer-wise baselines.
-
Bayesian Preference Learning for Test-Time Steerable Reward Models
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
-
Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs
CCE- is a Triton kernel implementation of cross-entropy loss with negative sampling that reduces memory by more than 10x and accelerates training by up to 2x for large-catalog sequential recommenders.
-
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
LongSpec achieves up to 3.26x speedup over Flash Attention baselines on long-context datasets via memory-efficient drafting and verification techniques.
-
Genome-Factory: A Library for Tuning, Deploying, and Interpreting Genomic Foundation Models
Genome-Factory is an open-source Python library that integrates data pipelines, model tuning, inference, benchmarks, and biological interpretation for genomic foundation models.
-
NVILA: Efficient Frontier Visual Language Models
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
- Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation