FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping

Aditya Akella; Ajay Jaiswal; Bodun Hu; Lu Yin; Shiwei Liu; Tianlong Chen; Yeonju Ro

arxiv: 2404.03865 · v1 · pith:AUXTTGYUnew · submitted 2024-04-05 · 💻 cs.CL · cs.LG

FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping

Ajay Jaiswal , Bodun Hu , Lu Yin , Yeonju Ro , Shiwei Liu , Tianlong Chen , Aditya Akella This is my paper

classification 💻 cs.CL cs.LG

keywords autoregressivegenerationffn-skipllmlayersllmsacrossblockscache

0 comments

read the original abstract

Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate computation overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success due to the redundancy across LLMs layers on metrics like Rough-L/BLUE, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of wrong facts, and noticeable performance drop even at the trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early-exit. In this work, we observed the saturation of computationally expensive feed-forward blocks of LLM layers and proposed FFN-SkipLLM, which is a novel fine-grained skip strategy of autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of FFN blocks of LLMs with marginal change in performance on knowledge-intensive generation tasks without any requirement to handle KV cache. Our extensive experiments and ablation across benchmarks like MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple and ease-at-use method can facilitate faster autoregressive decoding.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RAP: Runtime Adaptive Pruning for LLM Inference
cs.LG 2025-05 unverdicted novelty 5.0

RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.