Faster Language Models: Self-Speculative Decoding for Accelerating LLMs
Large language models (LLMs) have revolutionized the world of Artificial Intelligence. Their ability to generate human-like text opens up unprecedented possibilities in various fields, from automated text production to advanced chatbots. However, the impressive performance of LLMs comes at a price: high computational costs and long inference times. A promising approach to address this challenge is self-speculative decoding.
Self-speculative decoding follows a two-stage approach. In the first stage, the drafting phase, a lightweight draft model, typically obtained by selectively skipping layers of the original LLM, quickly proposes a short sequence of candidate tokens. These candidates serve as a "draft" of the final text. In the second stage, the verification phase, the full LLM checks all proposed tokens in a single forward pass. This parallel verification is the key to the speedup: drafted tokens are accepted up to the first position where the full model disagrees, and at that position the full model's own prediction is used instead of the rejected draft token. Rejections cost some extra computation, but they guarantee that the output quality matches ordinary decoding with the full LLM.
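The following is a minimal sketch of this draft-then-verify loop for greedy decoding. The callables `draft_next` and `full_next` are hypothetical stand-ins for the layer-skipping draft pass and the full model; in a real system both would be forward passes over the same network, and verification would score all draft positions in one batched pass rather than a Python loop.

```python
from typing import Callable, List

def speculative_generate(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft pass (layers skipped)
    full_next: Callable[[List[int]], int],    # full LLM's greedy next token
    max_new_tokens: int = 16,
    draft_len: int = 4,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # Drafting phase: the cheap model proposes draft_len tokens autoregressively.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # Verification phase: compare the draft with the full model's predictions.
        accepted = 0
        correction = None
        for i, tok in enumerate(draft):
            target = full_next(tokens + draft[:i])
            if target == tok:
                accepted += 1
            else:
                # First disagreement: discard the rest of the draft and keep
                # the full model's own token instead.
                correction = target
                break

        tokens.extend(draft[:accepted])
        generated += accepted
        if correction is not None and generated < max_new_tokens:
            tokens.append(correction)
            generated += 1
    return tokens[: len(prompt) + max_new_tokens]

# Toy usage: both "models" simply count upward, so every drafted token is accepted.
count_up = lambda ctx: ctx[-1] + 1
print(speculative_generate([1, 2, 3], count_up, count_up, max_new_tokens=8))
```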
The advantage of this method lies in the combination of speed and accuracy. The drafting phase benefits from the faster inference of the reduced model, while the verification phase ensures that the final output matches the quality of the full LLM. Another advantage is the simple implementation: because the draft model is just a subset of the original model's layers, self-speculative decoding requires neither training an auxiliary draft network nor additional memory for extra weights. It is thus a cost-effective and easy-to-integrate way to accelerate LLMs.
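To make the "no extra weights" point concrete, here is a small illustrative sketch (not the actual implementation from the literature): the draft pass runs the same stack of layers as the full pass and simply skips a preselected subset, so there is nothing new to store. The `skip` set here is a hypothetical placeholder; in practice the layers to skip are chosen offline by searching for the omissions that hurt draft quality least.

```python
# Illustrative sketch: the draft model is the full model with some layers skipped.
def forward(hidden, layers, skip=frozenset()):
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # skipped layers add no compute and require no extra weights
        hidden = layer(hidden)
    return hidden

# Toy "layers" standing in for transformer blocks.
layers = [lambda h, k=k: h + k for k in range(8)]
full_out = forward(1.0, layers)                      # full pass uses all 8 layers
draft_out = forward(1.0, layers, skip={1, 3, 5, 7})  # cheaper draft pass, same weights
print(full_out, draft_out)
```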
Current research results are promising. Benchmarks with LLaMA-2 and related models show inference speedups of up to a factor of about 2 with no loss in output quality. Researchers are also refining the method, for example with adaptive draft lengths: instead of always proposing a fixed number of tokens in the drafting phase, drafting is cut short as soon as the draft model becomes unsure of its own predictions, as sketched below. This reduces the number of rejected draft tokens and thus the costly corrections by the full LLM.
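A minimal sketch of such an adaptive draft length is shown below, assuming a confidence-based stopping rule: drafting ends early once the draft model's probability for its own greedy token falls under a threshold. The name `draft_step`, the threshold value, and the toy confidence curve are all assumptions for illustration, not the exact criterion used in any particular paper.

```python
from typing import Callable, List, Tuple

def adaptive_draft(
    context: List[int],
    draft_step: Callable[[List[int]], Tuple[int, float]],  # (greedy token, its probability)
    max_draft_len: int = 8,
    confidence_threshold: float = 0.6,
) -> List[int]:
    draft: List[int] = []
    for _ in range(max_draft_len):
        token, prob = draft_step(context + draft)
        if prob < confidence_threshold:
            break  # low confidence: stop drafting and hand over to verification early
        draft.append(token)
    return draft

# Toy usage: confidence decays with context length, so drafting stops after two tokens.
def toy_draft_step(ctx: List[int]) -> Tuple[int, float]:
    return ctx[-1] + 1, 0.9 ** len(ctx)

print(adaptive_draft([5, 6, 7], toy_draft_step))
```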
Self-speculative decoding is thus a promising approach for accelerating LLMs, with the potential to bring these powerful models to real-time use cases and resource-constrained environments. Ongoing research in this area promises further improvements and optimizations that will continue to raise the performance and efficiency of LLMs.