Lossless Compression Improves LLM Inference Efficiency on GPUs

Large language models (LLMs) are difficult to deploy efficiently on resource-constrained hardware because of their sheer size: the weights alone can exceed the memory of a single GPU, which limits inference speed and makes deployment on individual GPUs or smaller systems difficult. A new approach to this problem is lossless compression of LLMs, which reduces model size while leaving the model's outputs exactly unchanged.

Dynamic-Length Encoding for More Efficient Storage

A promising method in this area is "Dynamic-Length Float" (DFloat11) compression. The method exploits the low entropy of the BFloat16 weight representation used by most LLMs. BFloat16 is a 16-bit floating-point format (1 sign bit, 8 exponent bits, 7 mantissa bits) commonly used to reduce memory requirements and speed up training and inference. In trained models the weight values, and in particular their exponent bits, are far from uniformly distributed, so DFloat11 analyzes their frequency distribution and assigns dynamic-length encodings in the style of Huffman coding: frequently occurring values receive shorter codes, rarer values receive longer ones. This yields near information-optimal compression without any loss of accuracy, because the original BFloat16 weights can be reconstructed exactly.
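
To illustrate the idea, here is a minimal sketch in Python. It is not the DFloat11 implementation: the synthetic Gaussian weights, the truncation-based BFloat16 conversion, and all names are illustrative. It simply counts the 8-bit exponents of BFloat16-like weights, builds a Huffman code from the frequencies, and reports how far below 8 bits the exponents can be stored on average.

```python
import heapq
from collections import Counter
from itertools import count

import numpy as np

# Synthetic stand-in for trained BFloat16 weights (illustrative only).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)

# Take the upper 16 bits of each float32 as its BFloat16 pattern
# (simple truncation, good enough for illustration), then split off
# the 8 exponent bits (bits 14..7 of the 16-bit word).
bf16 = (weights.view(np.uint32) >> 16).astype(np.uint16)
exponents = ((bf16 >> 7) & 0xFF).astype(np.uint8)

# Build a Huffman code over the exponent values: frequent exponents
# get short codes, rare ones get longer codes.
freq = Counter(exponents.tolist())
tiebreak = count()
heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
    n1, _, c1 = heapq.heappop(heap)
    n2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c1.items()}
    merged.update({s: "1" + code for s, code in c2.items()})
    heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
codebook = heap[0][2]

# Average bits per exponent vs. the fixed 8 bits in plain BFloat16.
avg_bits = sum(freq[s] * len(codebook[s]) for s in freq) / len(exponents)
print(f"distinct exponent values: {len(freq)}")
print(f"average code length: {avg_bits:.2f} bits (vs. 8 bits uncompressed)")
```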

GPU Kernel for Fast Decompression

To make inference with dynamic-length encodings efficient, a specialized GPU kernel was developed for fast online decompression. It is designed to decompress the weights during inference with minimal latency and relies on several optimization techniques: memory-intensive lookup tables (LUTs) are decomposed into smaller LUTs that fit into fast on-chip SRAM, and a two-phase design coordinates the read and write positions of the GPU threads, which is necessary because with variable-length codes a thread's output location depends on how many values the preceding threads decode. Decompression is performed at the level of entire transformer blocks to keep the latency overhead low.
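
The two-phase coordination can be sketched with a highly simplified CPU simulation. This is not the actual CUDA kernel: the toy codebook, the `decode_chunk` helper, and the chunking at code boundaries are illustrative simplifications, and the Python dictionary stands in for the compact lookup tables that the real kernel keeps in SRAM. The point is only the pattern: count first, prefix-sum the counts to get write offsets, then decode and write independently.

```python
from itertools import accumulate

# Toy prefix code standing in for a DFloat11-style codebook
# (symbols and codes are made up for illustration).
codebook = {"A": "0", "B": "10", "C": "110", "D": "111"}
decode_table = {code: sym for sym, code in codebook.items()}

def encode(symbols):
    return "".join(codebook[s] for s in symbols)

def decode_chunk(bits):
    """Decode one chunk of the bit stream with a prefix-code lookup."""
    out, current = [], ""
    for b in bits:
        current += b
        if current in decode_table:      # table lookup, LUT-style
            out.append(decode_table[current])
            current = ""
    return out

# A compressed stream split into chunks that each start on a code
# boundary (a simplification; the real kernel handles arbitrary splits).
symbols = list("ABACADABBACDDA")
chunks = [encode(symbols[i:i + 4]) for i in range(0, len(symbols), 4)]

# Phase 1: every "thread" counts how many values its chunk decodes to;
# an exclusive prefix sum over the counts yields each thread's write offset.
counts = [len(decode_chunk(c)) for c in chunks]
offsets = [0, *accumulate(counts)][:-1]

# Phase 2: threads decode independently and write to their own offsets,
# so the output order matches the original weight order.
output = [None] * sum(counts)
for chunk, off in zip(chunks, offsets):
    for i, sym in enumerate(decode_chunk(chunk)):
        output[off + i] = sym

assert output == symbols
print("decoded", len(output), "values in", len(chunks), "chunks")
```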

Experimental Results and Application Scenarios

Experiments with various LLMs, including Llama-3.1, Qwen-2.5, and Gemma-3, show that DFloat11 reduces model size by approximately 30% while producing exactly the same outputs as the uncompressed model. Compared to the alternative of offloading parts of the uncompressed model to the CPU, DFloat11 achieves significantly higher token-generation throughput, and with a fixed GPU memory budget it also allows significantly longer context lengths. Particularly noteworthy is that an extremely large model like Llama-3.1-405B (810 GB of BFloat16 weights) can run with DFloat11 on a single node with 8x80 GB GPUs, without CPU offloading and without any loss of accuracy.
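
A quick back-of-the-envelope check makes the last claim plausible. The ~30% reduction factor is simply the figure quoted above (the exact compressed size reported in the paper may differ slightly), and the check covers the weights only, not activations or the KV cache:

```python
# Rough memory-budget check using the figures quoted above.
uncompressed_gb = 810          # Llama-3.1-405B weights in BFloat16
reduction = 0.30               # ~30% size reduction reported for DFloat11
gpus, gb_per_gpu = 8, 80       # one node with 8x80 GB GPUs

compressed_gb = uncompressed_gb * (1 - reduction)   # about 567 GB
budget_gb = gpus * gb_per_gpu                       # 640 GB

print(f"compressed weights: ~{compressed_gb:.0f} GB")
print(f"GPU memory budget:  {budget_gb} GB")
print("weights fit without CPU offloading:", compressed_gb < budget_gb)
```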

Outlook and Potential

DFloat11 offers a promising solution for efficient inference of LLMs on GPUs. The lossless compression allows for a significant reduction in memory requirements and an increase in inference speed without compromising accuracy. This opens up new possibilities for the use of LLMs on resource-constrained devices and enables the execution of larger models on existing hardware. Future research could focus on further optimizing the decompression kernel and applying DFloat11 to other model architectures.
