# Quantized KV Cache

## FP8 KV Cache Overview
Efficient memory usage is crucial for working with large language models. Quantizing the KV (Key-Value) cache to FP8 format can significantly reduce its memory footprint. This optimization enables you to store more tokens in memory, leading to improved throughput and support for longer context windows.
Note: When using the Flash Attention 3 backend with FP8 KV cache, attention operations are also performed in the quantized (FP8) domain. In this configuration, queries are quantized to FP8 in addition to keys and values.
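If you want to pin the FlashAttention backend (and request FA3 where your GPU and vLLM build support it), the sketch below uses vLLM's `VLLM_ATTENTION_BACKEND` and `VLLM_FLASH_ATTN_VERSION` environment variables; treat the exact values and FA3 availability as assumptions to verify against your vLLM version.

```python
import os

# Request the FlashAttention backend and, if available, FlashAttention 3.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"

from vllm import LLM

# With FA3 + FP8 KV cache, attention runs in the FP8 domain (queries included).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",
)
```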
## Supported FP8 KV-Cache Quantization Schemes
vLLM supports two main quantization strategies for the FP8 KV-cache:
- Per-tensor quantization:
  A single scale is applied to each Q, K, and V tensor individually (`q_scale`/`k_scale`/`v_scale` have shape `[1]`).
- Per-attention-head quantization:
  Each scale corresponds to an attention head: `q_scale = [num_heads]`, `k_scale`/`v_scale = [num_kv_heads]`.
Note: Per-attention-head quantization is currently available only with the Flash Attention backend and requires the calibration pathway provided by llm-compressor.
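To make the difference concrete, here is an illustrative sketch (not vLLM internals) of how the two strategies differ in the shape of the resulting scales, assuming the standard FP8 E4M3 maximum of 448:

```python
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

k = torch.randn(2, 8, 128, 64)  # [batch, num_kv_heads, seq_len, head_dim]

# Per-tensor: one scale for the whole K tensor -> shape [1]
k_scale_tensor = k.abs().amax().reshape(1) / FP8_MAX

# Per-attention-head: one scale per KV head -> shape [num_kv_heads]
k_scale_head = k.abs().amax(dim=(0, 2, 3)) / FP8_MAX

print(k_scale_tensor.shape, k_scale_head.shape)  # torch.Size([1]) torch.Size([8])
```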
## Scale Calibration Approaches
You can configure how the quantization scales are computed in vLLM using three different approaches:
- No calibration (default scales):
  All quantization scales are set to `1.0`.
  Configure with: `kv_cache_dtype="fp8", calculate_kv_scales=False`
- Random token calibration (on-the-fly):
  Scales are automatically estimated from a single batch of random tokens during warmup and then fixed.
  Configure with: `kv_cache_dtype="fp8", calculate_kv_scales=True`
- [Recommended] Calibration with a dataset (via llm-compressor):
  Scales are estimated using a curated calibration dataset for maximum accuracy.
  This requires the llm-compressor library. See the example below.
## Additional `kv_cache_dtype` Options
kv_cache_dtype="auto": Use the model's default data typekv_cache_dtype="fp8_e4m3": Supported on CUDA 11.8+ and ROCm (AMD GPUs)kv_cache_dtype="fp8_e5m2": Supported on CUDA 11.8+
## Examples

### 1. No Calibration (`kv_cache_dtype="fp8"`, `calculate_kv_scales=False`)
All quantization scales are set to 1.0.
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.8)

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
    calculate_kv_scales=False,
)

prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
```
### 2. Random Token Calibration (`kv_cache_dtype="fp8"`, `calculate_kv_scales=True`)
Scales are automatically estimated from a single batch of tokens during warmup.
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.8)

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)

prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
```
### 3. [Recommended] Calibration Using a Dataset (with llm-compressor)
For the highest-quality quantization, we recommend calibrating against a dataset using llm-compressor. This enables advanced strategies such as per-attention-head quantization.
#### Install the required package
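llm-compressor is published on PyPI as `llmcompressor`:

```bash
pip install llmcompressor
```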
#### Example: Quantize Llama Attention & KV Cache to FP8
"""
Quantize Llama attention + KV cache to FP8 (choose either 'tensor' or 'attn_head' strategy)
using llm-compressor one-shot calibration.
"""
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs
# -----------------------------
# Config
# -----------------------------
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
STRATEGY = "tensor" # or "attn_head"
NUM_CALIB_SAMPLES = 512 # Good starting value
MAX_SEQ_LEN = 2048
# -----------------------------
# Helpers
# -----------------------------
def process_and_tokenize(example, tokenizer: AutoTokenizer):
"""Convert chat messages to tokens."""
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
return tokenizer(
text,
padding=False,
max_length=MAX_SEQ_LEN,
truncation=True,
add_special_tokens=False,
)
def build_recipe(strategy: str) -> QuantizationModifier:
fp8_args = QuantizationArgs(num_bits=8, type="float", strategy=strategy)
return QuantizationModifier(
config_groups={
"attention": QuantizationScheme(
targets=["LlamaAttention"], # Quantize queries: q_scale
input_activations=fp8_args,
)
},
kv_cache_scheme=fp8_args, # Quantize KV cache: k/v_scale
)
# -----------------------------
# Main
# -----------------------------
def main():
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIB_SAMPLES}]")
ds = ds.shuffle(seed=42)
ds = ds.map(
lambda ex: process_and_tokenize(ex, tokenizer),
remove_columns=ds.column_names,
)
recipe = build_recipe(STRATEGY)
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQ_LEN,
num_calibration_samples=NUM_CALIB_SAMPLES,
)
save_dir = f"{MODEL_ID.rstrip('/').split('/')[-1]}-kvattn-fp8-{STRATEGY}"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
if __name__ == "__main__":
main()
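After calibration, the saved checkpoint can be served with vLLM. A minimal sketch, assuming the default `STRATEGY = "tensor"` so `save_dir` resolves to the directory name below; the calibrated q/k/v scales are read from the checkpoint, so `calculate_kv_scales` is not needed:

```python
from vllm import LLM, SamplingParams

# Load the llm-compressor output and enable the FP8 KV cache with its stored scales.
llm = LLM(
    model="Llama-3.1-8B-Instruct-kvattn-fp8-tensor",
    kv_cache_dtype="fp8",
)

out = llm.generate("London is the capital of", SamplingParams(temperature=0.7, top_p=0.8))
print(out[0].outputs[0].text)
```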
For more detailed and up-to-date examples, see the llm-compressor official examples.