NVIDIA-Certified Associate: Generative AI LLMs
Question 1 of 30
Core Machine Learning and AI Knowledge
In the context of Transformer-based Large Language Models, what is the primary purpose of Key-Value (KV) caching during inference?
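For review purposes, here is a minimal NumPy sketch of the mechanism the question refers to. All names, weights, and dimensions (e.g. `W_q`, `d_model`) are hypothetical, single-head, and simplified; the point is only to show that with a KV cache, each decoding step computes keys and values for the new token alone and reuses the cached ones, instead of recomputing them for the entire prefix.

```python
import numpy as np

# Hypothetical dimensions and weights for one attention head (illustrative only).
d_model = 8
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d_model)        # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over past positions
    return weights @ V                       # (d_model,)

# KV cache: keys/values of already-processed tokens, grown once per step.
K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))

def decode_step(x):
    """One autoregressive step: only the NEW token's K/V are computed;
    earlier tokens' K/V come from the cache rather than being recomputed
    from the whole prefix at every step."""
    global K_cache, V_cache
    q = W_q @ x
    K_cache = np.vstack([K_cache, W_k @ x])  # append the new key
    V_cache = np.vstack([V_cache, W_v @ x])  # append the new value
    return attend(q, K_cache, V_cache)

# Each generation step now costs O(seq_len) attention work instead of
# recomputing K/V for all previous tokens.
for t in range(4):
    out = decode_step(rng.standard_normal(d_model))
```

The trade-off to keep in mind when answering: the cache spends GPU memory (it grows linearly with sequence length) to avoid redundant computation during autoregressive inference.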