Why Do LLMs Generate Different Responses?
We can start to explain the variations in LLM responses by revisiting one of the earliest models from OpenAI: the Generative Pre-trained Transformer (GPT), introduced in their seminal paper [1].
(Below is the GPT model architecture as shown in that paper.)
Introduction
The more we work with LLMs and generative AI, the more we notice how hard it is to get exactly the same response twice. While the architectures of most proprietary models aren't public, many of them build on the transformer decoder architecture.
This decoder is autoregressive: it generates text one token (roughly a word or word piece) at a time, predicting each new token from everything generated so far, which is one reason we call LLMs generative AI. While attention mechanisms have evolved beyond what was originally introduced [3], we won't dive into the variants here.
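As a rough sketch of that loop (not any particular model's implementation; `model` here is a hypothetical function that returns one logit per vocabulary entry):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def generate(model, prompt_tokens, max_new_tokens=20):
    """Autoregressive decoding: each new token is predicted from everything
    generated so far and then appended to the context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                       # one score per vocabulary entry
        probs = softmax(np.asarray(logits))          # scores -> probability distribution
        next_token = int(np.random.choice(len(probs), p=probs))  # sample, not argmax
        tokens.append(next_token)
    return tokens
```

Because the next token is sampled from a distribution rather than picked deterministically, two runs over the same prompt can diverge after the very first step.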
Probabilities with softmax
In open-weight models like GPT, transformer layers rely heavily on attention. These attention layers score how strongly each token should attend to every other token, pass those scores through a softmax function to obtain probability-like weights, and use the weights to form weighted sums of the inputs.
As shown in the equation below, attention scores are computed from Query and Key vectors and then used to weight the Value vectors. This lets the model capture the relationships between tokens; word order itself is injected separately through positional embeddings.
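The equation in question is the scaled dot-product attention from [3]:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where \(d_k\) is the dimension of the key vectors.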
Multi-head attention runs this operation in parallel across several attention "heads."
(An illustration of this is shown below [3].)
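A minimal NumPy sketch of that operation, assuming a single head and leaving out masking and the learned projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilize before exponentiating
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Each row of the result is a
    softmax-weighted sum of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V

# Example: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```

Multi-head attention simply runs several such computations side by side on different learned projections and concatenates the results.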
In GPT-1, this transformer block was stacked 12 times, giving the model a hierarchical ability to build complex sentence structures. Despite being small by today's standards (roughly 117 million parameters), it was an important breakthrough.
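In code, the stacking amounts to applying the same kind of block over and over; a hypothetical sketch, with layer sizes taken from the GPT-1 paper [1] and the block internals omitted:

```python
# Rough shape of the GPT-1 stack: the same decoder block repeated 12 times
# over 768-dimensional hidden states with 12 attention heads (per [1]).
GPT1_CONFIG = {"n_layers": 12, "d_model": 768, "n_heads": 12}

def decoder_stack(hidden_states, blocks):
    """Run the hidden states through each block in turn; depth lets later
    blocks build on representations produced by earlier ones."""
    for block in blocks:   # each block: masked self-attention + feed-forward sub-layers
        hidden_states = block(hidden_states)
    return hidden_states
```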
Still, we don’t fully understand how all these layers and probabilities mix to form what we perceive as “knowledge.”
The Final Trick: Output Softmax
LLMs also apply softmax in the output layer, turning the model's final scores (logits) into a probability distribution over the vocabulary from which the next token is sampled. This is why many LLM frameworks expose parameters such as top_k and temperature: top_k limits how many candidate tokens are considered, while temperature controls how sharply the distribution is peaked.
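A minimal sketch of how these two knobs typically combine with the output softmax (the function name and defaults here are illustrative, not any specific framework's API):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50):
    """Scale logits by temperature, keep only the top_k highest-scoring
    tokens, renormalize with softmax, then sample one token id."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]            # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)  # drop everything below it
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Lower temperatures sharpen the distribution toward greedy decoding, while a smaller top_k narrows the pool of candidate tokens; neither setting removes the randomness entirely.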
Stochastic Optimization with Adam
During training, the model's parameters are updated with gradient-based optimization, typically Adam, a popular stochastic optimizer [2]. Each update is computed from a randomly sampled minibatch, and Adam smooths these noisy gradients with exponential moving averages of their first and second moments, so randomness enters at every step.
This randomness — from both data sampling and weight updates — contributes to a model that generalizes across different tasks and prompts.
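For intuition, here is a stripped-down version of the Adam update from [2] for a single parameter vector, with the default hyperparameters from the paper; the stochasticity enters through whichever minibatch produced `grad`:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    squared gradient (v), bias-corrected, then a scaled parameter step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction for the first moment
    v_hat = v / (1 - beta2**t)          # bias correction for the second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```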
From Open Models to Closed Ones
OpenAI later released GPT-2 with open weights as well. Starting with GPT-3, however, the weights were no longer shared publicly. A fine-tuned descendant of GPT-3 (GPT-3.5) became the backbone of the first ChatGPT release.
Summary: What Drives Variation in LLM Outputs
Here are three key reasons why LLMs generate different responses:
1. LLMs are not simple statistical models. While language is modeled as a conditional probability distribution, LLMs are neural networks with billions of parameters, far beyond traditional statistical models.
2. Sampling from softmax introduces controlled randomness. Softmax appears both in the attention layers and in the output layer, where it weights possible tokens and defines the distribution that responses are sampled from.
3. Training is inherently stochastic. The optimization process is randomized (minibatch sampling combined with optimizers like Adam), so even a single LLM may respond differently each time, and two similarly trained models can behave differently.
Have you noticed unexpected variations in LLM outputs? How do you handle that in production? 👇
📚 References:
[1] Improving Language Understanding by Generative Pre-Training (Radford et al., OpenAI)
[2] Adam: A Method for Stochastic Optimization (Kingma & Ba)
[3] Attention Is All You Need (Vaswani et al.)