What an LLM actually does
An LLM does just one thing: it reads some text and predicts which token comes next, then appends that token and predicts again.[1] Everything inside the model exists to make that single guess accurate.
What follows traces one sentence through that machinery, stage by stage, in the order the numbers actually flow.
From text to numbers
A model does math on numbers, not text, so the first job is to convert the input into numbers. The piece that does this is the tokenizer: it chops the text into common chunks called tokens and gives each chunk an integer ID drawn from a fixed vocabulary.
A token is usually a fragment of a word rather than a whole one, so "tokenization" may come back as "token" and "ization". Keeping reusable fragments lets a vocabulary of roughly 100,000 entries cover almost any word, including ones absent from training.
Because the model only ever sees these IDs, it has no direct view of the letters inside them. That is why counting the r's in "strawberry" trips it up: the word arrives as a couple of tokens, with no record of the individual letters.
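The chunking step can be sketched in a few lines, assuming a greedy longest-match rule and a tiny invented vocabulary. Real tokenizers usually rely on learned merge rules such as byte-pair encoding, so `TOY_VOCAB`, `tokenize`, and `encode` are hypothetical names, and every ID here is made up for illustration.

```python
# Toy vocabulary: chunk -> integer ID. All entries are invented for illustration.
TOY_VOCAB = {"token": 4521, "ization": 9832, "straw": 301, "berry": 302}
# Single letters act as a fallback so any lowercase word can be encoded.
for offset, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    TOY_VOCAB.setdefault(ch, offset)


def tokenize(text: str) -> list[str]:
    """Split text into the longest vocabulary chunks, scanning left to right."""
    chunks = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                chunks.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return chunks


def encode(text: str) -> list[int]:
    """Map text to the integer IDs the model actually sees."""
    return [TOY_VOCAB[chunk] for chunk in tokenize(text)]


print(tokenize("tokenization"))  # ['token', 'ization']
print(encode("strawberry"))      # [301, 302]: two IDs, no trace of the letters
```

Once "strawberry" has become two integers, nothing downstream can count its r's without having learned that fact some other way.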
An ID by itself is just an index with no meaning, so each one looks up a row in a large table called the embedding matrix. That row is a vector of a few thousand numbers, and training shapes the table so tokens that behave alike get similar vectors.
"tokenization"
│ split into known chunks
▼
"token" "ization"
│ each chunk → an integer ID
▼
4521 9832
│ each ID → a row of the embedding matrix
▼
[0.21, -1.08, …] [-0.40, 0.92, …]
Those vectors are what people mean when they say a model captures meaning as numbers: tokens used in similar ways sit close together, so the geometry itself encodes that "king" and "queen" are related.
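As a rough illustration of the lookup and of "close together", here is a sketch with a hand-picked 4-dimensional table. The table `EMBEDDINGS`, its IDs, and its values are all assumptions made for the example; a trained table has far more rows and columns, and its values are learned rather than chosen.

```python
import math

# Hypothetical embedding table: token ID -> row vector (hand-picked, not learned).
EMBEDDINGS = {
    0: [0.90, 0.80, 0.10, 0.05],   # "king"
    1: [0.85, 0.75, 0.20, 0.10],   # "queen"
    2: [0.05, 0.10, 0.90, 0.80],   # "banana"
}


def embed(token_ids):
    """Look up one row per ID; the embedding step is just this table lookup."""
    return [EMBEDDINGS[i] for i in token_ids]


def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


king, queen, banana = embed([0, 1, 2])
print(cosine(king, queen) > cosine(king, banana))  # True for these hand-picked rows
```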
Adding order
The embedding step gives every token a vector for its meaning, but says nothing about where it sat in the sentence. Without that, the model would read "dog bites man" and "man bites dog" as the same bag of tokens, since both hold the same three vectors.
To put order back, most modern models use a scheme called RoPE, short for rotary position embeddings. Inside each attention layer, it rotates a token's query and key vectors (introduced in the next section) by an angle that depends on its position, so the direction itself encodes where the token sits.
Encoding position as an angle pays off: when two tokens are compared, the result depends only on the difference in their rotations, so a given gap reads the same near the start or deep in a document. That partly explains how RoPE-based models can be stretched to inputs longer than any they trained on.
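The claim that only the gap matters can be checked numerically. The sketch below assumes one common RoPE convention (consecutive coordinate pairs, a base of 10000); implementations differ in how they pair coordinates, and `rotate_pairs` is a hypothetical helper. The vectors `q` and `k` stand in for the query and key vectors that attention compares, which is where RoPE is normally applied.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate each consecutive pair of coordinates by an angle proportional to position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; earlier pairs rotate faster.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]   # made-up values
k = [1.1, 0.4, -0.6, 0.9]   # made-up values

# The same gap of 3 positions, once near the start and once far into the text.
near = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))
far = dot(rotate_pairs(q, 1005), rotate_pairs(k, 1002))
print(abs(near - far) < 1e-9)  # True: the score depends on the gap, not the offset
```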
Tokens reading each other
Words only mean something in context, so attention adds that context by letting each token read the others. The model builds three vectors per token: a query for what it seeks, a key for what it offers, and a value with the information it will share.
Two tokens match when the dot product of one token's query and the other's key is large. Softmax turns those scores into weights that sum to one, and each token's output is the average of all the values, weighted accordingly.
In "the cat that I saw yesterday was sleeping", the query from "was" matches the key from "cat", so "was" draws most of its value from "cat" and barely any from the words between them.
query "was" reads back over earlier tokens
the    cat    that   I      saw    yesterday   was
0.04   0.71   0.03   0.02   0.05   0.04        0.11   (sum to 1)
       ▲
       most value comes from "cat", little from the rest
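A minimal sketch of a single query reading a handful of keys and values, with made-up 2-dimensional vectors. Dividing the scores by the square root of the vector size is the usual scaled dot-product convention rather than something stated above, and `attend` is a hypothetical name.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One query reads a list of keys and values; returns (weights, output)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Made-up vectors for three earlier tokens.
keys = [[0.1, 0.0], [2.0, 2.0], [0.0, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.5, 1.5]  # points the same way as the second key

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])  # the second weight dominates
print([round(o, 3) for o in output])   # so the output is close to the second value
```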
A causal mask blocks the model from reading ahead by forcing the weight of any later token to zero. At each step it can only use tokens already produced or given, so it can't peek at the answer it is meant to predict.
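One common way to get those zeros is to set the masked scores to negative infinity before the softmax. The sketch below assumes that approach; `causal_weights` and the score matrix are invented for illustration.

```python
import math


def causal_weights(scores):
    """Row i holds token i's scores against every token; later tokens are masked out."""
    n = len(scores)
    rows = []
    for i in range(n):
        # Positions j > i become -inf, which softmax turns into a weight of exactly 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Made-up raw scores for a 3-token input.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.2, 3.0],
    [0.3, 0.9, 0.1],
]
for row in causal_weights(scores):
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, and the upper-right triangle of the weight matrix is all zeros, however large the raw scores there were.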
All of this runs in parallel across many attention heads, each with its own query, key, and value projections. The heads pick up different jobs during training: some track grammar, some link pronouns to their nouns, some follow position.
Where facts are stored
After attention mixes in context, each token is sent on its own through a two-layer network, the feed-forward network. It widens the vector to several times its size, applies a non-linear function such as GELU or SwiGLU, then projects it back down.
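The widen, activate, project shape can be sketched as follows, assuming GELU, no bias terms, random stand-in weights, and a 4x expansion. All of those are illustrative choices, and `feed_forward`, `w_up`, and `w_down` are hypothetical names.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def feed_forward(x, w_up, w_down):
    """Widen, apply the non-linearity, project back to the original size."""
    hidden = [gelu(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)


random.seed(0)
d_model, d_hidden = 4, 16  # 4x expansion, a common but not universal choice
# Random stand-ins for weights that training would normally set.
w_up = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
w_down = [[random.gauss(0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

x = [0.21, -1.08, 0.40, 0.92]
y = feed_forward(x, w_up, w_down)
print(len(x), len(y))  # 4 4: same size in and out
```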
This part of the block is where much of the model's stored knowledge lives. Researchers probing trained models have found single neurons that activate for specific concepts, such as the Eiffel Tower or a given programming language.
A technique called ROME shows how literal that storage can be: editing a small set of feed-forward weights moved the model's idea of where the Eiffel Tower stands to Rome, after which it answered follow-up questions as if that were true.
One block, stacked deep
Attention and the feed-forward network together form one transformer block, and a model is mostly that block repeated. Smaller models stack a few dozen, the largest well past a hundred, each copy holding its own trained weights.
Each block adds its result on top of its input instead of overwriting it, and that running sum is the residual stream. It runs unbroken from the first layer to the last, so each block refines what the earlier ones wrote.
tokens
│
▼
│ residual stream
├──► block 1 ──┐
│◄─────────────┘ adds its result back
├──► block 2 ──┐
│◄─────────────┘
⋮ (dozens to 100+ deep)
├──► block N ──┐
│◄─────────────┘
│
▼
prediction
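The add-don't-overwrite rule is short enough to write out. In this sketch the two "blocks" are trivial stand-ins (a real block is attention plus a feed-forward network), and `run_stack` is a hypothetical name.

```python
def run_stack(x, blocks):
    """Pass a vector through blocks, adding each result to the running sum."""
    stream = list(x)
    for block in blocks:
        update = block(stream)
        # Add, never overwrite: the input survives alongside the block's contribution.
        stream = [s + u for s, u in zip(stream, update)]
    return stream


# Stand-in blocks, invented for illustration.
def nudge_first(v):
    return [0.5, 0.0, 0.0]


def echo_tenth(v):
    return [0.1 * a for a in v]


print(run_stack([1.0, 2.0, 3.0], [nudge_first, echo_tenth]))
```

With an empty list of blocks the input comes back unchanged, which is the direct path the stream provides.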
That design is what lets such deep models train at all. The learning signal, the gradient that training nudges weights with, travels back through every layer, and the residual stream gives it a direct path so it doesn't fade to nothing.
A normalization step keeps the numbers in a stable range at each stage, so they don't grow too large or shrink to zero across so many layers. Most current models use a variant called RMSNorm and apply it to the input of each attention and feed-forward step rather than to the output, a placement that tends to train more stably.
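RMSNorm itself is a one-line formula: divide the vector by its root-mean-square, then multiply by a learned gain. The sketch assumes an all-ones gain and a small `eps` for numerical safety; both values are illustrative.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide by the root-mean-square of x, then apply a per-dimension gain."""
    rms = math.sqrt(sum(a * a for a in x) / len(x) + eps)
    return [g * a / rms for g, a in zip(gain, x)]


gain = [1.0, 1.0, 1.0, 1.0]  # learned in a real model; all ones here
small = rms_norm([0.01, -0.02, 0.03, 0.04], gain)
large = rms_norm([100.0, -200.0, 300.0, 400.0], gain)
print([round(a, 3) for a in small])
print([round(a, 3) for a in large])  # nearly identical: scale removed, direction kept
```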
Predicting the next token
Once the residual stream reaches the top of the stack, the model takes the vector at the last position and compares it against the whole vocabulary, producing one raw score, called a logit, for every token that could come next.
Softmax turns those scores into a probability distribution over the vocabulary, and the model samples its next token from it. A temperature setting controls randomness: low values stick to the likeliest tokens, higher ones let rarer ones through.
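Temperature is usually applied by dividing the logits before the softmax. A small sketch under that assumption, with made-up logits for a four-token vocabulary (`softmax_with_temperature` and `sample` are hypothetical names, and a temperature of exactly zero would need separate handling):

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize into probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At 0.2 nearly all the probability sits on the top token; at 2.0 the tail gets a real share.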
The chosen token is appended to the input, and the whole process runs again for the token after it. To avoid redoing the work each round, the model caches the keys and values it already computed, so each new step only processes the token just added.
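The caching idea can be sketched for a single attention head. The `KVCache` class and its `step` method are hypothetical, the vectors are made up, and a real model keeps one such cache per layer and per head.

```python
import math


class KVCache:
    """Stores the key and value vectors of every token processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Add the newest token's key and value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, kk)) / scale for kk in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(value)
        return [sum(w * v[d] for w, v in zip(weights, self.values)) for d in range(dim)]


cache = KVCache()
# Made-up (query, key, value) triples, one per generated token.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, -1.0]),
]
for q, k, v in steps:
    out = cache.step(q, k, v)  # only the new token's vectors are computed here
    print(len(cache.keys), [round(o, 3) for o in out])
```

Because the cache only ever contains earlier tokens, the causal constraint holds here without an explicit mask.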
Every big model is the same machine
Walk that pipeline end to end and you have described the design behind GPT, Claude, Gemini, and LLaMA at once. They sit in the same transformer family, and the recent models whose architectures are published largely agree on the details: RoPE for position, RMSNorm before each block, SwiGLU in the feed-forward layers.
Many of the largest models add a twist called Mixture-of-Experts: many feed-forward networks in place of one, with each token routed through only a few. DeepSeek-V3 holds 671B parameters but uses about 37B per token, which makes each token far cheaper to compute than the total size suggests.
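A rough sketch of the routing step, assuming top-2 selection with a softmax over the chosen experts' router scores. That is one convention among several (gating functions and expert counts vary by model), and `route`, `moe_layer`, the four stand-in experts, and the router scores are all invented for illustration.

```python
import math


def route(router_logits, top_k=2):
    """Pick the top_k experts for one token and normalize their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, top_k=2):
    """Run only the chosen experts and blend their outputs by gate weight."""
    gates = route(router_logits, top_k)
    out = [0.0] * len(x)
    for idx, gate in gates.items():
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Four stand-in experts; in a real model each is a full feed-forward network.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # would come from a learned router

print(route(router_logits))  # only experts 1 and 3 receive weight
print(moe_layer([1.0, -1.0], experts, router_logits))

# Share of parameters active per token, using the DeepSeek-V3 figures in the text.
print(round(37 / 671, 3))  # 0.055
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from.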
What separates one model from another sits in the weights: the data it trained on, how many layers and how wide it is, and the post-training that shapes its behavior. The wiring underneath barely changes from one model to the next.
This is also why a genuinely new design is hard: anything meant to replace the transformer still has to turn text into vectors, mix context across positions, and predict one token at a time, the same jobs the transformer already handles well.