Transformers and Attention: A Plain-English Guide for Engineers
If you build software today, you almost certainly use models built on the transformer architecture, often as a black box. This is a look inside, no heavy math required.
Pictures over equations, small numbers over notation. (Written while working through the Deep Atlas program.)
TLDR
- Each word becomes a row of numbers.
- The model moves and bends those numbers, step by step, to expose useful patterns.
- Attention lets every word adjust its meaning based on the other words around it.
- The final numbers become a probability for every possible next word.
- Pick a word, add it to the sentence, repeat.
Everything below is just zooming into one of those five steps. We will use a single running example the whole way through: the prompt "this too shall", which we are hoping the model completes with "pass".
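Here is the outer loop from the last two bullets as a rough sketch. The toy_model function is a hypothetical stand-in I made up: a tiny lookup table in place of a real network. It also always takes the most likely word (greedy picking), which is only one of several ways to choose.

```python
def toy_model(words):
    """Hypothetical stand-in for a real model: returns next-word probabilities."""
    table = {
        ("this", "too", "shall"): {"pass": 0.9, "fail": 0.1},
        ("this", "too", "shall", "pass"): {"<end>": 1.0},
    }
    return table[tuple(words)]


words = ["this", "too", "shall"]

while True:
    probs = toy_model(words)
    # Greedy choice: take the single most likely next word.
    next_word = max(probs, key=probs.get)
    if next_word == "<end>":
        break
    # Add the word to the sentence and go around again.
    words.append(next_word)

print(" ".join(words))  # this too shall pass
```

Everything interesting lives inside the function that produces the probabilities. The loop around it really is this small.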
First, words become numbers
Computers work on numbers, not words. So the first thing any language model does is turn each word into a row of numbers, called a vector. Stack those rows and you get a grid of numbers, which in ML we call a tensor. (Strictly, models split text into tokens, which are often subword pieces rather than whole words. I'll say "word" throughout for readability; swap in "token" wherever you want precision.)
These are ordinary decimal numbers, like 0.2 or -1.3, not the binary 0s and 1s the machine stores underneath. A word like "this" gets assigned its own row of decimals, and all the math from here runs on those.
Take "this too shall". Three words, so three rows. Say four numbers per word. That gives a tensor of shape 3×4, meaning three rows and four columns. The four is arbitrary here; real models use hundreds or thousands of numbers per word.
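A minimal sketch of that lookup in plain Python. The table and every number in it are invented for illustration. A real model learns its table during training and uses far wider rows.

```python
# Hypothetical embedding table: one row of four decimals per word.
# The values are made up; real tables are learned, not typed in.
embedding_table = {
    "this":  [0.2, -1.3, 0.7, 0.1],
    "too":   [-0.5, 0.4, 0.9, -0.2],
    "shall": [1.1, 0.3, -0.6, 0.8],
}

prompt = "this too shall".split()

# Look up each word's row and stack the rows into a grid.
tensor = [embedding_table[word] for word in prompt]

shape = (len(tensor), len(tensor[0]))
print(shape)  # (3, 4): three words, four numbers each
```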
Before the model runs its layers, those numbers are close to a raw lookup with no sense of context. The model's entire job, layer by layer, is to reshape them until the last row is a good answer to the question "what word comes next?" It helps to picture each row as a single point sitting in space. The rest of this guide is about how those points get moved.
A layer just moves points
A neural network is built from layers, and each layer does something plainer than it sounds: it takes a set of points and moves them to new positions.
Multiply your points by a grid of numbers (a weight matrix) and they land somewhere new. That operation is a linear projection. It has one important limit: it cannot bend anything. A linear projection can stretch, rotate, and shear the whole space, but straight lines stay straight and evenly spaced points stay evenly spaced.
Two things worth heading off here. Nothing gets cherry-picked: the same matrix applies to every point at once. And the numbers inside that matrix are not chosen by hand. They start random and get learned during training, nudged over and over toward values that move points somewhere useful (more on how, later).
There is a catch worth knowing early. If you only ever stack linear projections, stacking buys you nothing. Two of them, or ten, collapse into a single equivalent projection. The math of multiplying matrices guarantees it. So a "deep" pile of pure linear layers is no deeper than one layer. It is one move wearing a trench coat.
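You can check the collapse with small numbers. This is a sketch in plain Python. The points and both weight grids are made-up values, and matmul is a small helper written here, not a library call.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


points = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # three points in 2-D
layer1 = [[2.0, 0.0], [1.0, 1.0]]               # first linear projection
layer2 = [[0.0, 1.0], [-1.0, 3.0]]              # second linear projection

# Two layers, applied one after the other.
two_steps = matmul(matmul(points, layer1), layer2)

# Pre-multiply the two weight grids into one, then apply it once.
combined = matmul(layer1, layer2)
one_step = matmul(points, combined)

print(two_steps == one_step)  # True
```

The two-layer stack and the single combined matrix land every point in the same place. That is the trench coat.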
The bend that makes depth worth it
To make depth pay off, you add a nonlinearity between the linear moves. The classic one is ReLU, short for rectified linear unit, and it is about as simple as a rule gets: keep positive numbers, turn negatives into zero. No calculus needed to read it:
ReLU(2.3) = 2.3 → positive, left alone
ReLU(-1.7) = 0 → negative, snapped to zero
ReLU(0.4) = 0.4 → positive, left alone
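The same rule as a one-line Python function, in case code reads easier than notation:

```python
def relu(x):
    """Keep positive numbers, turn negatives into zero."""
    return max(0.0, x)


for value in (2.3, -1.7, 0.4):
    print(value, "->", relu(value))
```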
A good way to hold the whole pattern is three words: project, warp, project. The linear step slides points in straight lines. The nonlinearity bends the space, changing the distances between points in ways straight lines never could. That bend is the entire reason a deep network can learn things a shallow one cannot.
Here is why the bend earns its place. Without it, a layer can only re-describe the data, never genuinely reshape it. You get a different view of the same information. With it, the model can pull two points apart, push two others together, and actually change what the data means for the next layer. That is the gap between rearranging and learning.
Why ReLU and not something else? Nothing sacred about it. It zeros negatives, which is easy to picture and very cheap to compute, so it scales well. Other choices behave differently: GELU smooths the hard corner instead of snapping straight to zero, and gated variants like SwiGLU go further still. Most modern transformers actually use these rather than plain ReLU; I'm using ReLU here because it is the easiest to picture. The one firm requirement is that the function stay consistent, the same rule every time, so training has something stable to learn against. A random warp would give the rest of the network nothing to lean on.
One honest caveat about all this. Nobody can tell you, in plain English, what a specific layer is "doing". A few layers in, the numbers barely resemble words anymore. An entire research field, interpretability, exists to reverse-engineer what particular weights latched onto, and even its wins tend to be narrow: true for one model, on one input, at one moment. For our purposes the takeaway is clean. Each layer is given room to transform the data, and that freedom is what lets the model learn at all.
From numbers to a next-word guess
The stack of linear-and-bend layers we have been describing has a name: a multi-layer perceptron, or MLP. Here is the full path a prompt takes through a minimal model:
input tensor → MLP → language modeling head → softmax → probabilities → (sample)
Walking that left to right:
- input tensor: the rows of numbers for the words so far, exactly what we built above.
- MLP: the move-and-bend stack that reshapes those numbers.
- language modeling head: a final linear projection that turns each row into one raw score for every word in the vocabulary (often tens of thousands of words).
- softmax: squashes those raw scores into probabilities that add up to 100%, one per possible next word.
- probabilities → sample: pick a word from that distribution and add it to the sentence. Sampling sits in parentheses because, as we'll see, training stops just before it.
One note on softmax, since the precision matters: this step is sometimes loosely called "sigmoid". Sigmoid answers a single yes/no question. Softmax spreads one budget of 100% across many options, which is exactly what you want when choosing one word out of a whole vocabulary.
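A small sketch of the last two steps, softmax and then sampling, with sigmoid alongside for contrast. The four-word vocabulary and the raw scores are made up. Subtracting the largest score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Squash one score into a 0-to-1 answer to a single yes/no question."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores from the language modeling head.
vocab = ["pass", "fail", "be", "end"]
scores = [3.0, 0.5, 1.0, 0.2]

probs = softmax(scores)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")

# Sampling: pick one word, weighted by its probability.
rng = random.Random(0)  # seeded only so the sketch is repeatable
next_word = rng.choices(vocab, weights=probs, k=1)[0]
print("sampled:", next_word)
```

Softmax shares one budget across all four words. Running sigmoid on each score separately would give four independent yes/no answers that need not add up to anything.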
A few details worth pinning down:
Matrix shapes have to line up, and only the inner numbers need to match. A 3×4 input times a 4×8 weight gives you 3×8. The nonlinearity does not care about shape; it acts on one number at a time. Then you project back to the original shape so the next layer can pick up where this one left off. So 3×8 times 8×4 returns you to 3×4. In practice you rarely do this arithmetic by hand: you choose the layer sizes once, and frameworks like PyTorch raise an error if two shapes fail to line up. Worth understanding once so the shapes never surprise you.
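The shape walk above, traced in plain Python. The weight values are arbitrary fillers I chose so the code runs. Only the shapes matter here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shape(m):
    return (len(m), len(m[0]))


def relu_all(m):
    """Apply ReLU to every number; the shape is untouched."""
    return [[max(0.0, x) for x in row] for row in m]


# Arbitrary filler values: 3x4 input, 4x8 "up" weights, 8x4 "down" weights.
x = [[0.1 * (r + c) for c in range(4)] for r in range(3)]
w_up = [[0.5 if (r + c) % 2 == 0 else -0.5 for c in range(8)] for r in range(4)]
w_down = [[0.25] * 4 for _ in range(8)]

hidden = matmul(x, w_up)      # 3x4 times 4x8 gives 3x8
warped = relu_all(hidden)     # still 3x8
out = matmul(warped, w_down)  # 3x8 times 8x4 gives 3x4

print(shape(hidden), shape(warped), shape(out))  # (3, 8) (3, 8) (3, 4)
```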
A subtlety about training: you do not actually sample during it. Sampling, the act of picking a word, is random, and random steps cannot be differentiated, so there is no gradient to learn from. Training stops at the probability distribution and measures the error there.
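To make "measures the error there" concrete: the standard measure for language models is cross-entropy, the negative log of the probability the model gave to the correct word. This guide does not name a loss elsewhere, so treat the choice as my assumption about a typical setup. The probabilities below are made up.

```python
import math

vocab = ["pass", "fail", "be", "end"]
target = "pass"  # the word that actually came next in the training text


def cross_entropy(probs, target_word):
    """Negative log of the probability assigned to the correct word."""
    return -math.log(probs[vocab.index(target_word)])


# Two hypothetical model outputs for the prompt "this too shall".
confident_and_right = [0.70, 0.10, 0.15, 0.05]
confident_and_wrong = [0.05, 0.70, 0.15, 0.10]

loss_good = cross_entropy(confident_and_right, target)
loss_bad = cross_entropy(confident_and_wrong, target)

print(f"good guess loss: {loss_good:.3f}")
print(f"bad guess loss:  {loss_bad:.3f}")
```

No word is ever picked. The loss reads straight off the distribution, which is what keeps the whole chain differentiable.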
Training also ends for practical reasons. Either the model starts doing worse on data it has not seen (overfitting, a sign it is memorizing rather than learning), or you run out of compute budget. Usually one of those two.
Why a plain MLP wasn't enough
An MLP alone can, in principle, model language. The problem is efficiency.
Through the early 2010s, deep learning was transforming computer vision while language stubbornly resisted. You could pour compute into a pure MLP and it would still fail to converge on good language modeling. Learning every way words relate to each other, from scratch and with no help, is astronomically hard.
The reason is structural. In a plain setup, each word is processed in isolation. In "this too shall", nothing lets "shall" adjust the meaning of "this", or lets "this" shade "too". Yet language is almost entirely about context. The same word means different things depending on its neighbors.
So researchers at Google, starting with the translation team, reframed the question. Not "how do we make the MLP bigger" but "how do we make its job easier?" Their answer, introduced in the 2017 paper Attention Is All You Need (the work that put attention at the center of language models, though the idea itself is older), was to hand the MLP context-weighted embeddings: the same words, each one already adjusted by the words around it.
Attention: letting words shape each other
Here is the mental image that makes attention click. Picture each word as a star dropped into the night sky. At first they are scattered points with no relationship. Attention gives every star a gravitational pull on every other star. Heavier words tug harder, and the whole arrangement settles into a shape that reflects this particular sentence instead of random positions.
Concretely: if you can measure how much each word should pull on every other word, you can fold those pulls back into the original numbers. Now each word's row carries two things at once. What the word is, and how it relates to its neighbors. That second part is exactly what the MLP was missing.
So how do you record "how much every word affects every other word"? It is a graph. Each word a node, each influence an edge. The standard way to write a graph as numbers is an adjacency matrix, a grid where the cell at a given row and column holds the strength of one connection. The attention matrix is precisely that. The top-left cell is the effect of "this" on itself, a cell further along is the effect of one word on another, and so on across the grid.
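A sketch of the grid and the fold-back step, with made-up numbers throughout. I am reading the grid as: row is the word being adjusted, column is the word doing the pulling. Each row sums to 1 so a word's new numbers are a weighted blend of every word's numbers.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


words = ["this", "too", "shall"]

# Hypothetical two-number rows for each word.
embeddings = [
    [1.0, 0.0],  # this
    [0.0, 1.0],  # too
    [1.0, 1.0],  # shall
]

# grid[i][j]: how strongly word j pulls on word i (invented values).
grid = [
    [0.6, 0.1, 0.3],  # pulls on "this"
    [0.2, 0.5, 0.3],  # pulls on "too"
    [0.1, 0.2, 0.7],  # pulls on "shall"
]

# Fold the pulls back in: each new row is a weighted mix of all rows.
context = matmul(grid, embeddings)

for word, row in zip(words, context):
    print(word, [round(v, 2) for v in row])
```

The row for "this" started as [1.0, 0.0] and now carries a little of "too" and "shall" as well. That blend is the context-weighted embedding.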
Two flavors: causal and full attention
Production language models add one important constraint: they do not fill in the whole grid.
The reason is more than thrift. If the model is learning to predict the next word, it must never peek at words that come later, or it would be cheating during training. So causal attention blanks out the top-right triangle of the grid, every cell above the diagonal. The effect of "shall" on "this" is simply dropped, because "shall" comes later. Roughly half the grid falls away, which is cheaper as a bonus. The text-generating models you use day to day, GPT and Claude included, work this way.
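One common way to do the blanking, sketched with made-up scores: set every cell above the diagonal to negative infinity, then run softmax across each row. The masked cells come out as exactly zero and the remaining weights in each row still sum to 1. The row-is-adjusted, column-is-pulling layout is my reading of the grid.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["this", "too", "shall"]

# Hypothetical raw scores: scores[i][j] is how strongly word j pulls on word i.
scores = [
    [1.0, 2.0, 3.0],
    [0.5, 1.5, 2.5],
    [0.2, 0.4, 0.6],
]

# Causal mask: word i may only look at words 0..i, never at later ones.
masked = [
    [s if j <= i else -math.inf for j, s in enumerate(row)]
    for i, row in enumerate(scores)
]

weights = [softmax(row) for row in masked]

print(weights[0])  # [1.0, 0.0, 0.0]: "this" can only see itself
```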
The other flavor, full attention, fills in every cell, so later words can inform earlier ones. BERT and its relatives (such as RoBERTa) use it. It costs more but captures more meaning, which suits tasks like sentiment analysis, where you read a whole passage once to classify it rather than generating word by word. For left-to-right generation you can't use it anyway: it would let the model see the very future words it is supposed to predict.
Influence is not symmetric. In "Freddie Mercury's band Queen rocks", the words "band" and "rocks" pin down that "Queen" means the group, but "Queen" tells you little about "band". So the pull from "band" to "Queen" is strong, the reverse weak. Different numbers each direction, which is why one shared score per pair won't do.
Query, key, value: filling the grid
Where do the numbers in the grid actually come from? The same move ML uses everywhere. When you do not know how to compute something, you let a weight matrix learn it from data.
Each cell blends two things: how ambiguous a word is, and how much another word can resolve that ambiguity. Google's names for these are query and key.
- Query measures how much ambiguity a word carries (how much help it needs).
- Key measures how much a word can resolve other words (how much help it offers).
- Value is a third projection: the content a word actually hands over once the pull has been decided, which gives the model extra room to encode meaning.
Treat that ambiguity-and-resolution framing as intuition, not a literal definition. What the model actually learns are two projections whose dot product happens to score how strongly one word should attend to another. The "how ambiguous, how disambiguating" story is just a handle to grab it by.
The model projects each word into a query vector and a key vector. To score a pair, it takes the dot product of one word's query and another's key. The dot product is the engine of attention, so it is worth seeing once, with small numbers:
query for "queen" = [0.2, 0.9] (very ambiguous, needs help)
key for "band" = [0.8, 0.7] (strong at disambiguating)
score = 0.2×0.8 + 0.9×0.7
= 0.16 + 0.63
= 0.79 → high pull: "band" strongly shapes "queen"
Multiply matching positions, add the results, and a single number falls out. That number is the strength of one connection in the grid. Real models use hundreds of numbers per word instead of two, but the operation is identical: pair them up, multiply, add.
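The same arithmetic in Python, using the illustrative query and key from the example above:

```python
def dot(a, b):
    """Pair up matching positions, multiply, add."""
    return sum(x * y for x, y in zip(a, b))


query_queen = [0.2, 0.9]  # illustrative values from the worked example
key_band = [0.8, 0.7]

score = dot(query_queen, key_band)
print(round(score, 2))  # 0.79
```

The function does not care how long the vectors are, which is why the two-number toy and a real model with hundreds of numbers per word share one operation.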
Those exact values are illustrative. In a real model the query vector comes from the word's embedding multiplied by the learned query weight matrix, and the key vector from the same embedding times the learned key weight matrix. The numbers settle during training and are not human-readable; the "ambiguous" and "disambiguating" labels above are just my gloss to make the example land.
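A sketch of where query and key vectors come from, and why the score differs by direction. Every number here is invented, including both weight matrices, which a real model would learn. The names w_query and w_key are mine.

```python
def project(vector, weights):
    """Row vector times weight matrix."""
    return [sum(v * w for v, w in zip(vector, col)) for col in zip(*weights)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings and weight matrices (learned in a real model).
embeddings = {"queen": [1.0, 0.5], "band": [0.3, 1.0]}
w_query = [[0.5, -0.2], [0.1, 0.9]]
w_key = [[0.7, 0.3], [-0.4, 0.8]]

queries = {word: project(vec, w_query) for word, vec in embeddings.items()}
keys = {word: project(vec, w_key) for word, vec in embeddings.items()}

# Score for "band" shaping "queen": queen's query against band's key.
band_on_queen = dot(queries["queen"], keys["band"])
# Score for "queen" shaping "band": band's query against queen's key.
queen_on_band = dot(queries["band"], keys["queen"])

print(round(band_on_queen, 3), round(queen_on_band, 3))
```

Because each word gets a separate query and key, the two directions produce two different numbers. With these random-looking weights the sizes mean nothing. Training is what makes the useful direction the strong one.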
Value is the third projection. It is technically optional, and researchers have built working transformers without it (see Simplifying Transformer Blocks), but it gives the model extra room to encode meaning. The broad pattern in deep learning holds here: more well-trained parameters usually means a more expressive model, so value almost always stays.
Putting it together: one transformer layer
Now the pieces connect. The attention step produces context-weighted embeddings, the gravity-adjusted version of the input. Those feed into the MLP, which reshapes them further. Attention head plus MLP is one transformer layer. That is the repeating unit the whole architecture is built from. (Real layers also wrap each part with a residual connection, which adds the input back to the output, and a normalization step. I'm leaving those off the diagram to keep the shape clear, but they matter a lot for training stability.)
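Here is one layer end to end as a rough sketch: a single causal attention head, then an MLP, each with a residual connection. Assumptions to flag: all weights are made up, normalization is left out for brevity, and the division by the square root of the vector width is a detail from Attention Is All You Need that this guide has not covered.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add(a, b):
    """Element-wise sum of two same-shaped grids (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention_head(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    width = len(q[0])
    scores = matmul(q, transpose(k))  # every query against every key
    weights = []
    for i, row in enumerate(scores):
        # Scale, then hide later words (causal mask) before softmax.
        masked = [s / math.sqrt(width) if j <= i else -math.inf
                  for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v)  # blend the value rows by attention weight


def mlp(x, w_up, w_down):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w_up)]
    return matmul(hidden, w_down)  # project, warp, project


def transformer_layer(x, params):
    x = add(x, attention_head(x, params["w_q"], params["w_k"], params["w_v"]))
    x = add(x, mlp(x, params["w_up"], params["w_down"]))
    return x


# Hypothetical weights and a 3x2 input (three words, two numbers each).
params = {
    "w_q": [[0.5, 0.1], [-0.3, 0.8]],
    "w_k": [[0.2, -0.4], [0.9, 0.3]],
    "w_v": [[1.0, 0.5], [0.0, -1.0]],
    "w_up": [[0.6, -0.2, 0.1, 0.4], [0.3, 0.7, -0.5, 0.2]],
    "w_down": [[0.5, 0.1], [-0.2, 0.3], [0.4, -0.6], [0.1, 0.2]],
}
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = transformer_layer(x, params)
print(len(out), len(out[0]))  # 3 2: same shape in, same shape out
```

Same shape in and out is the point. It is what lets you stack these layers without any glue between them.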
And the unit scales two ways:
You can run several attention heads in parallel, each free to discover a different kind of relationship, then merge their results into one embedding. You can also stack transformer layers in sequence, each one sharpening the meaning a little more. More heads, more layers. Both are levers for a bigger, more capable model.
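A sketch of the merge step for two heads. The head outputs are written in by hand as made-up numbers, standing in for what two real attention heads would produce. The usual approach, and the one assumed here, is to lay the outputs side by side and apply one more learned projection, which I call w_out.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical outputs of two heads for three words, two numbers each.
head_a = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
head_b = [[0.3, 0.3], [0.7, 0.0], [0.5, 0.5]]

# Merge: lay each word's two results side by side (3x2 and 3x2 become 3x4).
merged = [row_a + row_b for row_a, row_b in zip(head_a, head_b)]

# Hypothetical learned output projection (4x4) that mixes the heads together.
w_out = [
    [0.5, 0.0, 0.1, 0.0],
    [0.0, 0.5, 0.0, 0.1],
    [0.1, 0.0, 0.5, 0.0],
    [0.0, 0.1, 0.0, 0.5],
]

combined = matmul(merged, w_out)
print(len(merged), len(merged[0]))      # 3 4
print(len(combined), len(combined[0]))  # 3 4
```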
This explains a question engineers often ask. If every model trains toward the right answer on similar data, why are Claude, Gemini, and GPT different? Two reasons. Architecture varies, with different numbers of heads and layers, and sometimes different building blocks entirely (such as Mamba layers). And training starts from random initial weights, so even identical data produces a different model each run. Comparable ability, completely different numbers under the hood.
It all trains as one piece. The loss compares the model's predictions against the expected next word at every position at once, then flows backward through everything: the output head, the MLP, and the query, key, and value matrices. There is no separate stage for "training attention". The whole thing is one differentiable chain.
The idea worth keeping
If you remember one thing, make it this: any set of numbers can represent an idea. A row of numbers can stand for a word. Another row can stand for a probability across every possible next word. The model never "knows" words in any human sense. It shuffles numbers, and when the loss is set up well, those numbers come to mean something useful. The specific values do not matter. What matters is that they represent something, consistently. That reframe is what turns the architecture from magic into machinery.
Why this matters in practice
This is not only theory. The architecture's shape leaks into daily work with these models.
Because attention has structure, it has biases. A well-documented one is a U-shaped pattern in how models use long contexts: they use information best when it sits near the beginning or end of the context, and worst when it sits in the middle (the Lost in the Middle study measured this as a drop in accuracy on long-context tasks). That is not folklore; it is a measured effect, generally attributed to how attention and training interact. It is why the placement of an instruction in a long prompt matters, why a key constraint buried mid-context can get quietly ignored, and why system prompts sit where they do.
Two more facts worth carrying. The attention grid is computed from the input every time, not looked up from a fixed table of word relationships. (Models do reuse keys and values step to step while generating one response, the so-called KV cache, but that is reuse within a single sequence, not a stored dictionary.) And the model is fed each word's position, not just its identity, through positional encodings, which is how it knows order and not only content.
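One concrete scheme for positional encodings is the sinusoidal version from Attention Is All You Need, sketched below. This guide does not say which scheme a given model uses, and many modern models use other approaches, so take this as one example only. The embedding values are made up.

```python
import math


def positional_encoding(position, width):
    """Sinusoidal position row: sine and cosine pairs at decreasing frequencies."""
    row = []
    for i in range(0, width, 2):
        angle = position / (10000 ** (i / width))
        row.append(math.sin(angle))
        row.append(math.cos(angle))
    return row[:width]


# Hypothetical embedding for one word, placed at two different positions.
too = [-0.5, 0.4, 0.9, -0.2]


def with_position(embedding, position):
    """Add the position row to the word row, number by number."""
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


first = with_position(too, 0)
second = with_position(too, 1)

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(first != second)            # True: same word, different position
```

The same word at two positions now enters the model as two different rows, which is how order becomes visible to attention at all.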
Closing
Transformers look intimidating from the outside and turn out to be a small set of ideas stacked with discipline. Move the numbers. Bend them. Let words shape each other. Predict the next word. Repeat. If you read this far, you can now picture what is happening inside the tools you reach for every day.
If you think I have framed something loosely, I would genuinely like to hear it.
Quick glossary
A few terms in plain English, for reference:
- Vector / embedding: a row of numbers that stands in for a word (or, later, a word in context).
- Tensor: a grid of numbers with any number of dimensions; here, words stacked into rows (two dimensions).
- Weights: the model's learned numbers. Anything that is not data flowing through is a weight.
- Linear projection: multiplying by a weight matrix to move points, using straight-line moves only.
- Nonlinearity (ReLU): the bend between linear steps. ReLU turns negatives into zero.
- MLP: multi-layer perceptron, the stack of linear-and-bend layers.
- Softmax: turns raw scores into percentages that sum to 100%, one per possible word.
- Attention: the mechanism that lets each word adjust based on the others.
- Query / key / value: learned projections that decide how strongly words influence each other.
- Causal vs full attention: past-only (for generation) versus all-directions (for classification).
- Transformer layer: an attention step (one or more heads) plus one MLP. Stack many to build a model.
References and further reading
The concepts above trace back to a handful of papers. If you want to go deeper than this guide:
- Vaswani et al., Attention Is All You Need (2017). Introduced the transformer architecture and put attention at its center. The attention idea itself is older.
- Devlin et al., BERT (2018) and Liu et al., RoBERTa (2019). Full-attention models built for understanding rather than generation.
- He and Hofmann, Simplifying Transformer Blocks (2023). Shows you can strip parts of the standard block, including value and projection parameters, and still train well.
- Gu and Dao, Mamba (2023). An alternative to attention that some models mix in as a building block.
- Liu et al., Lost in the Middle (2023). Measures how accuracy changes with the position of relevant information in a long context, and finds a U-shaped curve.
- Hendrycks and Gimpel, GELU (2016) and Shazeer, GLU Variants Improve Transformer (2020). Nonlinearities used in place of ReLU.