The Neuron
A neuron is a tiny calculator. It takes in numbers, multiplies them by weights, adds them up, and passes the result through one simple function. Billions of these, wired together, produce intelligence.
Analogy first
The Judge
Weighs multiple inputs, assigns each an importance, delivers one verdict.
Volume Dial
Each input connection has a dial — the weight. Turn it up, that input matters more.
Traffic Light
After summing, the neuron decides: pass this signal on, or stop it here?
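The three analogies map onto a few lines of arithmetic. The sketch below is a minimal illustration, assuming ReLU as the "one simple function"; the input and weight values are made up for the example.

```python
def relu(x: float) -> float:
    """The traffic light: pass positive signals on, stop negative ones."""
    return max(0.0, x)


def neuron(inputs: list[float], weights: list[float]) -> float:
    """The judge: weigh each input by its dial setting, sum, give one verdict."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return relu(total)


# Illustrative numbers only: three inputs, three volume dials.
inputs = [1.0, 0.5, 2.0]
weights = [0.9, -0.7, 0.0]

output = neuron(inputs, weights)
print(output)
```

The third weight is zero, so the third input has no effect on the result no matter how large it is.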
Weights
A weight is a single number on a connection between two neurons. It controls how much that input matters. Everything a model has learned — every fact, pattern, preference — is encoded in billions of these numbers.
If the network is a city, weights are road widths. A wide road lets lots of signal through. A zero-width road blocks it. A road running backwards (negative weight) reverses the signal.
Three kinds of weight
Positive (+0.9)
Amplifies. If this input grows, the neuron fires more strongly. "love" → boosts positive sentiment.
Negative (−0.7)
Suppresses. If this input grows, the output shrinks. "not" → dampens positivity.
Zero (0.0)
This connection is severed. Training disconnected it — irrelevant for this task.
Weights form a matrix
All weights between two layers live in a grid, with one cell per connection.
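As a rough sketch, the grid can be stored as a list of rows. The layout chosen here (one row per output neuron, one column per input) is an assumption for illustration, and the numbers are invented.

```python
def layer_forward(weight_matrix: list[list[float]], inputs: list[float]) -> list[float]:
    """Compute one weighted sum per output neuron (per row of the grid)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_matrix]


# 2 output neurons fed by 3 inputs -> a 2x3 grid, 6 connections in total.
W = [
    [0.9, -0.7, 0.0],   # weights into output neuron 0
    [0.2, 0.4, 0.1],    # weights into output neuron 1
]
x = [1.0, 1.0, 2.0]

sums = layer_forward(W, x)
connection_count = sum(len(row) for row in W)
print(sums, connection_count)
```

With this layout, the cell `W[0][1]` is the weight on the connection from input 1 to output neuron 0.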
The scale problem
Activation Functions
One of the least-understood steps in a neural network, and one of the most important. Without activation functions, a 100-layer network is mathematically equivalent to a single linear layer.
The collapse problem
Layer 2 output = W₂ × (W₁ × input)
= (W₂ × W₁) × input
= W_combined × input
Without activation: no matter how many layers you add, the whole network is just one big matrix multiplication. It can only draw straight lines (or flat planes) through data, so it cannot learn curves or any relationship that is not linear. That rules out most real-world problems.
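The collapse can be checked numerically. The sketch below uses two small made-up 2×2 matrices and compares running the layers one after another with running the single combined matrix. It then inserts a ReLU between the layers to show the equivalence break.

```python
def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def relu_vec(v):
    return [max(0.0, x) for x in v]


# Illustrative weights and input.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [3.0, 0.0]]
x = [1.0, 1.0]

two_layers = matvec(W2, matvec(W1, x))          # W2 x (W1 x input)
one_layer = matvec(matmul(W2, W1), x)           # W_combined x input
with_relu = matvec(W2, relu_vec(matvec(W1, x))) # activation in between

print(two_layers, one_layer, with_relu)
```

The first two results match, while the third differs because ReLU zeroed a negative intermediate value that the combined matrix cannot account for.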
Without Activation
Can only separate data with a straight line — like trying to separate cats from dogs using only a ruler.
With Activation
Can approximate curves, spirals, and other complex shapes, discovering patterns no straight line could capture.
ReLU
Rectified Linear Unit. Three words. Arguably the most influential activation function in the history of deep learning, and genuinely the simplest idea.
The complete formula
ReLU(x) = max(0, x). Positive → keep it. Negative → zero.
Positive input
ReLU(0.81) = 0.81
ReLU(3.2) = 3.2
ReLU(0.001) = 0.001
Passes through unchanged.
Negative input
ReLU(−0.5) = 0
ReLU(−9.3) = 0
ReLU(−0.001) = 0
Killed. The neuron goes silent.
Why ReLU won
⚡ Speed
max(0, x) is a single comparison, which most processors execute in one instruction. Sigmoid requires computing an exponential, e⁻ˣ. With billions of neurons and millions of training steps, that difference adds up.
🔀 Sparsity
A large share of neurons, often around half, output zero for any given input. This sparse activation is efficient: neurons specialise rather than all doing the same thing.
📈 Gradient flow
For positive values, the gradient is exactly 1, so it doesn't shrink on the way back. Sigmoid's gradient is at most 0.25, so across many layers it shrinks toward zero and training stalls. ReLU largely avoids this, which helped make deep networks trainable.
The Dying ReLU problem: A neuron always receiving negative input outputs 0 forever — its weights never get a gradient. It "dies." Leaky ReLU fixes this: negatives become 0.01x instead of 0, keeping a tiny gradient trickle alive.
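The gradient argument and the Leaky ReLU fix can be sketched in a few lines. This is a simplification: it multiplies only the activation gradients across 20 layers and ignores the weights, which is an assumption made to isolate the effect. The 0.01 slope comes from the text above.

```python
import math


def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Negatives become slope * x instead of 0."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x: float, slope: float = 0.01) -> float:
    return 1.0 if x > 0 else slope


def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


depth = 20  # illustrative number of layers
relu_chain = relu_grad(2.0) ** depth        # gradient survives intact
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid, still tiny

dead_gradient = relu_grad(-9.3)         # a neuron stuck on negative input
leaky_gradient = leaky_relu_grad(-9.3)  # the trickle that keeps it alive
print(relu_chain, sigmoid_chain, dead_gradient, leaky_gradient)
```

Even at its steepest point the sigmoid gradient is 0.25, so twenty of them multiplied together leave almost nothing, while the ReLU chain stays at 1.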
The Full Network
A neural network is layers of neurons. Each layer's outputs feed into the next layer's inputs. Information flows forward. Error signals flow backward. Learning happens through repetition.
Credit risk — a real example
Let's trace "loan overdue — 90 days" through a 3-layer network.
[Network diagram: neuron values 1.0, 0.9, 0.6, 0.87, 0.74, 0.0, 0.91, 0.0; output: RISK 91%]
What training does
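# Before training: all weights are random
weights = [0.1, -0.3, 0.7]      # illustrative; a real model has ~70 billion of these
gradients = [0.5, -0.2, 0.9]    # illustrative values: how much each weight moves the prediction
learning_rate = 0.01

# Model predicts LOW RISK for a 90-day overdue loan
prediction = 0.12               # wrong, should be ~0.9
error = 0.9 - prediction        # = 0.78 too low

# Backpropagation: push the error signal backwards
for i, gradient in enumerate(gradients):
    weights[i] += learning_rate * error * gradient   # tiny nudge

# Repeat billions of times → weights encode "risky"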
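To see the nudge-and-repeat loop actually converge, here is a minimal sketch on a single neuron with one weight. It is plain gradient descent rather than full backpropagation through many layers. The sigmoid output, the input encoding, the starting weight, and the learning rate are all assumptions chosen so the first prediction lands near 0.12.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


x = 1.0              # hypothetical input: 1.0 means "loan is 90 days overdue"
target = 0.9         # the risk score we want
w = -2.0             # starting weight, chosen so the first prediction is ~0.12
learning_rate = 1.0

first_prediction = sigmoid(w * x)

for step in range(2000):
    prediction = sigmoid(w * x)
    error = target - prediction
    # Chain rule: how much the prediction changes when w changes.
    gradient = prediction * (1.0 - prediction) * x
    w += learning_rate * error * gradient   # tiny nudge toward the target

final_prediction = sigmoid(w * x)
print(round(first_prediction, 2), round(final_prediction, 2))
```

Each step moves the weight only slightly; the repetition is what carries the prediction from low risk to high risk.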
Inference
Training is finished. Weights are frozen. You type a prompt — the model runs one forward pass per token, using those frozen weights to calculate what comes next.
A prompt, token by token
Weights shape every output
Changing these word-weights changes the sentiment score directly.
Set w_love to −1 and the model "forgets" that love is positive.
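A toy version of this experiment is sketched below. It assumes a bag-of-words scorer that simply adds up one weight per word, which is far simpler than a real model. The weights for "love" and "not" reuse the +0.9 and −0.7 values from the Weights section; the sentence and the helper name `sentiment` are invented for illustration.

```python
def sentiment(text: str, weights: dict[str, float]) -> float:
    """Sum the weight of every word; unknown words count as 0."""
    return sum(weights.get(word, 0.0) for word in text.lower().split())


word_weights = {"love": 0.9, "not": -0.7}
sentence = "I love this product"

score_before = sentiment(sentence, word_weights)

# Set w_love to -1: same sentence, same code, opposite verdict.
flipped_weights = dict(word_weights, love=-1.0)
score_after = sentiment(sentence, flipped_weights)

print(score_before, score_after)
```

Nothing about the input changed between the two calls; only the frozen number attached to "love" did.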
Next-token probabilities
After "The BNPL repayment is", the model scores every token in its 50,000-token vocabulary and converts the scores into probabilities; the five highest are the most likely next tokens.
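Turning raw scores into ranked probabilities is commonly done with a softmax. The sketch below assumes that approach and uses six hypothetical tokens with invented scores in place of a full vocabulary; none of the numbers come from a real model.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical raw scores for a handful of candidate next tokens.
logits = {
    "due": 3.1,
    "late": 2.4,
    "overdue": 2.2,
    "paid": 1.0,
    "pending": 0.7,
    "banana": -4.0,
}

probs = softmax(logits)
top5 = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:5]

for token, p in top5:
    print(f"{token}: {p:.1%}")
```

An unlikely token such as "banana" still receives a small nonzero probability; it simply never makes the top of the list.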
The Fine-Tuning Problem
A model trained on the internet knows everything and specialises in nothing. To make it great at BNPL compliance, medical notes, or legal drafting, you need to fine-tune it. The problem is the cost.
Full fine-tuning — brute force
Llama 70B at 2 bytes/weight = 140GB just to store weights. Add gradient memory (another 140GB) and optimiser states (2 × 140GB = 280GB) and you reach 560GB, or roughly 600GB of GPU RAM once activations are counted. An A100 80GB GPU costs ~$15,000. You'd need 8 of them. For one fine-tuning run.
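The memory estimate is simple arithmetic and can be reproduced as below. The sketch follows the text's own assumptions (2 bytes per weight, gradients the same size as the weights, optimiser states twice that) and leaves out activation memory, which is why its total comes in under the rounded ~600GB figure.

```python
import math

params = 70_000_000_000      # Llama 70B
bytes_per_weight = 2
gpu_memory_gb = 80           # one A100 80GB

weights_gb = params * bytes_per_weight // 10**9
gradients_gb = weights_gb            # one gradient per weight
optimizer_gb = 2 * weights_gb        # optimiser states (x2)

total_gb = weights_gb + gradients_gb + optimizer_gb
gpus_without_headroom = math.ceil(total_gb / gpu_memory_gb)

print(weights_gb, gradients_gb, optimizer_gb, total_gb, gpus_without_headroom)
```

The count printed here leaves no spare memory for activations or batch data, so in practice the next GPU up is needed.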
The insight that changed everything
When you fine-tune a model, most weights barely move. The useful update ΔW lives in a tiny mathematical subspace — it is low rank. You don't need to update all 70 billion weights. You just need to approximate the update.
What "low rank" means
A 4×4 matrix update needs 16 numbers. But if the update has a pattern — a structure — you can describe it with far fewer numbers using two small matrices multiplied together.
16 numbers → approximated by 4+4 = 8 numbers. That's rank 1. In real LoRA with d=4096, r=8: a 4096×4096 update has about 16.8 million numbers, while the two small matrices together have 2 × 4096 × 8 = 65,536. A 256× reduction in parameters to train.
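Both the 4×4 case and the d=4096 count can be worked through directly. In the sketch below the two length-4 vectors are made-up values; their outer product is a full 4×4 matrix built from only 8 numbers, which is what rank 1 means.

```python
def outer(column: list[float], row: list[float]) -> list[list[float]]:
    """Outer product: every entry is one column value times one row value."""
    return [[c * r for r in row] for c in column]


# Rank 1: two vectors of 4 numbers each describe all 16 entries.
b = [1.0, 2.0, 0.0, -1.0]
a = [0.5, 0.0, 1.0, 2.0]
delta_W = outer(b, a)
entries = sum(len(row) for row in delta_W)


def full_params(d: int) -> int:
    return d * d          # every cell of a d x d update


def low_rank_params(d: int, r: int) -> int:
    return 2 * d * r      # one d x r matrix plus one r x d matrix


full = full_params(4096)
low = low_rank_params(4096, 8)
print(entries, full, low, full // low)
```

Raising r buys a more expressive update at a linear cost in parameters, since the count grows as 2 × d × r.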
LoRA
Low-Rank Adaptation. Instead of updating the giant weight matrix W, you freeze it completely and learn two tiny matrices A and B. Their product approximates the update. W never changes.
The architecture
[Diagram: input → two parallel paths → output. Path 1: the pretrained W, FROZEN. Path 2: A squeezes d→r, then B expands r→d. Both paths are added to give the output.]
Rank — the key trade-off
Why B starts at zero
A → random Gaussian
B → all zeros
∴ B×A = 0 at step 0
LoRA starts out behaving exactly like the base model.
No personality shift — it diverges gradually as B learns.
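The two-path forward pass and the zero initialisation can be sketched with tiny matrices. The sizes (d=4, r=2), the random seed, and the manual edit to B standing in for a training step are all assumptions for illustration. Practical LoRA implementations typically also scale the side path by a constant, which is omitted here.

```python
import random


def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


random.seed(0)
d, r = 4, 2

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # pretrained, frozen
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # r x d, random Gaussian
B = [[0.0] * r for _ in range(d)]                               # d x r, all zeros


def lora_forward(x):
    frozen_path = matvec(W, x)             # W never changes
    side_path = matvec(B, matvec(A, x))    # squeeze d -> r, expand r -> d
    return [f + s for f, s in zip(frozen_path, side_path)]


x = [1.0, -2.0, 0.5, 3.0]
base_output = matvec(W, x)
output_at_step_0 = lora_forward(x)

B[0][0] = 0.5   # stand-in for B having learned something
output_after_update = lora_forward(x)

print(output_at_step_0 == base_output, output_after_update == base_output)
```

Because W is untouched, removing A and B restores the base model exactly, which is what makes the adapter reversible.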
LoRA Variants
Researchers kept asking the same question: can we push this further? Each variant squeezes more efficiency out of the low-rank idea, trading something different each time.
Compare All Techniques
Full comparison
| Technique | Trained | W updated? | Key idea | Best for |
|---|---|---|---|---|
| LoRA | A+B per layer | ✗ | Low-rank side path | General purpose |
| LoRA-FA | B only per layer | ✗ | Freeze A, save memory | Memory constrained |
| VeRA | 2 vectors per layer | ✗ | Share A,B globally | Extreme efficiency |
| Delta-LoRA | A+B+W | ✓ | Feed delta into W | Complex tasks |
| LoRA+ | A+B per layer | ✗ | B learns faster | Free upgrade |
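The "Trained" column can be turned into rough per-layer counts. The sketch below assumes one square d×d layer with d=4096 and r=8, and assumes VeRA's two trained vectors have lengths d and r. Delta-LoRA is left out because it also updates W, so its count is not comparable in the same way.

```python
d, r = 4096, 8   # illustrative layer width and rank

trainable_per_layer = {
    "LoRA": 2 * d * r,    # A and B
    "LoRA-FA": d * r,     # B only, A stays frozen
    "VeRA": d + r,        # two scaling vectors (assumed lengths d and r)
    "LoRA+": 2 * d * r,   # same matrices as LoRA, B gets a higher learning rate
}

full_update = d * d

for name, count in trainable_per_layer.items():
    print(f"{name}: {count:,} trainable ({full_update // count}x fewer than full)")
```

LoRA+ lands on the same count as LoRA because it changes how the matrices are trained, not how many there are.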
One-line mental models
For BNPL/compliance domain adaptation: LoRA+ on a 7–8B model is the practical starting point — cheap, reversible, swappable per use case.