Deep Dive · 01

The Neuron

A neuron is a tiny calculator. It takes in numbers, multiplies them by weights, adds them up, and passes the result through one simple function. Billions of these, wired together, produce behaviour that looks intelligent.

Analogy first

🧑‍⚖️

The Judge

Weighs multiple inputs, assigns each an importance, delivers one verdict.

🎚️

Volume Dial

Each input connection has a dial — the weight. Turn it up, that input matters more.

🚦

Traffic Light

After summing, the neuron decides: pass this signal on, or stop it here?

Three parts

Neuron anatomy:

INPUTS × WEIGHTS: x₁ = 0.8 (w = 0.6) · x₂ = 0.3 (w = −0.4) · x₃ = 0.5 (w = 0.9)
→ WEIGHTED SUM: Σ
→ ACTIVATION: f( )
→ OUTPUT: h = 0.81 → next layer

Step by step

Receive inputs

x₁=0.8, x₂=0.3, x₃=0.5. These could be pixel values, word vectors, or outputs from the previous layer. They are always just numbers.

Multiply by weights

0.8×0.6 = 0.48 · 0.3×(−0.4) = −0.12 · 0.5×0.9 = 0.45

Sum everything (Σ)

0.48 + (−0.12) + 0.45 = 0.81. This is the dot product of the inputs and the weights.

Apply activation

ReLU(0.81) = max(0, 0.81) = 0.81. If the sum were −0.3, ReLU would output 0 and the neuron would go silent.

Fire to the next layer

0.81 becomes one input to every neuron in the next layer. The chain continues until the final answer emerges.
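The five steps above can be sketched in a few lines of Python. This is a minimal illustration using the same inputs and weights as the walkthrough; the function names `relu` and `neuron` are made up for this example, and there is no bias term because the walkthrough does not use one.

```python
def relu(z: float) -> float:
    """ReLU activation: keep positive values, turn negatives into 0."""
    return max(0.0, z)


def neuron(inputs: list[float], weights: list[float]) -> float:
    """Multiply each input by its weight, sum the products, apply ReLU."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return relu(weighted_sum)


inputs = [0.8, 0.3, 0.5]    # x1, x2, x3 from the walkthrough
weights = [0.6, -0.4, 0.9]  # one weight per input

output = neuron(inputs, weights)
print(round(output, 2))  # 0.81
```

Flipping the sign of the largest weight is a quick way to see the neuron go silent: a negative sum comes out of ReLU as 0.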


Deep Dive · 02

Weights

A weight is a single number on a connection between two neurons. It controls how much that input matters. Everything a model has learned — every fact, pattern, preference — is encoded in billions of these numbers.

🏙️

If the network is a city, weights are road widths. A wide road lets lots of signal through. A zero-width road blocks it. A road running backwards (negative weight) reverses the signal.

Three kinds of weight

Positive (+0.9)

Amplifies. If this input grows, the neuron fires more strongly. "love" → boosts positive sentiment.

Negative (−0.7)

Suppresses. If this input grows, the output shrinks. "not" → dampens positivity.

Zero (0.0)

This connection is severed. Training disconnected it — irrelevant for this task.

Live weight explorer

On the original page, sliders set each weight and the weighted sum and its ReLU output update instantly. Things to try: set w_love to −1 · set all weights to 0 · set all weights to +1.

Weights form a matrix

All weights between two layers live in a grid, with one cell per connection between an input and a neuron.
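To make the grid concrete, here is a small sketch of a whole layer computed from a weight matrix. The weight values are hypothetical, and the layout (one row per neuron, one column per input) is an assumption chosen for this example; only the first row reuses the weights from the single-neuron walkthrough.

```python
def layer_forward(weight_matrix, inputs):
    """Each row holds one neuron's weights; output is ReLU(row . inputs)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)))
        for row in weight_matrix
    ]


# Hypothetical toy layer: 4 neurons (rows) x 3 inputs (columns) = 12 weights.
W = [
    [0.6, -0.4, 0.9],
    [0.1, 0.2, -0.5],
    [-0.3, 0.8, 0.2],
    [0.5, 0.5, 0.5],
]
x = [0.8, 0.3, 0.5]

h = layer_forward(W, x)
num_weights = sum(len(row) for row in W)

print(num_weights)                # 12
print([round(v, 2) for v in h])   # one output per neuron
```

The second neuron's weighted sum is negative, so ReLU zeroes it out while the other three pass a signal on.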

The scale problem

Layer	Weights
3×4 toy layer	12
1024×1024	~1M
4096×4096 (LLM)	~16.8M
GPT-4 total	~1.8T (unconfirmed estimate)

Deep Dive · 03

Activation Functions

One of the least-understood steps in a neural network, and one of the most important. Without activation functions, a 100-layer network is mathematically equivalent to a single linear layer.

The collapse problem

Layer 2 output = W₂ × (W₁ × input)
= (W₂ × W₁) × input
= W_combined × input

🚨

Without activation: no matter how many layers you add, the whole network is just one big matrix multiplication. It can only fit straight-line (linear) relationships through data; it cannot learn curves or any pattern that is not linear. That rules out most real-world tasks.
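The collapse can be checked numerically. The sketch below uses two hypothetical 2×2 weight matrices and plain Python lists; the helper names `matmul`, `matvec` and `relu_vec` are made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu_vec(v):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, z) for z in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]   # hypothetical layer 1 weights
W2 = [[1.0, 1.0], [-1.0, 2.0]]   # hypothetical layer 2 weights
x = [1.0, 1.0]

two_layers = matvec(W2, matvec(W1, x))           # W2 x (W1 x input)
one_layer = matvec(matmul(W2, W1), x)            # (W2 x W1) x input
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # ReLU between the layers

print(two_layers, one_layer)  # [0.5, 4.0] [0.5, 4.0]
print(with_relu)              # [1.5, 3.0]
```

Without the activation, the two layers give exactly what the single combined matrix gives. With ReLU in between, the result changes, so the stack can no longer be replaced by one multiplication.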

📏

Without Activation

Can only separate data with a straight line — like trying to separate cats from dogs using only a ruler.

🌊

With Activation

Can approximate curves, spirals, and other non-linear shapes, discovering patterns no straight line could capture.

Visualise different functions

ReLU · Sigmoid · Tanh · GELU · Leaky ReLU

Same input, five functions

Each function responds differently to the same number, for example z = 1.5.
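To compare the five functions on one input without the interactive plot, here is a minimal sketch using only Python's `math` module. The Leaky ReLU slope of 0.01 and the exact (erf-based) form of GELU are assumptions for this example; libraries sometimes use other slopes or a tanh approximation of GELU.

```python
import math


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but negatives keep a small slope instead of becoming 0."""
    return z if z > 0 else slope * z


def sigmoid(z):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any input into the range (-1, 1)."""
    return math.tanh(z)


def gelu(z):
    """Exact GELU: z times the standard normal CDF at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


functions = {"ReLU": relu, "Sigmoid": sigmoid, "Tanh": tanh,
             "GELU": gelu, "Leaky ReLU": leaky_relu}

for z in (1.5, -1.5):
    for name, fn in functions.items():
        print(f"{name:>10}({z:+.1f}) = {fn(z):+.4f}")
```

At z = 1.5, ReLU and Leaky ReLU pass the value through unchanged, while sigmoid and tanh squash it below 1. At z = −1.5 the differences are sharper: ReLU gives 0, Leaky ReLU and GELU give small negative values.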

Deep Dive · 04

ReLU

Rectified Linear Unit. Three words. Arguably the most influential activation function in the history of deep learning, and genuinely the simplest idea.

The complete formula

ReLU(x) = max(0, x)

Positive → keep it. Negative → zero.

Positive input

ReLU(0.81) = 0.81
ReLU(3.2) = 3.2
ReLU(0.001) = 0.001

Passes through unchanged.

Negative input

ReLU(−0.5) = 0
ReLU(−9.3) = 0
ReLU(−0.001) = 0

Killed. The neuron goes silent.


Why ReLU won

⚡ Speed

max(0, x) is a single comparison, which hardware can do in roughly one instruction. Sigmoid requires computing eˣ, which is far more expensive. Across billions of neurons and millions of training steps, that difference adds up.

🔀 Sparsity

For a given input, a large share of the neurons (often around half) output exactly zero. This sparse activation is efficient, and neurons specialise rather than all doing the same thing.

📈 Gradient flow

For positive values, the gradient is exactly 1, so it doesn't shrink. Sigmoid's gradient is at most 0.25 and close to zero for large inputs, so multiplied across many layers it collapses toward zero and training stalls (the vanishing gradient problem). ReLU largely avoided this and helped make deep networks trainable.

💀

The Dying ReLU problem: A neuron always receiving negative input outputs 0 forever — its weights never get a gradient. It "dies." Leaky ReLU fixes this: negatives become 0.01x instead of 0, keeping a tiny gradient trickle alive.
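The gradient story can be put into numbers. This is a rough sketch, not a full backpropagation: it multiplies one activation gradient per layer across a hypothetical 20-layer chain and ignores the weights. The function names are made up for this example.

```python
import math


def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if x > 0 else slope


def sigmoid_grad(x):
    """Slope of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


depth = 20  # hypothetical number of layers

# Best case for sigmoid: every layer sits at x = 0, where the slope is 0.25.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU on the positive side: the slope stays 1 at every layer.
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid chain: {sigmoid_chain:.2e}")
print(f"relu chain:    {relu_chain}")

# Dying ReLU: a negative input gets no gradient at all, Leaky ReLU keeps a trickle.
dead = relu_grad(-2.0)
leaky = leaky_relu_grad(-2.0)
print(dead, leaky)  # 0.0 0.01
```

Even in sigmoid's best case the signal shrinks by a factor of four per layer, which is why it fades so quickly in deep stacks.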

Deep Dive · 05

The Full Network

A neural network is layers of neurons. Each layer's outputs feed into the next layer's inputs. Information flows forward. Error signals flow backward. Learning happens through repetition.

Credit risk — a real example

Let's trace "loan overdue: 90 days" through a 3-layer network.

Layer	Neuron	Activation
INPUT	loan	1.0
INPUT	overdue	0.9
INPUT	amount	0.6
HIDDEN 1 + ReLU	risk	0.87
HIDDEN 1 + ReLU	sev.	0.74
HIDDEN 1 + ReLU	hist	0.0 (3rd neuron dead)
HIDDEN 2 + ReLU	high	0.91
HIDDEN 2 + ReLU	low	0.0
OUTPUT	HIGH RISK	91%

What training does

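# Toy version: one linear neuron with 3 weights (a 70B model has 70 billion)
weights = [0.1, -0.3, 0.7]     # before training: arbitrary starting values
inputs = [1.0, 0.9, 0.6]       # loan, overdue, amount
target = 0.9                   # a 90-day overdue loan should score ~0.9
learning_rate = 0.1

def predict(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

# Untrained model predicts LOW RISK for a 90-day overdue loan
prediction = predict(weights, inputs)  # about 0.25: wrong, should be ~0.9
error = target - prediction            # about 0.65 too low

# Gradient descent: nudge each weight against the gradient of error²
for step in range(100):
    error = target - predict(weights, inputs)
    for i, x in enumerate(inputs):
        gradient = -2 * error * x               # d(error²)/d(weights[i])
        weights[i] -= learning_rate * gradient  # tiny nudge

# Repeat (billions of times in a real model) → weights encode "risky"
print(round(predict(weights, inputs), 2))  # prints 0.9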

Deep Dive · 06

Inference

Training is finished. Weights are frozen. You type a prompt — the model runs one forward pass per token, using those frozen weights to calculate what comes next.

A prompt, token by token


Prompt: "The BNPL repayment is overdue"

Weights shape every output

Each word-weight scales that word's contribution to the sentiment score. Change a weight and the score changes with it.

Weight	Value	Contribution
w_love	0.80	+0.64
w_hate	−0.70	−0.28
w_good	0.60	+0.36

Sentiment: +0.72 → 😊 Positive

Set w_love to −1. The model "forgets" that love is positive.
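The sentiment score above is just a weighted sum, which a few lines of Python can reproduce. The word activations (0.8, 0.4, 0.6) are not shown on the page; they are inferred here by dividing each displayed contribution by its weight, so treat them as an assumption.

```python
def sentiment(weights, activations):
    """Weighted sum: each word's activation times its weight."""
    return sum(weights[word] * activations[word] for word in weights)


# Inferred activations (contribution / weight); not stated in the original.
activations = {"love": 0.8, "hate": 0.4, "good": 0.6}
weights = {"love": 0.80, "hate": -0.70, "good": 0.60}

score = sentiment(weights, activations)
print(round(score, 2))  # 0.72

# Set w_love to -1: the same word now pushes the score down.
flipped = sentiment({**weights, "love": -1.0}, activations)
print(round(flipped, 2))  # -0.72
```

Nothing about the input changed between the two calls. Only one weight moved, and the verdict flipped from positive to negative.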

Next-token probabilities

After "The BNPL repayment is", the model scores every token in its vocabulary (about 50,000 entries here) and turns those scores into probabilities. The original page lists the top 5 candidates.
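Turning scores into probabilities is usually done with softmax. The sketch below shows the idea on five candidate tokens; the tokens and their scores are hypothetical, invented for this example, and a real model would score its whole vocabulary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {token: math.exp(s - peak) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Hypothetical scores for the token after "The BNPL repayment is".
scores = {"overdue": 3.1, "due": 2.4, "late": 1.7, "pending": 0.9, "paid": 0.2}

probs = softmax(scores)
top = sorted(probs.items(), key=lambda item: item[1], reverse=True)

for token, p in top:
    print(f"{token:>8}: {p:.1%}")
```

The highest score gets the largest probability, but every candidate keeps a non-zero share, which is what lets sampling pick a less likely token now and then.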

Deep Dive · 07

The Fine-Tuning Problem

A model trained on the internet knows everything and specialises in nothing. To make it great at BNPL compliance, medical notes, or legal drafting, you need to fine-tune it. The problem is the cost.

Full fine-tuning — brute force

Model	Weights to update
Llama 3 8B	8B
Llama 3 70B	70B
GPT-4	~1.8T (unconfirmed estimate)

💸

Llama 70B at 2 bytes/weight = 140GB just to store weights. Add gradient memory (another 140GB) and optimiser states (2 × 140GB = 280GB) and you reach 560GB, or roughly 600GB of GPU RAM once activations are counted. An A100 80GB GPU costs ~$15,000. You'd need 8 of them. For one fine-tuning run.
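The memory arithmetic can be written as a small calculator. This is a back-of-the-envelope sketch under the same assumptions as the text (2 bytes per weight, gradients the same size as the weights, optimiser states twice that); the function names are made up, and activation memory is left out of the function.

```python
import math


def full_finetune_memory_gb(params_billion, bytes_per_weight=2, optimizer_copies=2):
    """Rough GPU memory for full fine-tuning, in GB (1 billion bytes = 1 GB)."""
    weights_gb = params_billion * bytes_per_weight
    gradients_gb = weights_gb                     # one gradient per weight
    optimizer_gb = optimizer_copies * weights_gb  # e.g. two states per weight
    return {
        "weights": weights_gb,
        "gradients": gradients_gb,
        "optimizer": optimizer_gb,
        "total": weights_gb + gradients_gb + optimizer_gb,
    }


def gpus_needed(total_gb, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gpu_memory_gb)


estimate = full_finetune_memory_gb(70)
print(estimate["total"])   # 560
print(gpus_needed(560))    # 7
print(gpus_needed(600))    # 8
```

The bare total fits on seven 80GB cards only with nothing to spare. Rounding up to about 600GB to leave room for activations is what pushes the count to eight.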

The insight that changed everything

🔬

When you fine-tune a model, most weights barely move. The useful update ΔW lives in a tiny mathematical subspace — it is low rank. You don't need to update all 70 billion weights. You just need to approximate the update.

What "low rank" means

A 4×4 matrix update needs 16 numbers. But if the update has a pattern — a structure — you can describe it with far fewer numbers using two small matrices multiplied together.

💡

16 numbers → approximated by 4+4 = 8 numbers (a 4×1 column times a 1×4 row). That's rank 1. In real LoRA with d=4096, r=8: about 16.8 million → 65,536 (2 × 4096 × 8). A 256× reduction in parameters to train.
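Here is the rank-1 idea as a short sketch: eight numbers generate a full 4×4 update, and the same counting is then applied to d=4096, r=8. The vector values are hypothetical and the helper names are made up for this example.

```python
def outer(u, v):
    """Rank-1 matrix: entry (i, j) is u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]


# 4 + 4 = 8 numbers...
u = [1.0, 2.0, 3.0, 4.0]
v = [0.5, -1.0, 0.0, 2.0]

# ...describe all 16 entries of a 4x4 update.
delta_w = outer(u, v)
for row in delta_w:
    print(row)


def lora_param_count(d, r):
    """Two matrices of d x r numbers each."""
    return 2 * d * r


full = 4096 * 4096
lora = lora_param_count(4096, 8)
print(full, lora, full / lora)  # 16777216 65536 256.0
```

Every row of `delta_w` is a multiple of `v`, which is exactly the "pattern" that lets a handful of numbers stand in for the full grid.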

Deep Dive · 08

LoRA

Low-Rank Adaptation. Instead of updating the giant weight matrix W, you freeze it completely and learn two tiny matrices A and B. Their product approximates the update. W never changes.

The architecture

LoRA forward pass:

x ∈ ℝᵈ (input)
↓
Frozen path: W ∈ ℝᵈˣᵈ, pretrained and FROZEN, gives Wx
Side path: A ∈ ℝʳˣᵈ squeezes d→r, then B ∈ ℝᵈˣʳ expands r→d, giving BAx
↓
Wx + BAx (both paths added)
↓
h ∈ ℝᵈ (output)

Rank — the key trade-off

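LoRA params as % of full W (d=4096): 2dr / d² = 2r/d, which is about 0.39% at r = 8. A higher rank gives the update more capacity but more parameters to train.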
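To see the trade-off without the slider, this sketch sweeps a few ranks for d=4096 and prints the share of the full matrix that LoRA trains. The function name and the list of ranks are chosen for illustration.

```python
def lora_fraction(d, r):
    """Trainable LoRA parameters (A and B) divided by the full d x d matrix."""
    return (2 * d * r) / (d * d)


d = 4096
for r in (1, 4, 8, 16, 64):
    percent = 100 * lora_fraction(d, r)
    print(f"r = {r:>2}: {2 * d * r:>7,} params, {percent:.3f}% of W")
```

The share grows linearly with r, so doubling the rank doubles the training cost of the side path while W itself stays untouched.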

Why B starts at zero

A → random Gaussian
B → all zeros
∴ B×A = 0 at step 0

LoRA starts out behaving exactly like the base model.
No personality shift: it diverges gradually as B learns.
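The forward pass and the zero initialisation can be checked together in a small sketch. It uses tiny hypothetical sizes (d=6, r=2) and plain Python lists, takes A as r×d and B as d×r so that BAx is well defined, and leaves out the scaling factor that LoRA implementations usually apply to the side path.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x):
    """h = Wx + B(Ax): frozen path plus low-rank side path."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # squeeze d -> r, then expand r -> d
    return [b + u for b, u in zip(base, update)]


random.seed(0)
d, r = 6, 2  # tiny sizes for illustration

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # random Gaussian
B = [[0.0] * r for _ in range(d)]                                  # all zeros
x = [random.gauss(0, 1) for _ in range(d)]

h_start = lora_forward(W, A, B, x)
print(h_start == matvec(W, x))  # True: identical to the base model at step 0

# Pretend training has nudged one entry of B: the output now diverges.
B[0][0] = 0.5
h_later = lora_forward(W, A, B, x)
print(h_later[0] - h_start[0])
```

Only the first output changes after the nudge, because only the first row of B is non-zero. W is never written to, so removing A and B restores the base model exactly.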

Deep Dive · 09

LoRA Variants

Researchers kept asking the same question: can we push this further? Each variant squeezes more efficiency out of the low-rank idea, trading something different each time.

Deep Dive · 10

Compare All Techniques

Parameter efficiency

Technique	Trained parameters (% of full fine-tuning)
Full Fine-Tune	100%
Delta-LoRA	1.4%
LoRA (r=8)	0.8%
LoRA+ (r=8)	0.8%
LoRA-FA	0.4%
VeRA	0.05%

Full comparison

Technique	Trained	W updated?	Key idea	Best for
LoRA	A+B per layer	✗	Low-rank side path	General purpose
LoRA-FA	B only per layer	✗	Freeze A, save memory	Memory constrained
VeRA	2 vectors per layer	✗	Share A,B globally	Extreme efficiency
Delta-LoRA	A+B+W	✓	Feed delta into W	Complex tasks
LoRA+	A+B per layer	✗	B learns faster	Free upgrade

One-line mental models

LoRA — Small shortcut alongside the frozen highway

LoRA-FA — Same shortcut, but half is nailed down from day one

VeRA — One shortcut shared by all layers; each just gets a volume knob

Delta-LoRA — Shortcut gradually rewrites the highway itself

LoRA+ — Same shortcut, but the second matrix gets a turbo boost

🏦

For BNPL/compliance domain adaptation: LoRA+ on a 7–8B model is the practical starting point — cheap, reversible, swappable per use case.
