← zero2ai / Neurons, Weights & Fine-Tuning — Deep Dive
Deep Dive · 01

The Neuron

A neuron is a tiny calculator. It takes in numbers, multiplies them by weights, adds them up, and passes the result through one simple function. Billions of these, wired together, produce intelligence.

Analogy first

🧑‍⚖️

The Judge

Weighs multiple inputs, assigns each an importance, delivers one verdict.

🎚️

Volume Dial

Each input connection has a dial — the weight. Turn it up, that input matters more.

🚦

Traffic Light

After summing, the neuron decides: pass this signal on, or stop it here?

Three parts

Anatomy of a neuron (interactive diagram)
Inputs × weights: x₁ = 0.8 (w = 0.6), x₂ = 0.3 (w = −0.4), x₃ = 0.5 (w = 0.9)
Weighted sum: Σ
Activation: f( )
Output: h = 0.81 → next layer

Step by step

Step 1: Receive inputs
x₁=0.8, x₂=0.3, x₃=0.5. Could be pixel values, word vectors, or outputs from the previous layer. Always just numbers.

Step 2: Multiply by weights
0.8×0.6 = 0.48  ·  0.3×(−0.4) = −0.12  ·  0.5×0.9 = 0.45

Step 3: Sum everything (Σ)
0.48 + (−0.12) + 0.45 = 0.81. This is the dot product of the inputs and the weights.

Step 4: Apply activation
ReLU(0.81) = max(0, 0.81) = 0.81. If the sum were −0.3, ReLU would output 0 and the neuron would go silent.

Step 5: Fire to the next layer
0.81 becomes one input to every neuron in the next layer. The chain continues until the final answer emerges.
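The five steps above fit in a few lines of Python. This is a minimal sketch, not a production layer: the function names `relu` and `neuron` are illustrative, and it omits the bias term that real neurons usually add, because the walkthrough does not use one.

```python
def relu(z: float) -> float:
    """Rectified linear unit: keep positives, zero out negatives."""
    return max(0.0, z)


def neuron(inputs: list[float], weights: list[float]) -> float:
    """Weighted sum of the inputs followed by ReLU (no bias, as in the walkthrough)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return relu(weighted_sum)


inputs = [0.8, 0.3, 0.5]     # x1, x2, x3 from the walkthrough
weights = [0.6, -0.4, 0.9]   # w1, w2, w3 from the walkthrough

output = neuron(inputs, weights)
print(round(output, 2))  # 0.81
```

Swap in a weighted sum that comes out negative and the same function returns 0.0, which is the "neuron goes silent" case.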
Deep Dive · 02

Weights

A weight is a single number on a connection between two neurons. It controls how much that input matters. Everything a model has learned — every fact, pattern, preference — is encoded in billions of these numbers.

🏙️

If the network is a city, weights are road widths. A wide road lets lots of signal through. A zero-width road blocks it. A road running backwards (negative weight) reverses the signal.

Three kinds of weight

Positive (+0.9)

Amplifies. If this input grows, the neuron fires more strongly. "love" → boosts positive sentiment.

Negative (−0.7)

Suppresses. If this input grows, the output shrinks. "not" → dampens positivity.

Zero (0.0)

This connection contributes nothing: the input is ignored however large it gets. Training can push a weight toward zero when that input is irrelevant for the task.

Live weight explorer

Interactive demo: drag the weight sliders and the sum and ReLU output update instantly.
Try: set w_love to −1  ·  set all to 0  ·  set all to +1

Weights form a matrix

All weights between two layers live in a grid (a matrix), with one cell per connection.

The scale problem

Number of weights:
3×4 toy layer: 12
1024×1024 layer: ~1M
4096×4096 layer (LLM): ~16.8M
GPT-4 total: ~1.8T (unofficial estimate; OpenAI has not published the figure)
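The per-layer figures are just rows times columns. A small sketch, assuming a fully connected layer and counting weights only (the helper name `dense_weight_count` is illustrative, and biases are left out):

```python
def dense_weight_count(n_in: int, n_out: int) -> int:
    """Number of weights connecting n_in inputs to n_out neurons (biases not counted)."""
    return n_in * n_out


layer_shapes = [(3, 4), (1024, 1024), (4096, 4096)]
counts = {shape: dense_weight_count(*shape) for shape in layer_shapes}

for (n_in, n_out), count in counts.items():
    print(f"{n_in}x{n_out}: {count:,} weights")
```

A full model stacks many such matrices, which is how the total climbs into the billions.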
Deep Dive · 03

Activation Functions

Activation functions are one of the least-understood steps in a neural network, and one of the most important. Without them, a 100-layer network is mathematically equivalent to a single linear layer.

The collapse problem

Layer 2 output = W₂ × (W₁ × input)
             = (W₂ × W₁) × input
             = W_combined × input

🚨

Without activation: no matter how many layers you add, the whole network is just one big matrix multiplication. It can only fit straight-line (linear) relationships, so it cannot learn curves or the non-linear patterns that most real-world data contains.
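The collapse is easy to check numerically. The sketch below uses two small made-up matrices (W1, W2 and x are illustrative values, not taken from any model) and plain Python lists, and compares two stacked linear layers with the single combined matrix. Adding ReLU between the layers breaks the equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def relu_vec(v):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, z) for z in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]   # illustrative first-layer weights
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # illustrative second-layer weights
x = [1.0, 1.0]

two_layers = matvec(W2, matvec(W1, x))           # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)            # (W2 W1) x
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # W2 ReLU(W1 x)

print(two_layers, one_layer, with_relu)
```

The first two results are identical, so the second layer added nothing a single matrix could not do. The third differs because ReLU zeroed the negative intermediate value before the second layer saw it.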

📏

Without Activation

Can only separate data with a straight line — like trying to separate cats from dogs using only a ruler.

🌊

With Activation

Given enough neurons, can approximate curves, spirals, and other shapes, discovering patterns no straight line could capture.

Visualise different functions

Interactive plot with one tab per function: ReLU, Sigmoid, Tanh, GELU, Leaky ReLU.

Same input, five functions

Drag the slider. Watch how each function responds differently to the same number. The demo starts at z = 1.5.
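To compare the five functions without the slider, here is a sketch using only Python's `math` module. The GELU shown is the exact erf-based form, and the Leaky ReLU slope of 0.01 is an assumed default; libraries may use approximations or other slopes.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z: float) -> float:
    return math.tanh(z)


def gelu(z: float) -> float:
    # Exact GELU: z times the standard normal CDF at z
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def leaky_relu(z: float, slope: float = 0.01) -> float:
    return z if z > 0 else slope * z


activations = {
    "ReLU": relu,
    "Sigmoid": sigmoid,
    "Tanh": tanh,
    "GELU": gelu,
    "Leaky ReLU": leaky_relu,
}

for z in (1.5, -1.5):
    for name, fn in activations.items():
        print(f"z={z:+.1f}  {name:<10} -> {fn(z):+.4f}")
```

At z = 1.5 ReLU and Leaky ReLU pass the value through, sigmoid and tanh squash it below 1, and GELU lands slightly under 1.5. The negative input is where the functions disagree most.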
Deep Dive · 04

ReLU

Rectified Linear Unit. Four letters, one line of maths. Arguably the most influential activation function in the history of deep learning, and genuinely one of the simplest ideas.

The complete formula

ReLU(x) = max(0, x)

Positive → keep it.   Negative → zero.

Positive input

ReLU(0.81) = 0.81
ReLU(3.2) = 3.2
ReLU(0.001) = 0.001

Passes through unchanged.

Negative input

ReLU(−0.5) = 0
ReLU(−9.3) = 0
ReLU(−0.001) = 0

Killed. The neuron goes silent.

Why ReLU won

⚡ Speed

max(0, x) is a single comparison, typically one hardware instruction. Sigmoid requires computing an exponential (e⁻ˣ). With billions of neurons and millions of training steps, that difference adds up.

🔀 Sparsity

A large share of the neurons, often around half, output zero for any given input. This sparse activation is efficient: neurons specialise rather than all doing the same thing.

📈 Gradient flow

For positive values, the gradient is exactly 1, so it doesn't shrink. Sigmoid's gradient is at most 0.25 and collapses toward zero as it is multiplied through many layers, which stalled training in deep networks. ReLU largely removed this bottleneck and helped make deep networks trainable.

💀

The Dying ReLU problem: A neuron always receiving negative input outputs 0 forever — its weights never get a gradient. It "dies." Leaky ReLU fixes this: negatives become 0.01x instead of 0, keeping a tiny gradient trickle alive.
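The gradient story can be put in numbers. The sketch below is a deliberate simplification: it multiplies only the activation's local gradient through `depth` layers and ignores the weight matrices that real backpropagation also multiplies in. The function names and the depth of 10 are illustrative choices.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if z > 0 else slope


def chained_gradient(local_grad: float, depth: int) -> float:
    """Multiply one layer's local gradient through `depth` layers."""
    return local_grad ** depth


depth = 10
sigmoid_best_case = chained_gradient(sigmoid_grad(0.0), depth)  # 0.25 ** 10
relu_active = chained_gradient(relu_grad(2.0), depth)           # 1.0 ** 10
relu_dead = relu_grad(-2.0)          # no gradient: the neuron cannot recover
leaky_alive = leaky_relu_grad(-2.0)  # small but non-zero trickle

print(sigmoid_best_case, relu_active, relu_dead, leaky_alive)
```

Even in sigmoid's best case the signal shrinks by about a million times over ten layers, while an active ReLU path passes it through untouched. The last two values show the dying-ReLU trade-off: 0.0 for plain ReLU on a negative input versus 0.01 for the leaky version.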

Deep Dive · 05

The Full Network

A neural network is layers of neurons. Each layer's outputs feed into the next layer's inputs. Information flows forward. Error signals flow backward. Learning happens through repetition.

Credit risk — a real example

Let's trace "loan overdue, 90 days" through a 3-layer network.

3-layer network (interactive diagram)
Input: loan = 1.0, overdue = 0.9, amount = 0.6
Hidden 1 + ReLU: risk = 0.87, sev. = 0.74, hist = 0.0 (3rd neuron dead)
Hidden 2 + ReLU: high = 0.91, low = 0.0
Output: HIGH RISK, 91%

What training does

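# Toy stand-in for training: 3 weights instead of 70 billion
inputs = [1.0, 0.9, 0.6]      # loan, overdue, amount
weights = [0.3, -0.4, 0.3]    # before training: effectively random (illustrative values)
target = 0.9                  # a 90-day overdue loan should score ~0.9
learning_rate = 0.1

# Model predicts LOW RISK for a 90-day overdue loan
prediction = sum(x * w for x, w in zip(inputs, weights))  # 0.12: wrong
error = target - prediction                               # 0.78 too low

# Backpropagation: push the error signal backwards, then nudge each weight
for step in range(100):
    prediction = sum(x * w for x, w in zip(inputs, weights))
    error = target - prediction
    # For a linear neuron with squared error, the gradient of the loss with
    # respect to each weight is -error * x, so gradient descent adds error * x
    weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]

# Repeat (billions of times in a real model) → weights encode "risky"
print(round(prediction, 2))  # 0.9
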
Deep Dive · 06

Inference

Training is finished. Weights are frozen. You type a prompt — the model runs one forward pass per token, using those frozen weights to calculate what comes next.

A prompt, token by token

Prompt: "The BNPL repayment is overdue"

Weights shape every output

Change these word-weights. The sentiment score updates instantly.

Weight → contribution to the score:
w_love = 0.80 → +0.64
w_hate = −0.70 → −0.28
w_good = 0.60 → +0.36
Sentiment: +0.72 😊 Positive

Set w_love to −1. The model "forgets" that love is positive.
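The score is the same weighted sum as before. In the sketch below the weights come from the demo, while the input strengths (0.8, 0.4, 0.6) are an assumption: they are inferred by dividing each displayed contribution by its weight, and the page does not state them.

```python
def sentiment(weights: dict[str, float], strengths: dict[str, float]) -> float:
    """Sum of weight * input strength over all words."""
    return sum(weights[word] * strengths[word] for word in weights)


weights = {"love": 0.80, "hate": -0.70, "good": 0.60}   # from the demo
strengths = {"love": 0.8, "hate": 0.4, "good": 0.6}     # inferred, not stated

score = sentiment(weights, strengths)

# Same inputs, but w_love dragged to -1
flipped = sentiment({**weights, "love": -1.0}, strengths)

print(round(score, 2), round(flipped, 2))  # 0.72 -0.72
```

Nothing about the input changed between the two calls. Only one weight moved, and the verdict flipped from positive to negative.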

Next-token probabilities

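After "The BNPL repayment is", the model scores every token in its vocabulary (about 50,000 entries) and converts the scores into probabilities. The highest-probability tokens are the candidates for the next word.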
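Turning scores into probabilities is usually done with softmax. The sketch below is illustrative only: the five candidate tokens and their scores are made up for the example, and a real model would apply the same step to its whole vocabulary rather than five entries.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates and scores, not real model output
candidates = ["overdue", "due", "late", "pending", "complete"]
scores = [3.1, 2.4, 1.9, 0.7, -0.5]

probs = softmax(scores)
for token, p in sorted(zip(candidates, probs), key=lambda pair: -pair[1]):
    print(f"{token:<10} {p:.1%}")
```

Higher scores get disproportionately more probability, but every candidate keeps a non-zero share, which is what lets sampling occasionally pick a less likely token.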

Deep Dive · 07

The Fine-Tuning Problem

A model trained on the internet knows everything and specialises in nothing. To make it great at BNPL compliance, medical notes, or legal drafting, you need to fine-tune it. The problem is the cost.

Full fine-tuning — brute force

Weights to update in a full fine-tune:
Llama 3 8B: 8B
Llama 3 70B: 70B
GPT-4: ~1.8T (unofficial estimate)

💸

Llama 70B at 2 bytes/weight = 140GB just to store the weights. Add gradient memory (another 140GB) and optimiser states (2 × 140GB = 280GB) and you are at 560GB before activations, so roughly 600GB of GPU RAM in practice. An A100 80GB GPU costs roughly $15,000, and you'd need 8 of them. For one fine-tuning run.
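The memory estimate is simple arithmetic. The sketch below follows the text's accounting (weights, one gradient per weight, and two optimiser values per weight, all at 2 bytes). The function name is illustrative, and activations and framework overhead are left out, so the result is a floor rather than a full budget.

```python
import math


def full_finetune_memory_gb(n_params: int,
                            bytes_per_value: int = 2,
                            optimizer_copies: int = 2) -> dict[str, float]:
    """Rough memory floor for full fine-tuning, in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_value / 1e9
    gradients = weights                       # one gradient per weight
    optimizer = optimizer_copies * weights    # two optimiser values per weight
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


budget = full_finetune_memory_gb(70_000_000_000)
gpus_floor = math.ceil(budget["total"] / 80)   # 80GB cards, zero headroom

print(budget, gpus_floor)
```

The floor works out to 560GB, which seven 80GB cards would cover with nothing to spare. Activations and overhead come on top of that, which is why the estimate rounds up to about 600GB and 8 GPUs.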

The insight that changed everything

🔬

When you fine-tune a model, most weights barely move. In practice the useful update ΔW tends to live in a tiny mathematical subspace: it is approximately low rank. You don't need to update all 70 billion weights. You just need to approximate the update.

What "low rank" means

A 4×4 matrix update needs 16 numbers. But if the update has a pattern — a structure — you can describe it with far fewer numbers using two small matrices multiplied together.

💡

16 numbers → approximated by 4+4 = 8 numbers (a 4×1 column times a 1×4 row). That's rank 1. In real LoRA with d=4096, r=8: 4096×4096 ≈ 16.8 million → 2×4096×8 = 65,536. A 256× reduction in parameters to train.
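A rank-1 update is an outer product: one column vector times one row vector. The sketch below builds a full 4×4 update from 8 numbers and then repeats the parameter count for d=4096, r=8. The vectors `u` and `v` are arbitrary illustrative values and the helper names are made up for this example.

```python
def outer(u: list[float], v: list[float]) -> list[list[float]]:
    """Outer product: a len(u) x len(v) matrix of rank 1."""
    return [[ui * vj for vj in v] for ui in u]


def lora_param_count(d: int, r: int) -> int:
    """Parameters in the two low-rank factors of a d x d update (d*r + r*d)."""
    return 2 * d * r


u = [1.0, 2.0, 0.5, -1.0]   # 4 numbers (illustrative)
v = [0.5, -1.0, 2.0, 1.0]   # 4 numbers (illustrative)
delta_w = outer(u, v)       # 16 entries described by 8 numbers

d, r = 4096, 8
full = d * d
low_rank = lora_param_count(d, r)
reduction = full / low_rank

print(delta_w)
print(full, low_rank, reduction)
```

Every row of `delta_w` is a scaled copy of `v`, which is the structure that lets 8 numbers stand in for 16. At d=4096 the same trick gives 65,536 trainable numbers instead of 16,777,216.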

Deep Dive · 08

LoRA

Low-Rank Adaptation. Instead of updating the giant weight matrix W, you freeze it completely and learn two tiny matrices A and B. Their product approximates the update. W never changes.

The architecture

LoRA forward pass (interactive diagram)
x ∈ ℝᵈ: input
W ∈ ℝᵈˣᵈ: pretrained, FROZEN, produces Wx
A ∈ ℝʳˣᵈ: squeezes d→r
B ∈ ℝᵈˣʳ: expands r→d, produces BAx
Wx + BAx: both paths added
h ∈ ℝᵈ: output

Rank — the key trade-off

r = 8
LoRA params as % of full W (d=4096): A and B together hold 2×d×r parameters, which is 2r/d ≈ 0.39% of a single d×d matrix.
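The trade-off is linear in r. The sketch below sweeps a few ranks for a single d×d matrix with d=4096. The chosen ranks are illustrative, and whole-model percentages will differ depending on which matrices receive adapters.

```python
def lora_fraction(d: int, r: int) -> float:
    """Trainable LoRA parameters (2*d*r) as a fraction of the full d x d matrix."""
    return (2 * d * r) / (d * d)


d = 4096
for r in (1, 4, 8, 16, 64):
    print(f"r={r:>2}: {100 * lora_fraction(d, r):.2f}% of full W")
```

Doubling r doubles the trainable parameters. Once r reaches d/2 the two factors hold as many numbers as W itself, so the savings only exist while r stays small relative to d.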

Why B starts at zero

A → random Gaussian   B → all zeros
∴ B×A = 0 at step 0

LoRA starts behaving exactly like the base model.
No personality shift — it diverges gradually as B learns.
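The zero-start behaviour can be checked directly. This is a toy sketch in plain Python, not any library's implementation: it assumes A is stored as an r×d matrix (so Ax squeezes d to r) and B as d×r, uses tiny made-up sizes (d=4, r=2), and leaves out the scaling factor that LoRA implementations typically apply to the BAx path.

```python
import random


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def lora_forward(W, A, B, x):
    """h = W x + B (A x): frozen path plus low-rank side path."""
    base = matvec(W, x)                 # frozen pretrained path
    side = matvec(B, matvec(A, x))      # squeeze d -> r, then expand r -> d
    return [b + s for b, s in zip(base, side)]


random.seed(0)
d, r = 4, 2
W = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]   # frozen
A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]   # random Gaussian
B = [[0.0] * r for _ in range(d)]                                    # all zeros
x = [1.0, 0.5, -0.5, 2.0]

h_start = lora_forward(W, A, B, x)
assert h_start == matvec(W, x)   # step 0: identical to the base model

# Pretend training has moved B away from zero (illustrative values)
B_trained = [[0.1] * r for _ in range(d)]
h_after = lora_forward(W, A, B_trained, x)
print(h_start, h_after)
```

W is never written to in either call. Only B changed between the two outputs, which is why the adapter can be removed or swapped without touching the base model.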

Deep Dive · 09

LoRA Variants

Researchers kept asking the same question: can we push this further? Each variant squeezes more efficiency out of the low-rank idea, trading something different each time.

Deep Dive · 10

Compare All Techniques

Parameter efficiency

Trainable parameters as a share of full fine-tuning:
Full Fine-Tune: 100%
Delta-LoRA: 1.4%
LoRA (r=8): 0.8%
LoRA+ (r=8): 0.8%
LoRA-FA: 0.4%
VeRA: 0.05%

Full comparison

Technique  | Trained             | W updated? | Key idea              | Best for
LoRA       | A+B per layer       | No         | Low-rank side path    | General purpose
LoRA-FA    | B only per layer    | No         | Freeze A, save memory | Memory constrained
VeRA       | 2 vectors per layer | No         | Share A,B globally    | Extreme efficiency
Delta-LoRA | A+B+W               | Yes        | Feed delta into W     | Complex tasks
LoRA+      | A+B per layer       | No         | B learns faster       | Free upgrade

One-line mental models

LoRA — Small shortcut alongside the frozen highway
LoRA-FA — Same shortcut, but half is nailed down from day one
VeRA — One shortcut shared by all layers; each just gets a volume knob
Delta-LoRA — Shortcut gradually rewrites the highway itself
LoRA+ — Same shortcut, but the second matrix gets a turbo boost
🏦

For BNPL/compliance domain adaptation: LoRA+ on a 7–8B model is a practical starting point: cheap, reversible, and swappable per use case.
