Math graphic
📐 Concept diagram

16-01 — The Perceptron Model

Phase: 16 — Neural Network Mathematics Subject: 16-01 Prerequisites: Phase 1 (linear equations), Phase 2 (vectors), Phase 3 (functions), Phase 4 (derivatives), Phase 10 (probability basics) Next subject: 16-02 — Activation Functions


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain the perceptron as a linear binary classifier and write its decision rule mathematically
  2. Derive the geometric interpretation of a perceptron as a separating hyperplane
  3. Implement the perceptron learning algorithm and explain why it converges for linearly separable data
  4. Prove that a single perceptron cannot solve the XOR problem, motivating multi-layer networks
  5. Relate the perceptron update rule to gradient descent on a surrogate loss

Core Content

1. What Is a Perceptron?

The perceptron, introduced by Frank Rosenblatt in 1958, is the simplest artificial neural network: a single neuron that makes binary decisions. It takes a vector of inputs, computes a weighted sum, and outputs +1 or −1 (or 1 or 0) based on whether that sum exceeds a threshold.

Formally, let x = [x₁, x₂, ..., xₙ]ᵀ be the input vector (with n features). The perceptron has:

The perceptron computes:

z = wx + b = Σᵢ wᵢxᵢ + b

And then applies a step function (sign function) to produce the output:

ŷ = sign(z) = { +1 if z ≥ 0 { −1 if z < 0

(Some formulations use 1/0 instead of +1/−1. Both are equivalent up to a shift in the bias; we use +1/−1 because it makes the algebra cleaner.)

Why "perceptron"? Rosenblatt envisioned it as a simplified model of a biological neuron: inputs (dendrites) are weighted (synaptic strengths), summed in the cell body (soma), and if the sum exceeds a threshold, the neuron "fires" (outputs a signal). This is a drastic simplification — real neurons are vastly more complex — but it captures the core idea of thresholded computation.

2. Geometric Interpretation: The Decision Boundary

A perceptron draws a straight line (in 2D), a plane (in 3D), or a hyperplane (in n dimensions) that separates the input space into two halves. The decision boundary is all points where z = 0:

wx + b = 0

This is the equation of a hyperplane. The weight vector w is normal (perpendicular) to this hyperplane. The bias b determines how far the hyperplane is from the origin.

Why is w normal to the hyperplane? Take any two points x₁ and x₂ on the hyperplane. Then:

wx₁ + b = 0 and wx₂ + b = 0 → wᵀ(x₁x₂) = 0

So w is orthogonal to every vector lying in the hyperplane — hence w is perpendicular to the hyperplane.

Intuition: The dot product wx measures the projection of x onto w. Points that project further in the direction of w get positive scores; points projecting in the opposite direction get negative scores. The bias shifts the cutoff point.

Example in 2D: Let w = [2, 1]ᵀ, b = −3. The decision boundary is:

2x₁ + x₂ − 3 = 0 → x₂ = 3 − 2x₁

This is a line with slope −2, y-intercept 3. Points above the line get positive outputs; points below get negative.

3. The Perceptron Learning Algorithm

How do we find the right w and b? The perceptron learning algorithm is an online (one example at a time) algorithm:

Algorithm: 1. Initialize w0 (or small random values), b ← 0 2. For each training example (x⁽ⁱ⁾, y⁽ⁱ⁾) where y⁽ⁱ⁾ ∈ {+1, −1}: - Compute prediction: ŷ = sign(wx⁽ⁱ⁾ + b) - If ŷ ≠ y⁽ⁱ⁾ (misclassification): - ww + y⁽ⁱ⁾·x⁽ⁱ⁾ - b ← b + y⁽ⁱ⁾ 3. Repeat through the dataset until no misclassifications (or until a maximum number of epochs)

Why does this update make sense? Suppose y⁽ⁱ⁾ = +1 but we predicted −1. Then wx⁽ⁱ⁾ + b < 0 (too negative). The update:

ww + x⁽ⁱ⁾, b ← b + 1

This increases wx⁽ⁱ⁾ + b, making it more positive — pushing the prediction toward +1. Conversely, if y⁽ⁱ⁾ = −1 and we predicted +1, the update subtracts x⁽ⁱ⁾, pulling the score negative.

⚠️ THIS IS CRITICAL — The perceptron update is closely related to gradient descent on a specific loss function. Understanding this connection builds intuition for how all neural networks learn through gradient-based optimization.

Connection to gradient descent: Consider the perceptron loss (also called 0-1 loss proxy):

L(w, b) = −y⁽ⁱ⁾ · (wx⁽ⁱ⁾ + b) · 1_{y⁽ⁱ⁾·(wx⁽ⁱ⁾ + b) ≤ 0}

The gradient w.r.t. w is −y⁽ⁱ⁾x⁽ⁱ⁾ when misclassified. The update ww − η·∇L = w + η·y⁽ⁱ⁾x⁽ⁱ⁾ is exactly the perceptron update with learning rate η = 1.

4. The Perceptron Convergence Theorem

Theorem (Rosenblatt, 1958): If the training data is linearly separable, the perceptron learning algorithm converges to a separating hyperplane in finitely many steps.

Proof sketch: Let w be any separating weight vector with unit norm (||w|| = 1) and margin γ > 0 (meaning y⁽ⁱ⁾·w*x⁽ⁱ⁾ ≥ γ for all i). Let R = maxᵢ ||x⁽ⁱ⁾||.

After k updates (mistakes), we can show: 1. ww₍ₖ₎ ≥ kγ (the weight vector aligns with w) 2. ||w**₍ₖ₎||² ≤ kR² (the norm grows slowly)

By Cauchy-Schwarz: ww₍ₖ₎ ≤ ||w||·||w₍ₖ₎|| = ||w**₍ₖ₎||

So: kγ ≤ √(kR²) → k ≤ (R/γ)²

Therefore, at most (R/γ)² mistakes can be made before convergence.

Key takeaway: The number of mistakes is bounded, and the bound depends on the margin (larger margin → fewer mistakes) and the scale of the data.

Important caveat: If the data is NOT linearly separable, the perceptron algorithm never converges — it will keep making mistakes forever. This is the fundamental limitation of the perceptron.

5. The XOR Problem: Why Single Perceptrons Fail

The XOR (exclusive OR) function is:

x₁ x₂ XOR(x₁, x₂)
0 0 0
0 1 1
1 0 1
1 1 0

Theorem: XOR is not linearly separable. No single perceptron can classify XOR correctly.

Proof: Suppose a perceptron with weights w₁, w₂, bias b could separate the four points. Then:

Adding inequalities (2) and (3): w₁ + w₂ + 2b ≥ 0. From (4): w₁ + w₂ + b < 0.

Subtract (4) from the sum of (2)+(3): (w₁ + w₂ + 2b) − (w₁ + w₂ + b) ≥ 0 − 0 → b ≥ 0. But from (1): b < 0. Contradiction!

Therefore, no such w₁, w₂, b exist. XOR is not linearly separable.

Geometric intuition: The four points form a square. If a line separates them, the line must put (0,0) and (1,1) on one side, and (0,1) and (1,0) on the other. But no line can do this because the positive points are diagonally opposite from the negative points — any line trying to separate them would need to "split" the square in an impossible way. Visually, try drawing a single straight line that separates the red (1) and blue (0) points on the XOR grid — you can't.

Consequence: This limitation motivated the development of multi-layer perceptrons (MLPs) — neural networks with one or more hidden layers. A two-layer network can solve XOR by composing two linear boundaries: one hidden neuron detects "at least one input is 1" and another detects "both inputs are 1", then the output neuron combines these.



Key Terms

Worked Examples

Example 1: Geometric Interpretation

Problem: A perceptron has w = [3, −2]ᵀ and b = 1. Plot the decision boundary in 2D and determine which side corresponds to output +1.

Solution:

Step 1 — Write the decision boundary equation:

3x₁ − 2x₂ + 1 = 0 → 2x₂ = 3x₁ + 1 → x₂ = 1.5x₁ + 0.5

Step 2 — Plot: This is a line with slope 1.5 passing through (0, 0.5).

Step 3 — Determine sides: Test a point, say (0,0):

z = 3·0 − 2·0 + 1 = 1 ≥ 0 → output +1

So the half-plane containing the origin gives +1.

Step 4 — The normal vector w = [3, −2]ᵀ points toward the +1 region.

Answer: The line x₂ = 1.5x₁ + 0.5 separates the plane. Points with 3x₁ − 2x₂ + 1 ≥ 0 (including the origin) get +1; the other side gets −1.

Example 2: Perceptron Learning by Hand

Problem: Train a perceptron on this dataset (using +1/−1 labels):

x₁ x₂ y
0 0 −1
0 1 −1
1 0 +1
1 1 +1

This is an AND-like function (y = +1 only when x₁ = 1). Start with w = [0, 0]ᵀ, b = 0.

Solution:

Epoch 1:

Epoch 2:

Epoch 3: Check all — all correct!

Final weights: w = [1, 0]ᵀ, b = −1

Verify: Decision boundary: x₁ − 1 = 0 → x₁ = 1. This is a vertical line at x₁ = 1. Points with x₁ ≥ 1 get +1, the rest get −1. This perfectly separates the data.

Example 3: Proving Non-Linear Separability

Problem: Prove that the following 2D dataset is not linearly separable:

x₁ x₂ y
−1 −1 +1
−1 +1 −1
+1 −1 −1
+1 +1 +1

(This is the "XNOR" or equality function — y = +1 when x₁ = x₂.)

Solution:

Suppose there exist w₁, w₂, b such that sign(w₁x₁ + w₂x₂ + b) = y for all four points.

From the four points we get: (1) w₁(−1) + w₂(−1) + b ≥ 0 → −w₁ − w₂ + b ≥ 0 (2) w₁(−1) + w₂(+1) + b < 0 → −w₁ + w₂ + b < 0 (3) w₁(+1) + w₂(−1) + b < 0 → w₁ − w₂ + b < 0 (4) w₁(+1) + w₂(+1) + b ≥ 0 → w₁ + w₂ + b ≥ 0

Add (1) and (4): (−w₁ − w₂ + b) + (w₁ + w₂ + b) ≥ 0 → 2b ≥ 0 → b ≥ 0

Add (2) and (3): (−w₁ + w₂ + b) + (w₁ − w₂ + b) < 0 → 2b < 0 → b < 0

Contradiction! b cannot be both ≥ 0 and < 0. Therefore, no such w₁, w₂, b exist.

Geometric intuition: The +1 points are on one diagonal of the square and the −1 points on the other diagonal. No straight line can separate opposite corners from each other.

Practice Problems

(Answers are below. Try each problem before checking.)

Problem 1: A perceptron has weights w = [−1, 2]ᵀ and bias b = −1. Compute the output for input x = [3, 1]ᵀ.

Problem 2: Find the equation of the decision boundary for the perceptron with w = [4, −1, 2]ᵀ and b = 5. What is the geometric shape of this boundary in 3D?

Problem 3: Apply one epoch of the perceptron algorithm to the dataset below with initial w = [0, 0]ᵀ, b = 0. Show all updates.

x₁ x₂ y
2 1 +1
−1 3 −1
0 −2 −1
1 −1 +1

Problem 4: Show that the 3D dataset below is linearly separable by finding weights and bias manually (inspection is fine).

x₁ x₂ x₃ y
0 0 0 −1
1 0 0 −1
0 0 1 +1
1 0 1 +1

Problem 5: Prove that for a perceptron with zero bias (b = 0), if x is classified as +1, then any positive scalar multiple cx (c > 0) is also classified as +1. What is the geometric interpretation?

Problem 6: Consider the OR function (y = +1 if either x₁ = 1 OR x₂ = 1, else −1). Write a perceptron that implements OR and verify it classifies all four input combinations correctly.

Problem 7: The perceptron convergence theorem gives a mistake bound of (R/γ)². For the dataset in Example 2, compute R and γ (for w* found by the algorithm) and verify that fewer mistakes were made than this bound.

Answers (click to expand) **Problem 1:** z = (−1)·3 + 2·1 − 1 = −3 + 2 − 1 = −2. Since z < 0, ŷ = −1. **Problem 2:** Decision boundary: 4x₁ − x₂ + 2x₃ + 5 = 0. This is a plane in 3D, with normal vector [4, −1, 2]ᵀ. **Problem 3:** Epoch 1: - (2,1,+1): z = 0+0+0 = 0 → ŷ = +1. Correct. - (−1,3,−1): z = 0+0+0 = 0 → ŷ = +1. Misclassified! **w** ← [0,0] + (−1)·[−1,3] = [1,−3], b ← 0 − 1 = −1 - (0,−2,−1): z = 1·0 + (−3)·(−2) − 1 = 5 → ŷ = +1. Misclassified! **w** ← [1,−3] + (−1)·[0,−2] = [1,−1], b ← −1 − 1 = −2 - (1,−1,+1): z = 1·1 + (−1)·(−1) − 2 = 0 → ŷ = +1. Correct. Final after one epoch: **w** = [1, −1]ᵀ, b = −2. The algorithm has not yet converged — additional epochs are needed. For example, checking (2,1,+1): z = 2−1−2 = −1 → ŷ = −1, a misclassification. The perceptron algorithm would continue updating in subsequent epochs until convergence. **Problem 4:** Observe that y = +1 exactly when x₃ = 1 (regardless of x₁). So **w** = [0, 0, 1]ᵀ, b = −0.5 works: sign(0·x₁ + 0·x₂ + 1·x₃ − 0.5). For x₃ = 0: z = −0.5 → −1. For x₃ = 1: z = 0.5 → +1. All four points classified correctly. **Problem 5:** If sign(**w**ᵀ**x**) = +1, then **w**ᵀ**x** ≥ 0. For any c > 0: **w**ᵀ(c**x**) = c·**w**ᵀ**x** ≥ 0. So sign remains +1. Geometric interpretation: the positive half-space is a cone from the origin — if a point is inside it, the entire ray from the origin through that point is also inside. **Problem 6:** OR: y = +1 unless x₁ = x₂ = 0. Try **w** = [1, 1]ᵀ, b = −0.5: - (0,0): z = 0+0−0.5 = −0.5 → −1 ✓ - (0,1): z = 0+1−0.5 = 0.5 → +1 ✓ - (1,0): z = 1+0−0.5 = 0.5 → +1 ✓ - (1,1): z = 1+1−0.5 = 1.5 → +1 ✓ All correct! **Problem 7:** From Example 2: the final separator is **w** = [1, 0]ᵀ, b = −1. The decision boundary is x₁ = 1. Using the bias-trick augmentation **w̃** = [1, 0, −1]ᵀ (treating the bias as weight for a constant −1 input): For the four points with augmented inputs (appending −1): - (0,0,−1): **w̃**ᵀ**x̃** = 1 → margin distance = 1/√2 ≈ 0.707 - (0,1,−1): **w̃**ᵀ**x̃** = 1 → margin distance = 1/√2 - (1,0,−1): **w̃**ᵀ**x̃** = 2 → margin distance = 2/√2 - (1,1,−1): **w̃**ᵀ**x̃** = 2 → margin distance = 2/√2 γ = min distance = 1/√2. R = max ||x̃|| = max(1, √2, √2, √3) ≈ 1.414. Bound: (R/γ)² = (1.414/0.707)² = 2² = 4. The algorithm made 4 mistakes (in 2 epochs), matching the bound exactly. ✓

Summary

  1. A perceptron is a linear classifier: ŷ = sign(wx + b), dividing input space with a hyperplane whose normal is w.
  2. The perceptron learning algorithm updates weights only on mistakes: ww + y·x, moving the boundary toward correct classification.
  3. The perceptron convergence theorem guarantees finite convergence when data is linearly separable, with a mistake bound of (R/γ)².
  4. The XOR problem demonstrates the fundamental limitation: a single perceptron cannot solve non-linearly separable problems, motivating multi-layer networks.
  5. The perceptron update rule is a precursor to gradient descent — it moves weights in the direction that reduces a surrogate loss function.

Pitfalls


Quiz

Q1: A perceptron has w = [2, −3]ᵀ, b = 4. What is its output for input x = [1, 2]ᵀ?

A) +1 B) −1 C) 0 D) Cannot determine

Answer and Explanations **Correct: A) +1** z = 2·1 + (−3)·2 + 4 = 2 − 6 + 4 = 0. With the convention sign(0) = +1, output is +1. - A) +1: ✓ Correct. z = 0 ≥ 0, so sign(0) = +1. - B) −1: You may have computed z = −4 or used sign(0) = −1 by convention. With standard perceptron, sign(0) = +1. - C) 0: The perceptron outputs +1 or −1 only — the sign function never outputs 0. - D) Cannot determine: All quantities are given; the computation is straightforward.

Q2: What does the weight vector w represent geometrically?

A) The direction of the decision boundary line B) A vector perpendicular (normal) to the decision boundary C) The distance from the origin to the decision boundary D) The slope of the decision boundary

Answer and Explanations **Correct: B) A vector perpendicular (normal) to the decision boundary** For any two points **x₁**, **x₂** on the boundary: **w**ᵀ(**x₁** − **x₂**) = 0, so **w** is orthogonal to every vector lying in the hyperplane. - A) Incorrect. The direction of the line is parallel to the boundary, not perpendicular. - B) ✓ Correct. **w** is the normal vector. This is proven by the orthogonality condition above. - C) Incorrect. The distance from the origin is |b|/||**w**||, not **w** itself. - D) Incorrect. The slope is determined by the ratio of components of **w**, but **w** itself points perpendicular to the boundary.

Q3: In the perceptron update rule ww + y⁽ⁱ⁾x⁽ⁱ⁾, why do we add y⁽ⁱ⁾x⁽ⁱ⁾ when a positive example is misclassified as negative?

A) To rotate the decision boundary B) To increase the activation for that example, making it more positive C) To decrease the norm of the weight vector D) To add the example to a support vector set

Answer and Explanations **Correct: B) To increase the activation for that example, making it more positive** If y⁽ⁱ⁾ = +1 but we predicted −1, then **w**ᵀ**x**⁽ⁱ⁾ + b < 0 (too negative). Adding **x**⁽ⁱ⁾ to **w** increases the dot product **w**ᵀ**x**⁽ⁱ⁾, pushing the score toward positive. - A) The boundary does rotate, but that's a side effect. The primary mechanism is adjusting the activation. - B) ✓ Correct. The update directly increases **w**ᵀ**x**⁽ⁱ⁾, making the prediction more positive. - C) Incorrect. Adding a vector INCREASES the norm (unless **x** is opposite to **w**). - D) Incorrect. The perceptron doesn't maintain a support vector set — that's for SVMs.

Q4: Why can't a single perceptron solve the XOR problem?

A) XOR requires 3 inputs but perceptrons only accept 2 B) The XOR data points are not linearly separable C) The perceptron algorithm doesn't converge for more than 3 data points D) XOR requires real-valued outputs, not binary

Answer and Explanations **Correct: B) The XOR data points are not linearly separable** The four XOR points form a square where +1 points are on one diagonal and −1 points on the other. No single straight line can separate them. - A) Incorrect. XOR has 2 inputs, which a perceptron can handle. The issue is separability, not dimensionality. - B) ✓ Correct. XOR is the canonical example of non-linear separability. - C) Incorrect. The algorithm works fine for any number of points — it just won't converge if the data isn't separable. - D) Incorrect. XOR outputs are binary (0/1), which is exactly what perceptrons handle.

Q5: The perceptron convergence theorem states that the number of mistakes is bounded by (R/γ)², where γ is the margin. What happens to this bound if the data points are very spread out (large R)?

A) The bound decreases — convergence is faster B) The bound increases — convergence may be slower C) The bound stays the same D) The algorithm may never converge regardless of separability

Answer and Explanations **Correct: B) The bound increases — convergence may be slower** (R/γ)² grows quadratically with R. If data points are far from the origin (large R), each misclassification can cause a large weight update, potentially undoing previous progress. More mistakes may be needed. - A) Incorrect. Larger R makes the bound larger, not smaller. - B) ✓ Correct. R represents the maximum norm of input vectors — larger inputs mean larger updates, requiring more corrections. - C) Incorrect. The bound explicitly depends on R. - D) Incorrect. If the data is linearly separable, the algorithm will always converge — just more slowly with larger R.

Q6: Suppose we add a bias trick: augment each input with a constant 1, so = [x; 1] and = [w; b]. What is the decision boundary in terms of and ?

A) = 1 B) = b C) = 0 D) = ||||

Answer and Explanations **Correct: C) **w̃**ᵀ**x̃** = 0** With augmentation: **w̃** = [w₁, ..., wₙ, b]ᵀ and **x̃** = [x₁, ..., xₙ, 1]ᵀ. Then **w̃**ᵀ**x̃** = **w**ᵀ**x** + b·1 = **w**ᵀ**x** + b. The decision boundary is where this equals 0. This "folds" the bias into the weight vector, making the boundary pass through the origin in the augmented space. - A) Incorrect. That would give **w**ᵀ**x** + b = 1, which is a shifted boundary. - B) Incorrect. That would give **w**ᵀ**x** + b = b → **w**ᵀ**x** = 0 (bias disappears). - C) ✓ Correct. **w̃**ᵀ**x̃** = **w**ᵀ**x** + b = 0 is exactly the original decision boundary. - D) Incorrect. The norm has nothing to do with the decision boundary equation.

Q7: What is the key limitation of the perceptron that multi-layer networks overcome?

A) Perceptrons can only process binary inputs B) Perceptrons can only learn linear decision boundaries C) Perceptrons cannot have a bias term D) Perceptrons require the data to be normalized

Answer and Explanations **Correct: B) Perceptrons can only learn linear decision boundaries** A single perceptron computes **w**ᵀ**x** + b, which is a linear function. The sign of a linear function defines a linear (hyperplane) boundary. Multi-layer networks compose multiple linear boundaries with non-linear activation functions, enabling them to learn arbitrarily complex decision surfaces. - A) Incorrect. Perceptrons accept real-valued inputs. - B) ✓ Correct. Linear separability is the fundamental limitation that multi-layer networks overcome. - C) Incorrect. Perceptrons can and do have bias terms (either explicit or via the bias trick). - D) Incorrect. Normalization helps but isn't required; the algorithm works on unnormalized data.

Next Steps

Move on to 16-02 — Activation Functions to learn about the non-linear functions (sigmoid, tanh, ReLU) that transform linear perceptrons into powerful multi-layer neural networks.