
Deep Learning Foundations

Deep Learning is a subfield of Machine Learning inspired by the structure and function of the human brain. While traditional ML algorithms often hit a performance ceiling as data volume increases, Deep Learning models (Neural Networks) continue to improve, making them the powerhouse behind modern AI like ChatGPT, facial recognition, and self-driving cars.

1. Neural Network Basics

A Neural Network is a collection of “neurons” arranged in layers. Information enters the Input Layer, is processed in one or more Hidden Layers, and the result is produced by the Output Layer.

  • Weights ($w$): These determine the “strength” of a signal. They are the parameters the model “learns.”
  • Biases ($b$): An extra constant added to the input to allow the model to shift the activation function.
  • Layering: The “Deep” in Deep Learning refers to having many hidden layers.
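To make weights and biases concrete, here is a minimal NumPy sketch of a single hidden layer computing its pre-activation outputs; the layer sizes and values are made up purely for illustration.

```python
import numpy as np

# Hypothetical sizes: 3 input features feeding a hidden layer of 4 neurons
x = np.array([0.5, -1.2, 3.0])      # input vector (Input Layer)
W = np.random.randn(4, 3) * 0.1     # weights: one row of 3 weights per hidden neuron
b = np.zeros(4)                     # biases: one constant per hidden neuron

# Each hidden neuron takes a weighted sum of the inputs and adds its bias
z = W @ x + b
print(z)                            # raw (pre-activation) outputs of the hidden layer
```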

2. The Perceptron

The Perceptron is the simplest form of a neural network—a single-layer unit that makes a binary decision. It takes multiple inputs, multiplies them by weights, adds them up, and passes the result through a step function.

  • The Math: $z = \sum (w_i \cdot x_i) + b$
  • Logic: If the sum is above a certain threshold, the output is 1 (True); otherwise, it is 0 (False).
  • Example: A simple Perceptron could decide if you should go to a concert based on two inputs: Is it raining? and Is the ticket free? (See the sketch below.)
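Here is a minimal sketch of that concert decision; the weights, bias, and threshold are made-up values chosen only to illustrate the idea.

```python
import numpy as np

def perceptron(x, w, b, threshold=0.0):
    """Weighted sum plus bias, passed through a step function."""
    z = np.dot(w, x) + b
    return 1 if z > threshold else 0

# Inputs: x = [is_it_raining, is_the_ticket_free]
w = np.array([-2.0, 3.0])   # rain counts against going, a free ticket counts for it
b = 0.5

print(perceptron(np.array([1, 0]), w, b))  # raining, paid ticket -> 0 (stay home)
print(perceptron(np.array([1, 1]), w, b))  # raining, free ticket -> 1 (go)
```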

3. Activation Functions

Neurons need a way to decide whether they should “fire” (pass information to the next layer). Activation functions introduce non-linearity, allowing the network to learn complex patterns. Without them, a neural network would just be a giant linear regression model.

  • ReLU (Rectified Linear Unit): The most popular. It outputs the input if it’s positive, and zero otherwise. It’s fast and efficient.
  • Sigmoid: Squashes values between 0 and 1. Great for binary classification.
  • Softmax: Used in the final layer for multi-class classification. It turns outputs into probabilities that sum to 100%.
  • Tanh (Hyperbolic Tangent): Squashes values between -1 and 1. Often used in hidden layers.
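The four functions above can be written directly from their standard definitions; here is a small NumPy sketch, not tied to any particular library.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)         # passes positives through, zeroes out negatives

def sigmoid(z):
    return 1 / (1 + np.exp(-z))     # squashes any value into (0, 1)

def tanh(z):
    return np.tanh(z)               # squashes any value into (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))       # subtract the max for numerical stability
    return e / e.sum()              # probabilities that sum to 1

z = np.array([2.0, -1.0, 0.5])
print(relu(z), sigmoid(z), tanh(z), softmax(z), sep="\n")
```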

4. Loss Functions

The Loss Function measures how “wrong” the model’s prediction is compared to the actual target. The goal of training is to minimize this loss.

  • Mean Squared Error (MSE): Used for Regression. Measures the average squared difference.
  • Binary Cross-Entropy: Used for Binary Classification. It punishes the model heavily if it is confident in the wrong answer.
  • Categorical Cross-Entropy: Used for Multi-class Classification (e.g., identifying if an image is a cat, dog, or bird).
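As a sketch, MSE and binary cross-entropy follow directly from their definitions; the numbers below are made up to show how a confidently wrong prediction is punished.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference (regression)."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy: heavy penalty for confident wrong answers."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))               # 0.25
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105 (confident and right)
print(binary_cross_entropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303 (confident and wrong)
```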

5. Backpropagation & Gradient Descent

This is the “engine” of Deep Learning.

  1. Forward Pass: Data goes through the network, and a prediction is made.
  2. Loss Calculation: The error is calculated.
  3. Backpropagation: The model works backward from the output to the input, calculating how much each weight contributed to the error. This uses the Chain Rule from calculus.
  4. Gradient Descent: The model updates the weights in the opposite direction of the “gradient” (slope) to reduce the loss.
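Here is a minimal sketch of the whole loop on a made-up one-weight problem (learning y = 2x with MSE); each line maps onto one of the four steps above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])      # target relationship: y = 2x
w, lr = 0.0, 0.1                        # initial weight and learning rate

for step in range(20):
    y_pred = w * x                                  # 1. forward pass
    loss = np.mean((y_pred - y_true) ** 2)          # 2. loss calculation (MSE)
    grad = np.mean(2 * (y_pred - y_true) * x)       # 3. backpropagation (chain rule: dLoss/dw)
    w -= lr * grad                                  # 4. gradient descent: step against the slope

print(w)   # converges toward 2.0
```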

Gradient Descent Variants

  • Batch Gradient Descent: Updates weights after looking at the entire dataset. (Slow).
  • Stochastic Gradient Descent (SGD): Updates weights after every single data point. (Fast but “noisy”).
  • Adam (Adaptive Moment Estimation): The modern standard. It adjusts the learning rate for each weight automatically, making training much faster and more stable.
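For reference, a single Adam update can be sketched from the published formula; β₁, β₂, and ε below are the usual default values.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected moving averages of the gradient and its
    square give every weight its own effective step size."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```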

6. Vanishing & Exploding Gradients

In very deep networks, backpropagation multiplies each layer's local derivative into the gradient as the error travels backward (the chain rule again), so these factors compound across many layers.

  • Vanishing Gradient: If the gradients are very small (e.g., between 0 and 1), multiplying them many times makes them shrink to nearly zero. The layers close to the input stop learning.
    • Solution: Use ReLU instead of Sigmoid.
  • Exploding Gradient: If gradients are large ($> 1$), they grow exponentially as they go backward, causing the weights to change so drastically that the model becomes unstable (“NaN” errors).
    • Solution: Gradient Clipping (capping the maximum value of a gradient).
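A small numeric sketch of both effects; the 0.25 comes from the maximum slope of the sigmoid, and the clipping threshold is an arbitrary example value.

```python
import numpy as np

# Vanishing: multiplying many small local derivatives shrinks the gradient toward zero
local_derivs = np.full(50, 0.25)     # sigmoid's derivative is at most 0.25
print(np.prod(local_derivs))         # ~7.9e-31 -- layers near the input barely learn

# Exploding: gradient clipping caps the gradient's norm before the weight update
def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([30.0, -40.0])          # norm 50 -- far too large
print(clip_by_norm(g))               # rescaled to norm 1: [0.6, -0.8]
```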

Example: Image Recognition

Imagine training a network to recognize the digit “7”.

  1. Input: A 28×28 pixel image (784 inputs).
  2. Hidden Layers: First layers might detect simple edges $\rightarrow$ middle layers detect corners/loops $\rightarrow$ final layers detect the shape of a “7”.
  3. Activation: ReLU is used in hidden layers; Softmax is used at the end to give a probability (e.g., “98% sure this is a 7”).
  4. Training: If the model says it’s a “1”, the loss is high. Backpropagation tells the network which pixel-weights were misleading, and the Adam optimizer tweaks them slightly so that next time it is more likely to say “7”. (A code sketch follows below.)
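Tying the pieces together, here is a hedged PyTorch sketch of the network described above; the hidden-layer sizes are illustrative, the batch is fake random data, and loading real MNIST images is omitted.

```python
import torch
from torch import nn

# 784 pixel inputs -> two ReLU hidden layers -> 10 digit scores
model = nn.Sequential(
    nn.Flatten(),                    # 28x28 image -> 784-long vector
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),               # raw scores for digits 0-9
)

# CrossEntropyLoss applies softmax internally, so the last layer stays linear here
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a fake batch of 32 "images"
images = torch.randn(32, 28, 28)
labels = torch.randint(0, 10, (32,))

loss = loss_fn(model(images), labels)    # forward pass + loss calculation
optimizer.zero_grad()
loss.backward()                          # backpropagation
optimizer.step()                         # Adam weight update
```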
