Mathematics is the engine behind AI & ML.
While libraries hide the formulas, true understanding comes from knowing what the math is actually doing.
This module builds strong intuition so learners can design models, debug training issues, and explain concepts in interviews.
Linear Algebra is the language of data and models.
Almost everything in ML — datasets, weights, activations — is represented using vectors and matrices.
Scalar
A single numerical value.
Example: a learning rate such as 0.001.
Used to: represent learning rates, regularization parameters, loss values, and accuracy scores.
Vector
An ordered list of numbers representing features.
Example: [size, bedrooms, price]
In ML: each data point (sample) is a feature vector.
Matrix
A 2D collection of numbers (rows × columns).
Example: a 4 × 3 dataset of 4 houses, each described by 3 features.
In ML: a dataset is a matrix whose rows are samples and whose columns are features.
Matrix operations allow efficient computation on large datasets.
Matrix Addition
Element-wise addition; both matrices must have the same dimensions.
Matrix Multiplication
Combines rows of the first matrix with columns of the second.
Example:
Prediction = Data Matrix × Weight Matrix
This single operation computes predictions for thousands of samples at once.
Transpose
Flips rows and columns.
Dot Product
The dot product measures similarity between two vectors.
Formula:
a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
Interpretation: large positive values mean similar directions, zero means orthogonal (unrelated), negative means opposite directions.
ML Use: similarity search, attention scores, recommendations, and neural-network forward passes.
Eigenvalues & Eigenvectors
These represent important directions in data.
Simple idea: an eigenvector is a direction a matrix only stretches, never rotates; the eigenvalue is the stretch factor.
ML Use: PCA for dimensionality reduction, spectral clustering, and PageRank.
Matrix Inversion
Matrix inversion allows solving systems of equations.
Example: solving A × x = b as x = A⁻¹ × b.
Important Note: only square matrices with a non-zero determinant are invertible.
In practice: prefer np.linalg.solve() over computing the inverse explicitly; it is faster and numerically more stable.
ML is fundamentally about uncertainty and prediction.
Probability helps models make informed guesses, not exact answers.
Mean
The average value of the data.
Median
The middle value when sorted; robust to outliers.
Variance
The average squared deviation from the mean; measures spread.
ML Importance: these statistics drive feature scaling, outlier detection, and understanding of data spread.
Basic Rules
Probabilities lie between 0 and 1, and the probabilities of all possible outcomes sum to 1.
Conditional Probability
P(A|B): the probability of A given that B has occurred.
Used in: Bayes theorem, Naive Bayes classifiers, and language models, as sketched below.
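A minimal sketch of conditional probability computed from counts (the email numbers are invented for illustration):
# Hypothetical counts from 100 emails: 30 are spam,
# and 24 of those spam emails contain the word "free"
total_emails = 100
spam_emails = 30
spam_with_free = 24
p_spam = spam_emails / total_emails # P(spam) = 0.3
p_free_given_spam = spam_with_free / spam_emails # P("free" | spam) = 0.8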
Random Variables
A random variable maps outcomes to numbers.
Types: discrete (countable outcomes, like a coin flip) and continuous (real-valued, like height).
ML models learn distributions of random variables.
Distributions describe how data is spread.
Normal Distribution
The bell curve, defined by its mean and variance.
Used in: weight initialization, modeling noise, and many statistical tests.
Bernoulli & Binomial
Model a single binary outcome and the count of successes over repeated trials (sampled in the sketch below).
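A quick sketch of sampling from these distributions with NumPy (the seed is arbitrary):
import numpy as np
rng = np.random.default_rng(seed=42)
# Normal: mean 0, standard deviation 1
normal_samples = rng.normal(loc=0.0, scale=1.0, size=1000)
print(normal_samples.mean(), normal_samples.std()) # ≈ 0, ≈ 1
# Bernoulli: one trial with success probability p
bernoulli_sample = rng.binomial(n=1, p=0.3) # 0 or 1
# Binomial: number of successes in 10 trials
binomial_sample = rng.binomial(n=10, p=0.3) # 0 to 10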
Bayes Theorem
Bayes theorem updates beliefs using evidence.
Formula:
P(A|B) = P(B|A) * P(A) / P(B)
Intuition: start with a prior belief P(A), then revise it after observing evidence B.
ML Use: Naive Bayes classifiers and spam filtering, worked through below.
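A small worked example of the formula, reusing the illustrative spam numbers from above:
p_spam = 0.3 # prior: P(spam)
p_free_given_spam = 0.8 # likelihood: P("free" | spam)
p_free_given_ham = 0.1 # assumed: P("free" | not spam)
# Evidence: total probability of seeing "free"
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
# Posterior: P(spam | "free") by Bayes theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free) # ≈ 0.774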
Hypothesis Testing
Used to validate assumptions using data.
Steps: state a null hypothesis, collect data, compute a test statistic, and compare the p-value against a significance level.
ML Use: A/B testing and comparing model performance, as in the sketch below.
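As a sketch, a two-sample t-test comparing two models' cross-validation scores (assumes SciPy is available; the scores are made up):
import numpy as np
from scipy import stats
model_a = np.array([0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82])
model_b = np.array([0.78, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80, 0.79])
# Null hypothesis: both models have the same mean accuracy
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, reject the null: the difference is unlikely to be chance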
Covariance
Measures how two variables change together; its scale depends on the units.
Correlation
Normalized covariance, always in [-1, 1].
ML Use: feature selection, detecting redundant features, and building the covariance matrix behind PCA (see the sketch below).
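A minimal sketch using NumPy's built-ins (the house numbers are illustrative):
import numpy as np
sqft = np.array([1000, 1500, 2000, 2500, 3000])
price = np.array([200, 260, 310, 330, 400]) # in $1000s
# Covariance matrix: diagonal = variances, off-diagonal = covariance
print(np.cov(sqft, price)[0, 1]) # positive: they increase together
# Correlation matrix: normalized to [-1, 1]
print(np.corrcoef(sqft, price)[0, 1]) # ≈ 0.99: strong linear relationship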
Calculus enables learning in ML.
Without calculus, models cannot improve.
Derivatives
A derivative measures rate of change.
Example: if f(x) = x², then f'(x) = 2x.
ML Meaning: how much the loss changes when a parameter changes slightly (checked numerically below).
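A quick numerical check of this example using a central finite difference (a standard approximation, not a library call):
def f(x):
    return x ** 2
def derivative(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)
print(derivative(f, 3.0)) # ≈ 6.0, matching f'(x) = 2x at x = 3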
Partial Derivatives
Used when functions depend on multiple variables.
ML Use: a model has many parameters, and each one gets its own partial derivative of the loss.
Gradient
The gradient is the vector of all partial derivatives.
Interpretation: it points in the direction of steepest increase.
In ML: parameters are moved opposite to the gradient to reduce the loss, as sketched below.
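A sketch that computes a gradient numerically, one partial derivative at a time:
import numpy as np
def f(v):
    # f(x, y) = x² + 3y²; analytic gradient is [2x, 6y]
    return v[0] ** 2 + 3 * v[1] ** 2
def gradient(f, v, h=1e-5):
    grad = np.zeros_like(v)
    for i in range(len(v)):
        v_plus, v_minus = v.copy(), v.copy()
        v_plus[i] += h
        v_minus[i] -= h
        grad[i] = (f(v_plus) - f(v_minus)) / (2 * h)
    return grad
print(gradient(f, np.array([1.0, 2.0]))) # ≈ [2. 12.]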
Chain Rule
Allows differentiation of composed functions.
ML Use: backpropagation, which trains neural networks layer by layer.
Without chain rule: deep learning would not exist. A numeric illustration follows.
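A tiny numeric illustration of the chain rule on a composed function:
# g(x) = 3x and f(u) = u², so d/dx f(g(x)) = 2(3x) * 3 = 18x
def composed(x):
    return (3 * x) ** 2
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)
print(derivative(composed, 2.0)) # ≈ 36.0 = 18 * 2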
Optimization finds the best parameters.
Gradient Descent
Repeatedly update parameters in the direction that reduces the loss.
Learning Rate
The step size: too large and training diverges, too small and training crawls.
Loss Functions
Measure error between prediction and reality.
Examples: Mean Squared Error for regression, Cross-Entropy for classification.
Optimization minimizes loss.
Without math: ML is a black box.
With math: you can design, debug, and explain models. The sketch below ties these pieces together.
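Tying these pieces together, a minimal gradient-descent sketch that fits a one-parameter line by minimizing Mean Squared Error (toy data, not the module's dataset):
import numpy as np
# Toy data generated from y = 2x, so the true weight is 2
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = 0.0 # initial guess
learning_rate = 0.01
for step in range(200):
    y_pred = w * X # prediction
    loss = np.mean((y_pred - y) ** 2) # MSE loss
    grad = np.mean(2 * (y_pred - y) * X) # dLoss/dw
    w -= learning_rate * grad # step opposite the gradient
print(w) # ≈ 2.0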
Scalar
Definition: A single numerical value (0-dimensional).
import numpy as np
# Scalars in Python
learning_rate = 0.001
temperature = 37.5
count = 100
# In NumPy
scalar = np.array(5)
print(scalar.ndim) # 0 dimensions
Use in ML: Learning rates, regularization parameters, loss values, accuracy scores.
Vector
Definition: 1-dimensional array of numbers (ordered list).
Mathematical Notation:
v = [v₁, v₂, v₃, ..., vₙ]
Types: row vectors (1 × n) and column vectors (n × 1).
import numpy as np
# Creating vectors
row_vector = np.array([1, 2, 3, 4])
column_vector = np.array([[1], [2], [3], [4]])
# Alternative column vector
col_vec = np.array([1, 2, 3, 4]).reshape(-1, 1)
print(row_vector.shape) # (4,)
print(column_vector.shape) # (4, 1)
# Vector properties
length = len(row_vector) # 4
dimension = row_vector.ndim # 1
ML Examples:
# Feature vector (one data point)
house_features = np.array([1500, 3, 2, 2010]) # sqft, bedrooms, bathrooms, year
# [square_feet, num_bedrooms, num_bathrooms, year_built]
# Word embedding vector (represents a word)
word_embedding = np.array([0.2, -0.4, 0.7, 0.1, -0.3])
# Model predictions for multiple classes
probability_vector = np.array([0.1, 0.2, 0.05, 0.65]) # Class probabilities
Vector Operations:
# Vector addition
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v_sum = v1 + v2 # [5, 7, 9]
# Scalar multiplication
scalar = 2
v_scaled = scalar * v1 # [2, 4, 6]
# Element-wise multiplication
v_mult = v1 * v2 # [4, 10, 18]
# Vector magnitude (L2 norm)
magnitude = np.linalg.norm(v1) # √(1² + 2² + 3²) = √14
# Unit vector (normalized)
unit_vector = v1 / magnitude
Vector Norm (Magnitude):
# L2 norm (Euclidean distance)
v = np.array([3, 4])
l2_norm = np.linalg.norm(v) # √(3² + 4²) = 5
# L1 norm (Manhattan distance)
l1_norm = np.sum(np.abs(v)) # |3| + |4| = 7
# Used in regularization
def l2_regularization(weights, lambda_param=0.01):
    return lambda_param * np.linalg.norm(weights) ** 2
def l1_regularization(weights, lambda_param=0.01):
    return lambda_param * np.sum(np.abs(weights))
Matrix
Definition: 2-dimensional array of numbers (rows × columns).
Mathematical Notation:
    ⎡ a₁₁ a₁₂ a₁₃ ⎤
A = ⎢ a₂₁ a₂₂ a₂₃ ⎥
    ⎣ a₃₁ a₃₂ a₃₃ ⎦
import numpy as np
# Creating matrices
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(matrix.shape) # (3, 3) - 3 rows, 3 columns
print(matrix.ndim) # 2 dimensions
# Common matrix types
zeros = np.zeros((3, 4)) # All zeros
ones = np.ones((2, 3)) # All ones
identity = np.eye(3) # Identity matrix (diagonal 1s)
random = np.random.rand(3, 3) # Random values [0, 1)
# Identity matrix (I)
# ⎡ 1 0 0 ⎤
# ⎢ 0 1 0 ⎥
# ⎣ 0 0 1 ⎦
ML Examples:
# Dataset matrix (rows=samples, columns=features)
X = np.array([
[1500, 3, 2010], # House 1
[2000, 4, 2015], # House 2
[1200, 2, 2005], # House 3
[1800, 3, 2012] # House 4
])
# Shape: (4, 3) - 4 samples, 3 features
# Weight matrix in neural network
W = np.array([
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]
])
# Shape: (2, 3) - connects 3 inputs to 2 outputs
# Image as matrix
image = np.random.rand(28, 28) # 28×28 grayscale image
rgb_image = np.random.rand(28, 28, 3) # 28×28×3 color image
Matrix Properties:
matrix = np.array([[1, 2], [3, 4]])
# Transpose (flip rows and columns)
transposed = matrix.T
# [[1, 2], [[1, 3],
# [3, 4]] → [2, 4]]
# Diagonal
diagonal = np.diag(matrix) # [1, 4]
# Trace (sum of diagonal)
trace = np.trace(matrix) # 1 + 4 = 5
# Determinant
det = np.linalg.det(matrix) # 1*4 - 2*3 = -2
# Rank
rank = np.linalg.matrix_rank(matrix)
Matrix Addition
Rule: Matrices must have the same dimensions.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Addition
C = A + B
# [[1+5, 2+6], [[6, 8],
# [3+7, 4+8]] = [10, 12]]
# Subtraction
D = A - B
# [[-4, -4],
# [-4, -4]]
# Scalar multiplication
E = 2 * A
# [[2, 4],
# [6, 8]]
Matrix Multiplication
Rule: For A(m×n) × B(n×p), the number of columns in A must equal the number of rows in B. The result is (m×p).
Element calculation:
C[i,j] = Σ A[i,k] × B[k,j]
# Example 1: Basic multiplication
A = np.array([[1, 2],
[3, 4]]) # 2×2
B = np.array([[5, 6],
[7, 8]]) # 2×2
C = np.dot(A, B) # or A @ B
# [[1*5 + 2*7, 1*6 + 2*8], [[19, 22],
# [3*5 + 4*7, 3*6 + 4*8]] = [43, 50]]
# Example 2: Different dimensions
A = np.array([[1, 2, 3]]) # 1×3
B = np.array([[4], [5], [6]]) # 3×1
C = np.dot(A, B)
# [[1*4 + 2*5 + 3*6]] = [[32]] # 1×1 result
# Example 3: Neural network forward pass
X = np.array([[1, 2, 3]]) # 1×3 (input)
W = np.array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]]) # 3×2 (weights)
output = np.dot(X, W)
# [[1*0.1 + 2*0.3 + 3*0.5, 1*0.2 + 2*0.4 + 3*0.6]]
# = [[2.2, 2.8]] # 1×2 (output for 2 neurons)
ML Application – Batch Processing:
# Multiple samples (batch)
X = np.array([
[1, 2, 3], # Sample 1
[4, 5, 6], # Sample 2
[7, 8, 9] # Sample 3
]) # 3×3 (3 samples, 3 features)
W = np.array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]
]) # 3×2 (weights)
# Forward pass for entire batch
output = np.dot(X, W) # 3×2
# Each row is the output for one sample
print(output)
# [[2.2, 2.8],
# [4.9, 6.4],
# [7.6, 10.0]]
Element-wise vs Matrix Multiplication:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Element-wise (Hadamard product)
elementwise = A * B
# [[1*5, 2*6], [[5, 12],
# [3*7, 4*8]] = [21, 32]]
# Matrix multiplication
matrix_mult = np.dot(A, B) # or A @ B
# [[19, 22],
# [43, 50]]
# Linear transformation
A = np.array([
[2, 0],
[0, 3]
]) # 2×2 matrix
v = np.array([1, 2]) # 2D vector
result = np.dot(A, v)
# [2*1 + 0*2, = [2,
# 0*1 + 3*2] 6]
# ML Example: Linear layer
weights = np.array([
[0.5, -0.3, 0.8],
[0.2, 0.6, -0.4]
]) # 2×3
features = np.array([1.0, 2.0, 3.0]) # 3 features
output = np.dot(weights, features)
# [0.5*1 + (-0.3)*2 + 0.8*3, = [2.3,
# 0.2*1 + 0.6*2 + (-0.4)*3] 0.2]
Dot Product
Definition: Scalar result from multiplying corresponding elements and summing.
Formula:
a · b = Σ aᵢ × bᵢ = a₁b₁ + a₂b₂ + ... + aₙbₙ
Geometric interpretation:
a · b = ||a|| × ||b|| × cos(θ)
where θ is the angle between vectors.
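A quick numeric check of the geometric form, recovering the angle with arccos:
import numpy as np
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta))) # 45.0 degrees between the vectors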
# Dot product calculation
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot_product = np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
# Manual calculation
manual = sum([ai * bi for ai, bi in zip(a, b)])
# Using @ operator (Python 3.5+)
result = a @ b
Properties:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
k = 2
# Commutative
print(np.dot(a, b) == np.dot(b, a)) # True
# Distributive
print(np.dot(a, b + c) == np.dot(a, b) + np.dot(a, c)) # True
# Scalar multiplication
print(np.dot(k * a, b) == k * np.dot(a, b)) # True
# Orthogonality (perpendicular vectors)
v1 = np.array([1, 0])
v2 = np.array([0, 1])
print(np.dot(v1, v2)) # 0 (orthogonal)
Similarity Measure:
def cosine_similarity(a, b):
    """
    Measures similarity between vectors.
    Range: [-1, 1]
    1 = identical direction
    0 = orthogonal
    -1 = opposite direction
    """
    dot_prod = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_prod / (norm_a * norm_b)
# Example: Document similarity
doc1 = np.array([1, 2, 1, 0]) # Word frequencies
doc2 = np.array([2, 1, 0, 1])
doc3 = np.array([1, 2, 1, 0]) # Same as doc1
print(cosine_similarity(doc1, doc2)) # 0.667 (somewhat similar)
print(cosine_similarity(doc1, doc3)) # 1.0 (identical)
ML Applications:
# 1. Neural network forward pass
def forward_pass(X, W, b):
    """
    X: input features
    W: weight matrix
    b: bias vector
    """
    return np.dot(X, W) + b
# 2. Attention mechanism (simplified)
def attention_score(query, key):
    """Calculate attention between query and key."""
    return np.dot(query, key) / np.sqrt(len(query))
# 3. Recommendation system
def predict_rating(user_vector, item_vector):
    """Predict user rating for item."""
    return np.dot(user_vector, item_vector)
user = np.array([0.8, 0.2, 0.9]) # User preferences
movie = np.array([0.9, 0.1, 0.8]) # Movie features
rating = predict_rating(user, movie) # Predicted rating
Eigenvalues & Eigenvectors
Definition: For a square matrix A, if:
A × v = λ × v
Then v is an eigenvector and λ is the corresponding eigenvalue.
Intuition: Eigenvectors are special directions that only get scaled (not rotated) when transformed by the matrix.
import numpy as np
# Example matrix
A = np.array([
[4, 2],
[1, 3]
])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
# [5. 2.]
print("Eigenvectors:")
print(eigenvectors)
# [[ 0.89442719 -0.70710678]
# [ 0.4472136 0.70710678]]
# Verification: A × v = λ × v
v1 = eigenvectors[:, 0] # First eigenvector
lambda1 = eigenvalues[0] # First eigenvalue
Av = np.dot(A, v1)
lambda_v = lambda1 * v1
print("A × v:", Av)
print("λ × v:", lambda_v)
# They should be equal (within floating point error)
Properties:
# 1. Trace = sum of eigenvalues
trace_A = np.trace(A)
sum_eigenvalues = np.sum(eigenvalues)
print(f"Trace: {trace_A}, Sum of eigenvalues: {sum_eigenvalues}")
# 2. Determinant = product of eigenvalues
det_A = np.linalg.det(A)
prod_eigenvalues = np.prod(eigenvalues)
print(f"Det: {det_A}, Product of eigenvalues: {prod_eigenvalues}")
# 3. For symmetric matrix, eigenvectors are orthogonal
S = np.array([[2, 1], [1, 2]])
evals, evecs = np.linalg.eig(S)
v1, v2 = evecs[:, 0], evecs[:, 1]
print(f"Dot product: {np.dot(v1, v2)}") # ≈ 0 (orthogonal)
Principal Component Analysis (PCA):
def pca(X, n_components=2):
    """
    Principal Component Analysis using eigendecomposition.
    Args:
        X: Data matrix (n_samples × n_features)
        n_components: Number of principal components
    Returns:
        Transformed data and explained variance ratios
    """
    # Center the data
    X_centered = X - np.mean(X, axis=0)
    # Compute covariance matrix
    cov_matrix = np.cov(X_centered.T)
    # Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    # Sort by eigenvalues (descending)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Select top n_components
    top_eigenvectors = eigenvectors[:, :n_components]
    # Transform data
    X_transformed = np.dot(X_centered, top_eigenvectors)
    # Explained variance
    explained_var = eigenvalues[:n_components] / np.sum(eigenvalues)
    return X_transformed, explained_var
# Example usage
X = np.array([
[2.5, 2.4],
[0.5, 0.7],
[2.2, 2.9],
[1.9, 2.2],
[3.1, 3.0]
])
X_pca, explained_var = pca(X, n_components=2)
print("Explained variance:", explained_var)
print("Transformed data:")
print(X_pca)
Graph Analysis:
def compute_pagerank(adjacency_matrix, damping=0.85, max_iter=100):
    """
    PageRank using power iteration (related to eigenvectors).
    The principal eigenvector of the transition matrix
    represents the steady-state probability distribution.
    """
    n = len(adjacency_matrix)
    # Create transition matrix
    out_degree = adjacency_matrix.sum(axis=1)
    out_degree[out_degree == 0] = 1  # Avoid division by zero
    transition = adjacency_matrix / out_degree[:, np.newaxis]
    # Add damping factor
    transition = damping * transition + (1 - damping) / n
    # Power iteration to find principal eigenvector
    rank = np.ones(n) / n
    for _ in range(max_iter):
        rank_new = transition.T @ rank
        if np.allclose(rank, rank_new):
            break
        rank = rank_new
    return rank
# Example: Simple web graph
# Page 0 → Page 1, Page 1 → Page 2, Page 2 → Page 0
adjacency = np.array([
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]
])
pagerank = compute_pagerank(adjacency)
print("PageRank scores:", pagerank)
Spectral Clustering:
def spectral_clustering(similarity_matrix, n_clusters=2):
    """
    Clustering using eigenvectors of graph Laplacian.
    """
    # Compute degree matrix
    D = np.diag(similarity_matrix.sum(axis=1))
    # Compute graph Laplacian
    L = D - similarity_matrix
    # Compute eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(L)
    # Sort by eigenvalues
    idx = eigenvalues.argsort()
    eigenvectors = eigenvectors[:, idx]
    # Use first n_clusters eigenvectors
    features = eigenvectors[:, :n_clusters]
    # Apply k-means on these features
    # (simplified, would use actual k-means)
    return features
Matrix Inverse
Definition: For matrix A, its inverse A⁻¹ satisfies:
A × A⁻¹ = A⁻¹ × A = I
Requirements: A must be square, and its determinant must be non-zero (otherwise the matrix is singular and has no inverse).
# Computing inverse
A = np.array([
[4, 7],
[2, 6]
])
A_inv = np.linalg.inv(A)
print("A:")
print(A)
print("\nA inverse:")
print(A_inv)
# Verification: A × A⁻¹ = I
identity = np.dot(A, A_inv)
print("\nA × A⁻¹:")
print(identity)
# [[1. 0.]
# [0. 1.]]
2×2 Matrix Inverse (Manual):
def inverse_2x2(A):
    """
    For 2×2 matrix:
    A = [[a, b],
         [c, d]]
    A⁻¹ = (1/det) × [[ d, -b],
                     [-c, a]]
    """
    a, b = A[0, 0], A[0, 1]
    c, d = A[1, 0], A[1, 1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("Matrix is singular (not invertible)")
    return (1 / det) * np.array([
        [ d, -b],
        [-c,  a]
    ])
# Example
A = np.array([[4, 7], [2, 6]])
A_inv_manual = inverse_2x2(A)
print(np.allclose(A_inv_manual, A_inv)) # True
Solving Linear Systems:
# Solve A × x = b
A = np.array([
[3, 1],
[1, 2]
])
b = np.array([9, 8])
# Method 1: Using inverse (not recommended for large systems)
x = np.dot(np.linalg.inv(A), b)
print("Solution using inverse:", x)
# Method 2: Using solve (more efficient and stable)
x = np.linalg.solve(A, b)
print("Solution using solve:", x)
# Verification
print("Verification A × x:", np.dot(A, x))
print("Should equal b:", b)
ML Application – Linear Regression (Normal Equation):
def linear_regression_normal_equation(X, y):
    """
    Solve linear regression analytically:
    θ = (X^T × X)^(-1) × X^T × y
    Args:
        X: Feature matrix (m × n)
        y: Target vector (m,)
    Returns:
        θ: Optimal parameters (n+1,), including the bias term
    """
    # Add bias term (column of 1s)
    X_bias = np.c_[np.ones(len(X)), X]
    # Compute (X^T × X)^(-1) × X^T × y
    XTX = np.dot(X_bias.T, X_bias)
    XTX_inv = np.linalg.inv(XTX)
    XTy = np.dot(X_bias.T, y)
    theta = np.dot(XTX_inv, XTy)
    return theta
# Example: House price prediction
X = np.array([
[1000], # sqft
[1500],
[2000],
[2500]
])
y = np.array([200000, 250000, 300000, 350000]) # prices
theta = linear_regression_normal_equation(X, y)
print("Intercept:", theta[0])
print("Coefficient:", theta[1])
# Prediction
new_house = np.array([[1800]])
X_new = np.c_[np.ones(len(new_house)), new_house]
price_pred = np.dot(X_new, theta)
print(f"Predicted price for 1800 sqft: ${price_pred[0]:,.0f}")
Pseudoinverse (for non-square matrices):
# Moore-Penrose pseudoinverse
A = np.array([
[1, 2],
[3, 4],
[5, 6]
]) # 3×2 (not square)
A_pinv = np.linalg.pinv(A)
print("Pseudoinverse shape:", A_pinv.shape) # 2×3
# Properties
print("\nA × A⁺ × A ≈ A:")
print(np.allclose(A, A @ A_pinv @ A)) # True
print("\nA⁺ × A × A⁺ ≈ A⁺:")
print(np.allclose(A_pinv, A_pinv @ A @ A_pinv)) # True
When to Use What:
# For solving A × x = b:
# 1. Small system, A is square and invertible
# → Use np.linalg.solve() (not inv!)
x = np.linalg.solve(A, b)
# 2. Overdetermined system (more equations than unknowns)
# → Use least squares
x = np.linalg.lstsq(A, b, rcond=None)[0]
# 3. Underdetermined system (more unknowns than equations)
# → Use pseudoinverse for minimum norm solution
x = np.linalg.pinv(A) @ b
# 4. Very large system
# → Use iterative methods (gradient descent)
Mean
Definition: Sum of all values divided by count.
Formula:
μ = (Σ xᵢ) / n
import numpy as np
# Sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Mean
mean = np.mean(data)
print(f"Mean: {mean}") # 5.0
# Manual calculation
manual_mean = sum(data) / len(data)
# Weighted mean
values = np.array([85, 90, 78])
weights = np.array([0.2, 0.3, 0.5]) # Exam weights
weighted_mean = np.average(values, weights=weights)
print(f"Weighted mean: {weighted_mean}") # 83.0
ML Application:
# Feature scaling using mean
def standardize(X):
    """Z-score normalization."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
# Example
features = np.array([
[100, 2.5],
[150, 3.0],
[200, 3.5]
])
scaled = standardize(features)
print("Standardized features:")
print(scaled)
print("Mean after scaling:", np.mean(scaled, axis=0)) # [0, 0]
Median
Definition: Middle value when data is sorted.
data = np.array([1, 3, 3, 6, 7, 8, 9])
median = np.median(data)
print(f"Median: {median}") # 6.0
# For even number of elements
data_even = np.array([1, 2, 3, 4, 5, 6])
median_even = np.median(data_even)
print(f"Median (even): {median_even}") # 3.5 (average of 3 and 4)
# Manual calculation
def calculate_median(data):
    sorted_data = np.sort(data)
    n = len(sorted_data)
    if n % 2 == 1:
        return sorted_data[n // 2]
    else:
        mid1 = sorted_data[n // 2 - 1]
        mid2 = sorted_data[n // 2]
        return (mid1 + mid2) / 2
# Median is robust to outliers
data_with_outlier = np.array([1, 2, 3, 4, 1000])
print(f"Mean with outlier: {np.mean(data_with_outlier)}") # 202.0
print(f"Median with outlier: {np.median(data_with_outlier)}") # 3.0
Variance & Standard Deviation
Variance: Average squared deviation from the mean.
Formula:
σ² = Σ(xᵢ - μ)² / n (population)
s² = Σ(xᵢ - x̄)² / (n-1) (sample)
Standard Deviation: Square root of variance.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Variance
variance_pop = np.var(data) # Population variance (ddof=0)
variance_sample = np.var(data, ddof=1) # Sample variance (ddof=1)
print(f"Population variance: {variance_pop}") # 4.0
print(f"Sample variance: {variance_sample}") # ≈ 4.571
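Standard deviation follows directly from the variance above (a short continuation of the same example):
std_pop = np.std(data) # √4.0 = 2.0
std_sample = np.std(data, ddof=1) # √4.571 ≈ 2.14
print(f"Population std: {std_pop}")
print(f"Sample std: {std_sample}")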