Mathematics is the engine behind AI & ML.
While libraries hide the formulas, true understanding comes from knowing what the math is actually doing.
This module builds strong intuition so learners can design models, debug training issues, and explain concepts in interviews.
Linear Algebra is the language of data and models.
Almost everything in ML — datasets, weights, activations — is represented using vectors and matrices.
Scalar
A single numerical value.
Example: a learning rate such as 0.001.
Used to: represent learning rates, regularization parameters, loss values, and accuracy scores.
Vector
An ordered list of numbers representing features.
Example: [size, bedrooms, price]
In ML: each data point (sample) is a feature vector.
Matrix
A 2D collection of numbers (rows × columns).
Example: a 4 × 3 dataset of 4 houses, each described by 3 features.
In ML: a dataset is a matrix whose rows are samples and whose columns are features.
Matrix operations allow efficient computation on large datasets.
Matrix Addition
Element-wise addition; both matrices must have the same dimensions.
Matrix Multiplication
Combines rows of the first matrix with columns of the second.
Example:
Prediction = Data Matrix × Weight Matrix
This single operation computes predictions for thousands of samples at once.
Transpose
Flips rows and columns.
Dot Product
The dot product measures similarity between two vectors.
Formula:
a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
Interpretation: large positive values mean similar directions, zero means orthogonal (unrelated), negative means opposite directions.
ML Use: similarity search, attention scores, recommendations, and neural-network forward passes.
Eigenvalues & Eigenvectors
These represent important directions in data.
Simple idea: an eigenvector is a direction a matrix only stretches, never rotates; the eigenvalue is the stretch factor.
ML Use: PCA for dimensionality reduction, spectral clustering, and PageRank.
Matrix Inversion
Matrix inversion allows solving systems of equations.
Example: solving A × x = b as x = A⁻¹ × b.
Important Note: only square matrices with a non-zero determinant are invertible.
In practice: prefer np.linalg.solve() over computing the inverse explicitly; it is faster and numerically more stable.
ML is fundamentally about uncertainty and prediction.
Probability helps models make informed guesses, not exact answers.
Mean
The average value of the data.
Median
The middle value when sorted; robust to outliers.
Variance
The average squared deviation from the mean; measures spread.
ML Importance: these statistics drive feature scaling, outlier detection, and understanding of data spread.
Basic Rules
Probabilities lie between 0 and 1, and the probabilities of all possible outcomes sum to 1.
Conditional Probability
P(A|B): the probability of A given that B has occurred.
Used in: Bayes theorem, Naive Bayes classifiers, and language models, as sketched below.
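A minimal sketch of conditional probability computed from counts (the email numbers are invented for illustration):
# Hypothetical counts from 100 emails: 30 are spam,
# and 24 of those spam emails contain the word "free"
total_emails = 100
spam_emails = 30
spam_with_free = 24
p_spam = spam_emails / total_emails # P(spam) = 0.3
p_free_given_spam = spam_with_free / spam_emails # P("free" | spam) = 0.8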
Random Variables
A random variable maps outcomes to numbers.
Types: discrete (countable outcomes, like a coin flip) and continuous (real-valued, like height).
ML models learn distributions of random variables.
Distributions describe how data is spread.
Normal Distribution
The bell curve, defined by its mean and variance.
Used in: weight initialization, modeling noise, and many statistical tests.
Bernoulli & Binomial
Model a single binary outcome and the count of successes over repeated trials (sampled in the sketch below).
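A quick sketch of sampling from these distributions with NumPy (the seed is arbitrary):
import numpy as np
rng = np.random.default_rng(seed=42)
# Normal: mean 0, standard deviation 1
normal_samples = rng.normal(loc=0.0, scale=1.0, size=1000)
print(normal_samples.mean(), normal_samples.std()) # ≈ 0, ≈ 1
# Bernoulli: one trial with success probability p
bernoulli_sample = rng.binomial(n=1, p=0.3) # 0 or 1
# Binomial: number of successes in 10 trials
binomial_sample = rng.binomial(n=10, p=0.3) # 0 to 10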
Bayes Theorem
Bayes theorem updates beliefs using evidence.
Formula:
P(A|B) = P(B|A) * P(A) / P(B)
Intuition: start with a prior belief P(A), then revise it after observing evidence B.
ML Use: Naive Bayes classifiers and spam filtering, worked through below.
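A small worked example of the formula, reusing the illustrative spam numbers from above:
p_spam = 0.3 # prior: P(spam)
p_free_given_spam = 0.8 # likelihood: P("free" | spam)
p_free_given_ham = 0.1 # assumed: P("free" | not spam)
# Evidence: total probability of seeing "free"
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
# Posterior: P(spam | "free") by Bayes theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free) # ≈ 0.774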
Hypothesis Testing
Used to validate assumptions using data.
Steps: state a null hypothesis, collect data, compute a test statistic, and compare the p-value against a significance level.
ML Use: A/B testing and comparing model performance, as in the sketch below.
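As a sketch, a two-sample t-test comparing two models' cross-validation scores (assumes SciPy is available; the scores are made up):
import numpy as np
from scipy import stats
model_a = np.array([0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82])
model_b = np.array([0.78, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80, 0.79])
# Null hypothesis: both models have the same mean accuracy
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, reject the null: the difference is unlikely to be chance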
Covariance
Measures how two variables change together; its scale depends on the units.
Correlation
Normalized covariance, always in [-1, 1].
ML Use: feature selection, detecting redundant features, and building the covariance matrix behind PCA (see the sketch below).
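A minimal sketch using NumPy's built-ins (the house numbers are illustrative):
import numpy as np
sqft = np.array([1000, 1500, 2000, 2500, 3000])
price = np.array([200, 260, 310, 330, 400]) # in $1000s
# Covariance matrix: diagonal = variances, off-diagonal = covariance
print(np.cov(sqft, price)[0, 1]) # positive: they increase together
# Correlation matrix: normalized to [-1, 1]
print(np.corrcoef(sqft, price)[0, 1]) # ≈ 0.99: strong linear relationship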
Calculus enables learning in ML.
Without calculus, models cannot improve.
Derivatives
A derivative measures rate of change.
Example: if f(x) = x², then f'(x) = 2x.
ML Meaning: how much the loss changes when a parameter changes slightly (checked numerically below).
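A quick numerical check of this example using a central finite difference (a standard approximation, not a library call):
def f(x):
    return x ** 2
def derivative(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)
print(derivative(f, 3.0)) # ≈ 6.0, matching f'(x) = 2x at x = 3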
Partial Derivatives
Used when functions depend on multiple variables.
ML Use: a model has many parameters, and each one gets its own partial derivative of the loss.
Gradient
The gradient is the vector of all partial derivatives.
Interpretation: it points in the direction of steepest increase.
In ML: parameters are moved opposite to the gradient to reduce the loss, as sketched below.
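A sketch that computes a gradient numerically, one partial derivative at a time:
import numpy as np
def f(v):
    # f(x, y) = x² + 3y²; analytic gradient is [2x, 6y]
    return v[0] ** 2 + 3 * v[1] ** 2
def gradient(f, v, h=1e-5):
    grad = np.zeros_like(v)
    for i in range(len(v)):
        v_plus, v_minus = v.copy(), v.copy()
        v_plus[i] += h
        v_minus[i] -= h
        grad[i] = (f(v_plus) - f(v_minus)) / (2 * h)
    return grad
print(gradient(f, np.array([1.0, 2.0]))) # ≈ [2. 12.]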
Chain Rule
Allows differentiation of composed functions.
ML Use: backpropagation, which trains neural networks layer by layer.
Without chain rule: deep learning would not exist. A numeric illustration follows.
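A tiny numeric illustration of the chain rule on a composed function:
# g(x) = 3x and f(u) = u², so d/dx f(g(x)) = 2(3x) * 3 = 18x
def composed(x):
    return (3 * x) ** 2
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)
print(derivative(composed, 2.0)) # ≈ 36.0 = 18 * 2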
Optimization finds the best parameters.
Gradient Descent
Repeatedly update parameters in the direction that reduces the loss.
Learning Rate
The step size: too large and training diverges, too small and training crawls.
Loss Functions
Measure error between prediction and reality.
Examples: Mean Squared Error for regression, Cross-Entropy for classification.
Optimization minimizes loss.
Without math: ML is a black box.
With math: you can design, debug, and explain models. The sketch below ties these pieces together.
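Tying these pieces together, a minimal gradient-descent sketch that fits a one-parameter line by minimizing Mean Squared Error (toy data, not the module's dataset):
import numpy as np
# Toy data generated from y = 2x, so the true weight is 2
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = 0.0 # initial guess
learning_rate = 0.01
for step in range(200):
    y_pred = w * X # prediction
    loss = np.mean((y_pred - y) ** 2) # MSE loss
    grad = np.mean(2 * (y_pred - y) * X) # dLoss/dw
    w -= learning_rate * grad # step opposite the gradient
print(w) # ≈ 2.0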
Scalar
Definition: A single numerical value (0-dimensional).
import numpy as np
# Scalars in Python
learning_rate = 0.001
temperature = 37.5
count = 100
# In NumPy
scalar = np.array(5)
print(scalar.ndim) # 0 dimensions
Use in ML: Learning rates, regularization parameters, loss values, accuracy scores.
Vector
Definition: 1-dimensional array of numbers (ordered list).
Mathematical Notation:
v = [v₁, v₂, v₃, ..., vₙ]
Types: row vectors (1 × n) and column vectors (n × 1).
import numpy as np
# Creating vectors
row_vector = np.array([1, 2, 3, 4])
column_vector = np.array([[1], [2], [3], [4]])
# Alternative column vector
col_vec = np.array([1, 2, 3, 4]).reshape(-1, 1)
print(row_vector.shape) # (4,)
print(column_vector.shape) # (4, 1)
# Vector properties
length = len(row_vector) # 4
dimension = row_vector.ndim # 1
ML Examples:
# Feature vector (one data point)
house_features = np.array([1500, 3, 2, 2010]) # sqft, bedrooms, bathrooms, year
# [square_feet, num_bedrooms, num_bathrooms, year_built]
# Word embedding vector (represents a word)
word_embedding = np.array([0.2, -0.4, 0.7, 0.1, -0.3])
# Model predictions for multiple classes
probability_vector = np.array([0.1, 0.2, 0.05, 0.65]) # Class probabilities
Vector Operations:
# Vector addition
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v_sum = v1 + v2 # [5, 7, 9]
# Scalar multiplication
scalar = 2
v_scaled = scalar * v1 # [2, 4, 6]
# Element-wise multiplication
v_mult = v1 * v2 # [4, 10, 18]
# Vector magnitude (L2 norm)
magnitude = np.linalg.norm(v1) # √(1² + 2² + 3²) = √14
# Unit vector (normalized)
unit_vector = v1 / magnitude
Vector Norm (Magnitude):
# L2 norm (Euclidean distance)
v = np.array([3, 4])
l2_norm = np.linalg.norm(v) # √(3² + 4²) = 5
# L1 norm (Manhattan distance)
l1_norm = np.sum(np.abs(v)) # |3| + |4| = 7
# Used in regularization
def l2_regularization(weights, lambda_param=0.01):
    return lambda_param * np.linalg.norm(weights) ** 2
def l1_regularization(weights, lambda_param=0.01):
    return lambda_param * np.sum(np.abs(weights))
Matrix
Definition: 2-dimensional array of numbers (rows × columns).
Mathematical Notation:
    ⎡ a₁₁ a₁₂ a₁₃ ⎤
A = ⎢ a₂₁ a₂₂ a₂₃ ⎥
    ⎣ a₃₁ a₃₂ a₃₃ ⎦
import numpy as np
# Creating matrices
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(matrix.shape) # (3, 3) - 3 rows, 3 columns
print(matrix.ndim) # 2 dimensions
# Common matrix types
zeros = np.zeros((3, 4)) # All zeros
ones = np.ones((2, 3)) # All ones
identity = np.eye(3) # Identity matrix (diagonal 1s)
random = np.random.rand(3, 3) # Random values [0, 1)
# Identity matrix (I)
# ⎡ 1 0 0 ⎤
# ⎢ 0 1 0 ⎥
# ⎣ 0 0 1 ⎦
ML Examples:
# Dataset matrix (rows=samples, columns=features)
X = np.array([
[1500, 3, 2010], # House 1
[2000, 4, 2015], # House 2
[1200, 2, 2005], # House 3
[1800, 3, 2012] # House 4
])
# Shape: (4, 3) - 4 samples, 3 features
# Weight matrix in neural network
W = np.array([
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]
])
# Shape: (2, 3) - connects 3 inputs to 2 outputs
# Image as matrix
image = np.random.rand(28, 28) # 28×28 grayscale image
rgb_image = np.random.rand(28, 28, 3) # 28×28×3 color image
Matrix Properties:
matrix = np.array([[1, 2], [3, 4]])
# Transpose (flip rows and columns)
transposed = matrix.T
# [[1, 2], [[1, 3],
# [3, 4]] → [2, 4]]
# Diagonal
diagonal = np.diag(matrix) # [1, 4]
# Trace (sum of diagonal)
trace = np.trace(matrix) # 1 + 4 = 5
# Determinant
det = np.linalg.det(matrix) # 1*4 - 2*3 = -2
# Rank
rank = np.linalg.matrix_rank(matrix)
Matrix Addition
Rule: Matrices must have the same dimensions.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Addition
C = A + B
# [[1+5, 2+6], [[6, 8],
# [3+7, 4+8]] = [10, 12]]
# Subtraction
D = A - B
# [[-4, -4],
# [-4, -4]]
# Scalar multiplication
E = 2 * A
# [[2, 4],
# [6, 8]]
Matrix Multiplication
Rule: For A(m×n) × B(n×p), the number of columns in A must equal the number of rows in B. The result is (m×p).
Element calculation:
C[i,j] = Σ A[i,k] × B[k,j]
# Example 1: Basic multiplication
A = np.array([[1, 2],
[3, 4]]) # 2×2
B = np.array([[5, 6],
[7, 8]]) # 2×2
C = np.dot(A, B) # or A @ B
# [[1*5 + 2*7, 1*6 + 2*8], [[19, 22],
# [3*5 + 4*7, 3*6 + 4*8]] = [43, 50]]
# Example 2: Different dimensions
A = np.array([[1, 2, 3]]) # 1×3
B = np.array([[4], [5], [6]]) # 3×1
C = np.dot(A, B)
# [[1*4 + 2*5 + 3*6]] = [[32]] # 1×1 result
# Example 3: Neural network forward pass
X = np.array([[1, 2, 3]]) # 1×3 (input)
W = np.array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]]) # 3×2 (weights)
output = np.dot(X, W)
# [[1*0.1 + 2*0.3 + 3*0.5, 1*0.2 + 2*0.4 + 3*0.6]]
# = [[2.2, 2.8]] # 1×2 (output for 2 neurons)
ML Application – Batch Processing:
# Multiple samples (batch)
X = np.array([
[1, 2, 3], # Sample 1
[4, 5, 6], # Sample 2
[7, 8, 9] # Sample 3
]) # 3×3 (3 samples, 3 features)
W = np.array([
[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6]
]) # 3×2 (weights)
# Forward pass for entire batch
output = np.dot(X, W) # 3×2
# Each row is the output for one sample
print(output)
# [[2.2, 2.8],
# [4.9, 6.4],
# [7.6, 10.0]]
Element-wise vs Matrix Multiplication:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Element-wise (Hadamard product)
elementwise = A * B
# [[1*5, 2*6], [[5, 12],
# [3*7, 4*8]] = [21, 32]]
# Matrix multiplication
matrix_mult = np.dot(A, B) # or A @ B
# [[19, 22],
# [43, 50]]
# Linear transformation
A = np.array([
[2, 0],
[0, 3]
]) # 2×2 matrix
v = np.array([1, 2]) # 2D vector
result = np.dot(A, v)
# [2*1 + 0*2, = [2,
# 0*1 + 3*2] 6]
# ML Example: Linear layer
weights = np.array([
[0.5, -0.3, 0.8],
[0.2, 0.6, -0.4]
]) # 2×3
features = np.array([1.0, 2.0, 3.0]) # 3 features
output = np.dot(weights, features)
# [0.5*1 + (-0.3)*2 + 0.8*3, = [2.3,
# 0.2*1 + 0.6*2 + (-0.4)*3] 0.2]
Dot Product
Definition: Scalar result from multiplying corresponding elements and summing.
Formula:
a · b = Σ aᵢ × bᵢ = a₁b₁ + a₂b₂ + ... + aₙbₙ
Geometric interpretation:
a · b = ||a|| × ||b|| × cos(θ)
where θ is the angle between vectors.
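A quick numeric check of the geometric form, recovering the angle with arccos:
import numpy as np
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta))) # 45.0 degrees between the vectors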
# Dot product calculation
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot_product = np.dot(a, b) # 1*4 + 2*5 + 3*6 = 32
# Manual calculation
manual = sum([ai * bi for ai, bi in zip(a, b)])
# Using @ operator (Python 3.5+)
result = a @ b
Properties:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])
k = 2
# Commutative
print(np.dot(a, b) == np.dot(b, a)) # True
# Distributive
print(np.dot(a, b + c) == np.dot(a, b) + np.dot(a, c)) # True
# Scalar multiplication
print(np.dot(k * a, b) == k * np.dot(a, b)) # True
# Orthogonality (perpendicular vectors)
v1 = np.array([1, 0])
v2 = np.array([0, 1])
print(np.dot(v1, v2)) # 0 (orthogonal)
Similarity Measure:
def cosine_similarity(a, b):
    """
    Measures similarity between vectors.
    Range: [-1, 1]
    1 = identical direction
    0 = orthogonal
    -1 = opposite direction
    """
    dot_prod = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_prod / (norm_a * norm_b)
# Example: Document similarity
doc1 = np.array([1, 2, 1, 0]) # Word frequencies
doc2 = np.array([2, 1, 0, 1])
doc3 = np.array([1, 2, 1, 0]) # Same as doc1
print(cosine_similarity(doc1, doc2)) # 0.667 (somewhat similar)
print(cosine_similarity(doc1, doc3)) # 1.0 (identical)
ML Applications:
# 1. Neural network forward pass
def forward_pass(X, W, b):
    """
    X: input features
    W: weight matrix
    b: bias vector
    """
    return np.dot(X, W) + b
# 2. Attention mechanism (simplified)
def attention_score(query, key):
    """Calculate attention between query and key."""
    return np.dot(query, key) / np.sqrt(len(query))
# 3. Recommendation system
def predict_rating(user_vector, item_vector):
    """Predict user rating for item."""
    return np.dot(user_vector, item_vector)
user = np.array([0.8, 0.2, 0.9]) # User preferences
movie = np.array([0.9, 0.1, 0.8]) # Movie features
rating = predict_rating(user, movie) # Predicted rating
Eigenvalues & Eigenvectors
Definition: For a square matrix A, if:
A × v = λ × v
Then v is an eigenvector and λ is the corresponding eigenvalue.
Intuition: Eigenvectors are special directions that only get scaled (not rotated) when transformed by the matrix.
import numpy as np
# Example matrix
A = np.array([
[4, 2],
[1, 3]
])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
# [5. 2.]
print("Eigenvectors:")
print(eigenvectors)
# [[ 0.89442719 -0.70710678]
# [ 0.4472136 0.70710678]]
# Verification: A × v = λ × v
v1 = eigenvectors[:, 0] # First eigenvector
lambda1 = eigenvalues[0] # First eigenvalue
Av = np.dot(A, v1)
lambda_v = lambda1 * v1
print("A × v:", Av)
print("λ × v:", lambda_v)
# They should be equal (within floating point error)
Properties:
# 1. Trace = sum of eigenvalues
trace_A = np.trace(A)
sum_eigenvalues = np.sum(eigenvalues)
print(f"Trace: {trace_A}, Sum of eigenvalues: {sum_eigenvalues}")
# 2. Determinant = product of eigenvalues
det_A = np.linalg.det(A)
prod_eigenvalues = np.prod(eigenvalues)
print(f"Det: {det_A}, Product of eigenvalues: {prod_eigenvalues}")
# 3. For symmetric matrix, eigenvectors are orthogonal
S = np.array([[2, 1], [1, 2]])
evals, evecs = np.linalg.eig(S)
v1, v2 = evecs[:, 0], evecs[:, 1]
print(f"Dot product: {np.dot(v1, v2)}") # ≈ 0 (orthogonal)
Principal Component Analysis (PCA):
def pca(X, n_components=2):
    """
    Principal Component Analysis using eigendecomposition.
    Args:
        X: Data matrix (n_samples × n_features)
        n_components: Number of principal components
    Returns:
        Transformed data and explained variance ratios
    """
    # Center the data
    X_centered = X - np.mean(X, axis=0)
    # Compute covariance matrix
    cov_matrix = np.cov(X_centered.T)
    # Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    # Sort by eigenvalues (descending)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Select top n_components
    top_eigenvectors = eigenvectors[:, :n_components]
    # Transform data
    X_transformed = np.dot(X_centered, top_eigenvectors)
    # Explained variance
    explained_var = eigenvalues[:n_components] / np.sum(eigenvalues)
    return X_transformed, explained_var
# Example usage
X = np.array([
[2.5, 2.4],
[0.5, 0.7],
[2.2, 2.9],
[1.9, 2.2],
[3.1, 3.0]
])
X_pca, explained_var = pca(X, n_components=2)
print("Explained variance:", explained_var)
print("Transformed data:")
print(X_pca)
Graph Analysis:
def compute_pagerank(adjacency_matrix, damping=0.85, max_iter=100):
    """
    PageRank using power iteration (related to eigenvectors).
    The principal eigenvector of the transition matrix
    represents the steady-state probability distribution.
    """
    n = len(adjacency_matrix)
    # Create transition matrix
    out_degree = adjacency_matrix.sum(axis=1)
    out_degree[out_degree == 0] = 1  # Avoid division by zero
    transition = adjacency_matrix / out_degree[:, np.newaxis]
    # Add damping factor
    transition = damping * transition + (1 - damping) / n
    # Power iteration to find principal eigenvector
    rank = np.ones(n) / n
    for _ in range(max_iter):
        rank_new = transition.T @ rank
        if np.allclose(rank, rank_new):
            break
        rank = rank_new
    return rank
# Example: Simple web graph
# Page 0 → Page 1, Page 1 → Page 2, Page 2 → Page 0
adjacency = np.array([
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]
])
pagerank = compute_pagerank(adjacency)
print("PageRank scores:", pagerank)
Spectral Clustering:
def spectral_clustering(similarity_matrix, n_clusters=2):
    """
    Clustering using eigenvectors of graph Laplacian.
    """
    # Compute degree matrix
    D = np.diag(similarity_matrix.sum(axis=1))
    # Compute graph Laplacian
    L = D - similarity_matrix
    # Compute eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(L)
    # Sort by eigenvalues
    idx = eigenvalues.argsort()
    eigenvectors = eigenvectors[:, idx]
    # Use first n_clusters eigenvectors
    features = eigenvectors[:, :n_clusters]
    # Apply k-means on these features
    # (simplified, would use actual k-means)
    return features
Matrix Inverse
Definition: For matrix A, its inverse A⁻¹ satisfies:
A × A⁻¹ = A⁻¹ × A = I
Requirements: A must be square, and its determinant must be non-zero (otherwise the matrix is singular and has no inverse).
# Computing inverse
A = np.array([
[4, 7],
[2, 6]
])
A_inv = np.linalg.inv(A)
print("A:")
print(A)
print("\nA inverse:")
print(A_inv)
# Verification: A × A⁻¹ = I
identity = np.dot(A, A_inv)
print("\nA × A⁻¹:")
print(identity)
# [[1. 0.]
# [0. 1.]]
2×2 Matrix Inverse (Manual):
def inverse_2x2(A):
    """
    For 2×2 matrix:
    A = [[a, b],
         [c, d]]
    A⁻¹ = (1/det) × [[ d, -b],
                     [-c, a]]
    """
    a, b = A[0, 0], A[0, 1]
    c, d = A[1, 0], A[1, 1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("Matrix is singular (not invertible)")
    return (1 / det) * np.array([
        [ d, -b],
        [-c,  a]
    ])
# Example
A = np.array([[4, 7], [2, 6]])
A_inv_manual = inverse_2x2(A)
print(np.allclose(A_inv_manual, A_inv)) # True
Solving Linear Systems:
# Solve A × x = b
A = np.array([
[3, 1],
[1, 2]
])
b = np.array([9, 8])
# Method 1: Using inverse (not recommended for large systems)
x = np.dot(np.linalg.inv(A), b)
print("Solution using inverse:", x)
# Method 2: Using solve (more efficient and stable)
x = np.linalg.solve(A, b)
print("Solution using solve:", x)
# Verification
print("Verification A × x:", np.dot(A, x))
print("Should equal b:", b)
ML Application – Linear Regression (Normal Equation):
def linear_regression_normal_equation(X, y):
    """
    Solve linear regression analytically:
    θ = (X^T × X)^(-1) × X^T × y
    Args:
        X: Feature matrix (m × n)
        y: Target vector (m,)
    Returns:
        θ: Optimal parameters (n+1,), including the bias term
    """
    # Add bias term (column of 1s)
    X_bias = np.c_[np.ones(len(X)), X]
    # Compute (X^T × X)^(-1) × X^T × y
    XTX = np.dot(X_bias.T, X_bias)
    XTX_inv = np.linalg.inv(XTX)
    XTy = np.dot(X_bias.T, y)
    theta = np.dot(XTX_inv, XTy)
    return theta
# Example: House price prediction
X = np.array([
[1000], # sqft
[1500],
[2000],
[2500]
])
y = np.array([200000, 250000, 300000, 350000]) # prices
theta = linear_regression_normal_equation(X, y)
print("Intercept:", theta[0])
print("Coefficient:", theta[1])
# Prediction
new_house = np.array([[1800]])
X_new = np.c_[np.ones(len(new_house)), new_house]
price_pred = np.dot(X_new, theta)
print(f"Predicted price for 1800 sqft: ${price_pred[0]:,.0f}")
Pseudoinverse (for non-square matrices):
# Moore-Penrose pseudoinverse
A = np.array([
[1, 2],
[3, 4],
[5, 6]
]) # 3×2 (not square)
A_pinv = np.linalg.pinv(A)
print("Pseudoinverse shape:", A_pinv.shape) # 2×3
# Properties
print("\nA × A⁺ × A ≈ A:")
print(np.allclose(A, A @ A_pinv @ A)) # True
print("\nA⁺ × A × A⁺ ≈ A⁺:")
print(np.allclose(A_pinv, A_pinv @ A @ A_pinv)) # True
When to Use What:
# For solving A × x = b:
# 1. Small system, A is square and invertible
# → Use np.linalg.solve() (not inv!)
x = np.linalg.solve(A, b)
# 2. Overdetermined system (more equations than unknowns)
# → Use least squares
x = np.linalg.lstsq(A, b, rcond=None)[0]
# 3. Underdetermined system (more unknowns than equations)
# → Use pseudoinverse for minimum norm solution
x = np.linalg.pinv(A) @ b
# 4. Very large system
# → Use iterative methods (gradient descent)
Mean
Definition: Sum of all values divided by count.
Formula:
μ = (Σ xᵢ) / n
import numpy as np
# Sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Mean
mean = np.mean(data)
print(f"Mean: {mean}") # 5.0
# Manual calculation
manual_mean = sum(data) / len(data)
# Weighted mean
values = np.array([85, 90, 78])
weights = np.array([0.2, 0.3, 0.5]) # Exam weights
weighted_mean = np.average(values, weights=weights)
print(f"Weighted mean: {weighted_mean}") # 83.0
ML Application:
# Feature scaling using mean
def standardize(X):
    """Z-score normalization."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std
# Example
features = np.array([
[100, 2.5],
[150, 3.0],
[200, 3.5]
])
scaled = standardize(features)
print("Standardized features:")
print(scaled)
print("Mean after scaling:", np.mean(scaled, axis=0)) # [0, 0]
Median
Definition: Middle value when data is sorted.
data = np.array([1, 3, 3, 6, 7, 8, 9])
median = np.median(data)
print(f"Median: {median}") # 6.0
# For even number of elements
data_even = np.array([1, 2, 3, 4, 5, 6])
median_even = np.median(data_even)
print(f"Median (even): {median_even}") # 3.5 (average of 3 and 4)
# Manual calculation
def calculate_median(data):
    sorted_data = np.sort(data)
    n = len(sorted_data)
    if n % 2 == 1:
        return sorted_data[n // 2]
    else:
        mid1 = sorted_data[n // 2 - 1]
        mid2 = sorted_data[n // 2]
        return (mid1 + mid2) / 2
# Median is robust to outliers
data_with_outlier = np.array([1, 2, 3, 4, 1000])
print(f"Mean with outlier: {np.mean(data_with_outlier)}") # 202.0
print(f"Median with outlier: {np.median(data_with_outlier)}") # 3.0
Variance & Standard Deviation
Variance: Average squared deviation from the mean.
Formula:
σ² = Σ(xᵢ - μ)² / n (population)
s² = Σ(xᵢ - x̄)² / (n-1) (sample)
Standard Deviation: Square root of variance.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# Variance
variance_pop = np.var(data) # Population variance (ddof=0)
variance_sample = np.var(data, ddof=1) # Sample variance (ddof=1)
print(f"Population variance: {variance_pop}") # 4.0
print(f"Sample variance: {variance_sample}") # ≈ 4.571
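Standard deviation follows directly from the variance above (a short continuation of the same example):
std_pop = np.std(data) # √4.0 = 2.0
std_sample = np.std(data, ddof=1) # √4.571 ≈ 2.14
print(f"Population std: {std_pop}")
print(f"Sample std: {std_sample}")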