
Unsupervised Learning Algorithms

In Unsupervised Learning, the model works with unlabeled data. There is no “target” variable to predict; instead, the goal is to discover hidden structures, groupings, or patterns within the data. This is divided into two main tasks: Clustering (grouping similar items) and Dimensionality Reduction (simplifying data).

1. Clustering Basics

Clustering is the process of partitioning a dataset into groups (clusters) so that items in the same group are more similar to each other than to those in other groups. It is used for customer segmentation, image compression, and anomaly detection.

K-Means Clustering

K-Means is the most popular clustering algorithm. It groups data by minimizing the distance between data points and a central point (centroid).

  • How it works:
    1. Pick $K$ (number of clusters) and randomly place $K$ centroids.
    2. Assign each data point to the nearest centroid.
    3. Move the centroid to the center of all points assigned to it.
    4. Repeat until the centroids stop moving.
  • Weakness: You must decide the value of $K$ in advance (often using the “Elbow Method”). It also struggles with non-spherical shapes.
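The four steps above fit in a few lines of NumPy. This is a minimal sketch, not scikit-learn's implementation; the `kmeans` function and its arguments are our own names, and the sketch ignores edge cases such as a centroid losing all of its points.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    """Minimal K-Means sketch. Assumes X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs with `k=2`, this converges in a handful of iterations to one label per blob.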

Hierarchical Clustering

Instead of flat clusters, this creates a tree-like structure of groupings.

  • Agglomerative (Bottom-Up): Every point starts as its own cluster. The algorithm repeatedly merges the two closest clusters until only one big cluster remains.
  • Dendrogram: A visualization used to see these relationships. You can “cut” the tree at different heights to get different numbers of clusters.
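SciPy covers the whole bottom-up pipeline: `linkage` performs the agglomerative merges and `fcluster` "cuts" the resulting tree into flat clusters. A small sketch on toy data (three artificial blobs; the blob positions are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight, well-separated toy blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (10, 2)),
               rng.normal(4, 0.2, (10, 2)),
               rng.normal(8, 0.2, (10, 2))])

# Agglomerative bottom-up merging; 'ward' merges the pair of clusters
# that least increases within-cluster variance
Z = linkage(X, method='ward')

# "Cut" the tree so that exactly 3 flat clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
print(sorted(set(labels.tolist())))  # [1, 2, 3]
```

To draw the tree itself, pass `Z` to `scipy.cluster.hierarchy.dendrogram` and show it with matplotlib.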

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Unlike K-Means, DBSCAN groups points based on how “dense” an area is.

  • Core Concepts: It identifies “Core points” (points with many neighbors) and “Noise” (isolated points).
  • Pros: It can find clusters of any shape (like circles or crescents) and automatically identifies outliers (noise) that don’t belong to any group.
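A quick way to see both strengths is scikit-learn's `DBSCAN` on the classic "two moons" toy dataset, a crescent shape K-Means cannot separate. The `eps` value below is a hand-picked guess for this data, not a universal default:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaved crescents -- non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighborhood radius; min_samples = neighbors needed for a core point
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN gives outliers the special label -1 ("Noise")
print("clusters found:", len(set(labels.tolist()) - {-1}))
print("noise points:", int((labels == -1).sum()))
```

Note that we never told DBSCAN to find two clusters; the count falls out of the density structure.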

[Image comparison of K-Means vs DBSCAN on non-spherical data shapes]


2. Dimensionality Reduction

Modern datasets often have hundreds of features (dimensions). High dimensionality can lead to the “Curse of Dimensionality,” where models become slow and data becomes sparse. Dimensionality reduction compresses this information while keeping the important parts.

PCA (Principal Component Analysis)

PCA is a linear technique that transforms a large set of variables into a smaller one that still contains most of the information.

  • How it works: It finds the “Principal Components”—new axes that capture the maximum “variance” (spread) of the data.
  • Use Case: Reducing a 100-feature dataset down to 2 or 3 features so it can be plotted on a graph, or speeding up a Machine Learning model by removing redundant data.
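The 100-feature use case above can be sketched with scikit-learn's `PCA`. The dataset is synthetic: 100 features generated from only 3 underlying factors plus a little noise, so 3 principal components should recover nearly all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 100-feature data driven by 3 hidden factors (toy assumption)
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))            # 3 true underlying factors
mixing = rng.normal(size=(3, 100))            # spread them across 100 features
X = latent @ mixing + rng.normal(scale=0.1, size=(500, 100))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (500, 3)
# Fraction of the original variance the 3 components retain
print(pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` is the standard sanity check: if the retained fraction is high, the dropped dimensions were mostly redundant.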

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE is a non-linear technique specifically designed for visualization.

  • How it works: It tries to keep “similar” points close together in 2D or 3D space, even if the relationship in the original high-dimensional space was complex and curvy.
  • Difference from PCA: PCA is better for preserving the global structure of data, while t-SNE is much better at showing local clusters and sub-groups in highly complex data (like image or text data).
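scikit-learn's `TSNE` follows the same `fit_transform` pattern; here is a minimal sketch on toy high-dimensional blobs. `perplexity` roughly controls how many neighbors each point considers "similar"; the value and the blob layout below are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE

# Three well-separated blobs in 20 dimensions (toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 20)) for c in (0, 5, 10)])

# Embed into 2D for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (150, 2)
```

Unlike PCA, the resulting 2D coordinates have no global meaning (distances between far-apart clusters are not trustworthy); the embedding is for visualization only.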

3. Real-World Comparison

| Feature | K-Means | DBSCAN | PCA |
| --- | --- | --- | --- |
| Type | Clustering | Clustering | Dim. Reduction |
| Shape | Spherical/Circular | Any shape | N/A |
| Outliers | Forces them into a cluster | Labels them as “Noise” | N/A |
| Parameters | Must choose $K$ | Must choose Radius ($\epsilon$) | Must choose # of Components |

Python Example (K-Means)

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Assuming 'X' is your clean dataset
model = KMeans(n_clusters=3, random_state=42)
clusters = model.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("K-Means Grouping")
plt.show()
