In Unsupervised Learning, the model works with unlabeled data. There is no “target” variable to predict; instead, the goal is to discover hidden structures, groupings, or patterns within the data. This is divided into two main tasks: Clustering (grouping similar items) and Dimensionality Reduction (simplifying data).
Clustering is the process of partitioning a dataset into groups (clusters) so that items in the same group are more similar to each other than to those in other groups. It is used for customer segmentation, image compression, and anomaly detection.
K-Means is the most popular clustering algorithm. It partitions the data into $K$ groups by minimizing the squared distance between each data point and the center (centroid) of its assigned cluster.
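Written out, this "minimize distance to centroids" goal is the standard K-Means objective (often called inertia). Assuming clusters $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$:

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

The algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the assignments stop changing.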
Hierarchical Clustering, by contrast, does not produce a single flat partition. Instead, it builds a tree-like structure of nested groupings (a dendrogram), typically bottom-up: start with every point as its own cluster and repeatedly merge the closest pair.
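A minimal sketch of this bottom-up approach using scikit-learn's `AgglomerativeClustering` (the synthetic blob data here is purely illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative toy data: three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Agglomerative clustering starts with each point as its own
# cluster and merges the closest pairs until n_clusters remain;
# "ward" linkage merges the pair that least increases variance
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

print(sorted(set(labels)))  # [0, 1, 2]
```

Cutting the tree at a different height (a different `n_clusters`) gives a coarser or finer grouping without re-running the whole algorithm conceptually, which is the main appeal over flat methods.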
Unlike K-Means, DBSCAN groups points based on how “dense” an area is. This lets it find clusters of arbitrary shape and label isolated points as noise instead of forcing them into a cluster.
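A short sketch on the classic two-moons dataset, where K-Means struggles but density-based grouping works. The scaling step and `eps=0.3` follow scikit-learn's own clustering-comparison example; the data is synthetic and illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: non-spherical clusters
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples is how many
# neighbors a point needs for its region to count as "dense"
model = DBSCAN(eps=0.3, min_samples=5)
labels = model.fit_predict(X)

# Points labeled -1 are outliers ("noise")
n_clusters = len(set(labels) - {-1})
print(n_clusters)  # 2 (the two moons)
```

Note that DBSCAN never asks for the number of clusters; it is implied by `eps` and `min_samples`, which is exactly the trade-off summarized in the comparison table below.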
[Image comparison of K-Means vs DBSCAN on non-spherical data shapes]
Modern datasets often have hundreds of features (dimensions). High dimensionality can lead to the “Curse of Dimensionality,” where data becomes sparse and models become slow and prone to overfitting. Dimensionality reduction compresses this information while keeping the important parts.
PCA (Principal Component Analysis) is a linear technique that transforms a large set of variables into a smaller one that still contains most of the information, by projecting the data onto the directions of maximum variance.
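A minimal sketch using scikit-learn's `PCA` on the built-in 64-dimensional handwritten-digits dataset (chosen here just as a convenient example):

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Each digit image is an 8x8 grid flattened to 64 features
X, _ = load_digits(return_X_y=True)

# Project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (1797, 64) -> (1797, 2)
# Fraction of the original variance the 2 components retain
print(pca.explained_variance_ratio_.sum())
```

In practice you would inspect `explained_variance_ratio_` to choose how many components to keep, rather than hard-coding 2.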
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique designed specifically for visualization. It preserves local neighborhoods, so similar points land near each other in a 2-D or 3-D map, but the global distances between groups are not meaningful.
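A small sketch embedding a subset of the same digits data with scikit-learn's `TSNE` (the subset size and `perplexity=30` are illustrative choices, not fixed rules):

```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

# Subsample to keep the run fast; t-SNE scales poorly
X, _ = load_digits(return_X_y=True)
X = X[:500]

# perplexity roughly controls how many neighbors each
# point "sees" when preserving local structure
tsne = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (500, 2)
```

Unlike PCA, t-SNE has no `transform` for new points; the embedding is learned for this dataset only, which is why it is used for visualization rather than as a preprocessing step.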
| Feature | K-Means | DBSCAN | PCA |
| --- | --- | --- | --- |
| Type | Clustering | Clustering | Dim. Reduction |
| Cluster shape | Spherical/Circular | Any shape | N/A |
| Outliers | Forces them into a cluster | Labels them as “Noise” | N/A |
| Parameters | Must choose $K$ | Must choose radius ($\epsilon$) | Must choose # of components |
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Assuming 'X' is your clean dataset: an array of shape
# (n_samples, n_features)
# n_init=10 runs K-Means with 10 different centroid seeds
# and keeps the best result
model = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = model.fit_predict(X)

# Visualize the first two features, colored by cluster
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("K-Means Grouping")
plt.show()
```