Feature Engineering (Very Important)

1. Feature Extraction

Feature extraction involves creating new, model-ready features from raw data that an algorithm cannot use directly.

  • Text Data: Converting raw text into numbers using TF-IDF (Term Frequency-Inverse Document Frequency) or Word Embeddings.
  • Image Data: Extracting edges, shapes, or textures using Convolutional Neural Networks (CNNs).
  • Datetime Data: A raw timestamp (e.g., 2023-12-25 08:30:00) is hard for a model to process. You extract (see the pandas sketch after this list):
    • Hour of day (to see if it’s morning/night).
    • Day of week (to see if it’s a weekend).
    • Is it a holiday? (1 or 0).
  • Example: In a GPS dataset, extracting “Distance from City Center” from raw Latitude and Longitude coordinates.
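
Below is a minimal pandas sketch of the datetime extraction described above; the `pickup_time` column name and the one-entry holiday list are illustrative assumptions, not part of any particular dataset.

```python
import pandas as pd

df = pd.DataFrame({"pickup_time": ["2023-12-25 08:30:00", "2023-12-29 22:15:00"]})
df["pickup_time"] = pd.to_datetime(df["pickup_time"])

df["hour_of_day"] = df["pickup_time"].dt.hour        # morning vs. night
df["day_of_week"] = df["pickup_time"].dt.dayofweek   # 0 = Monday, 6 = Sunday
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

holidays = pd.to_datetime(["2023-12-25"])            # assumed holiday calendar
df["is_holiday"] = df["pickup_time"].dt.normalize().isin(holidays).astype(int)

print(df)
```

The same pattern, parse once and then derive columns with the `.dt` accessor, works for any timestamp column.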

2. Feature Transformation

This involves mathematically changing the data to meet the assumptions of the model (e.g., making the distribution more “Normal” or scaling numbers).

  • Scaling & Normalization:
    • Standardization (Z-score): Centers data at mean 0 with a standard deviation of 1.
    • Min-Max Scaling: Squishes all values between 0 and 1. (Scaling of some kind is essential for distance-based models like KNN and SVM.)
  • Log Transformation: Used on skewed data (like income or house prices) to reduce the impact of extreme outliers.
  • Encoding Categorical Data:
    • One-Hot Encoding: Creates binary columns for categories (e.g., Color: Red $\rightarrow$ [1, 0, 0]).
    • Label Encoding: Assigns a number to each category (e.g., Low: 1, Med: 2, High: 3); best reserved for categories with a natural order.
  • Example: If you have “Income” ranging from $\$20,000$ to $\$2,000,000$, a Log Transformation helps the model not be overwhelmed by the multi-millionaires (see the sketch after this list).
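
The sketch below applies the three transformations to a tiny made-up table; the column names and values are assumptions for illustration, with scikit-learn's `StandardScaler` and `MinMaxScaler` standing in for whatever scaling tool you prefer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({
    "income": [20_000, 45_000, 80_000, 2_000_000],
    "color": ["Red", "Green", "Blue", "Red"],
})

# Log transformation tames the skew introduced by the multi-millionaire row.
df["log_income"] = np.log1p(df["income"])

# Standardization (z-score) and min-max scaling of the numeric column.
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# One-hot encoding of the categorical column.
df = pd.get_dummies(df, columns=["color"], prefix="color")

print(df)
```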

3. Feature Selection Techniques

Not all features are helpful. Some are redundant or just “noise.” Feature selection keeps only the most relevant variables; a brief scikit-learn sketch follows the table below.

| Method Type | Description | Examples |
| --- | --- | --- |
| Filter Methods | Statistical tests used before training. | Correlation Heatmaps, Chi-Square test |
| Wrapper Methods | Trains the model on different subsets of features. | Forward Selection, Backward Elimination |
| Embedded Methods | Feature selection happens during training. | Lasso (L1) Regularization, Random Forest Importance |
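
To make the three families concrete, here is a small scikit-learn sketch on synthetic data; the dataset, the choice of `f_classif` as the filter test, and the `k=4` / `n_features_to_select=4` settings are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# Filter: rank features with a univariate statistical test before any training.
filter_sel = SelectKBest(score_func=f_classif, k=4).fit(X, y)
print("Filter keeps:", list(X.columns[filter_sel.get_support()]))

# Wrapper: recursive feature elimination repeatedly refits a model on subsets.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("Wrapper keeps:", list(X.columns[wrapper_sel.get_support()]))

# Embedded: importances fall out of the training process itself.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("Embedded importances:", dict(zip(X.columns, forest.feature_importances_.round(3))))
```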

4. Handling Imbalanced Data

In many real-world problems (Fraud Detection, Rare Disease diagnosis), one class has 99% of the data and the other has 1%.

  • Undersampling: Deleting records from the majority class (risky, as you discard information).
  • Oversampling: Duplicating records from the minority class (simple, but the model can overfit to the repeated examples).
  • SMOTE (Synthetic Minority Over-sampling Technique): Instead of just duplicating data, it creates synthetic (fake but realistic) data points by interpolating between existing minority points and their nearest neighbors.
  • Example: In credit card fraud, if you only have 10 fraud cases, SMOTE creates 100 “simulated” fraud cases that look mathematically similar to the originals to help the model learn the pattern (see the sketch after this list).
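
Below is a minimal SMOTE sketch using the third-party imbalanced-learn package (`pip install imbalanced-learn`); the heavily imbalanced synthetic dataset is an assumed stand-in for real fraud data.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a synthetic dataset where ~2% of the rows belong to the minority class.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           weights=[0.98, 0.02], flip_y=0, random_state=0)
print("Before:", Counter(y))             # roughly 980 vs. 20

# SMOTE interpolates between minority points and their nearest neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))          # both classes are now the same size
```

Resampling should be applied only to the training split, never to the test data, so the evaluation still reflects the real class balance.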

5. Domain-Driven Features

These are features created based on specific industry knowledge rather than just math. They often provide the “breakthrough” in model accuracy.

  • E-commerce: Creating a “Return Rate” feature (Total Returns / Total Orders) instead of just looking at raw order counts (see the sketch after this list).
  • Finance: The “Debt-to-Income Ratio.” A bank doesn’t just care about how much you owe; they care about how much you owe relative to what you earn.
  • Health: “BMI” (Body Mass Index). It’s a derived feature from Height and Weight that provides more medical context than either value alone.
  • Example: In a taxi-hailing app, creating a “Rainy Hour” feature because domain knowledge says people book more rides when it’s raining.
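
As a sketch of how such ratio features are built, the snippet below computes a per-customer return rate and a debt-to-income ratio with pandas; the column names and the tiny order table are made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "returned":    [0, 1, 0, 0, 1],
    "debt":        [5_000, 5_000, 40_000, 40_000, 40_000],
    "income":      [60_000, 60_000, 50_000, 50_000, 50_000],
})

# E-commerce: return rate = total returns / total orders, per customer.
features = orders.groupby("customer_id").agg(
    return_rate=("returned", "mean"),
    debt=("debt", "first"),
    income=("income", "first"),
)

# Finance: debt-to-income ratio carries more signal than either raw value alone.
features["debt_to_income"] = features["debt"] / features["income"]
print(features)
```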

Summary Checklist for Feature Engineering

  1. Clean: Handle missing values and outliers.
  2. Scale: Ensure all numerical features are on a similar range.
  3. Encode: Convert text categories into numbers.
  4. Construct: Create new features from domain knowledge.
  5. Select: Drop features that are highly correlated or useless.
