In AI and Machine Learning, data is more important than algorithms.
A simple model trained on clean, meaningful data often outperforms a complex model trained on poor data.
This module focuses on understanding, preparing, and analyzing data so that ML models can learn correctly and efficiently.
A dataset is a collection of data points used to train, validate, and test machine learning models.
Example: a table of customer records, with the rows split into training, validation, and test subsets.
Numerical Variables: quantities on a numeric scale (e.g., age, income)
Categorical Variables: labels drawn from a fixed set of categories (e.g., gender, city)
Understanding variable types is crucial for choosing correct preprocessing techniques.
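As a small sketch of this distinction (the customer table and its columns are made up for illustration), pandas can separate numerical and categorical columns by dtype:

```python
import pandas as pd

# Hypothetical customer dataset: age and income are numerical, gender is categorical
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [30000.0, 52000.0, 81000.0],
    "gender": ["F", "M", "F"],
})

# Split column names by dtype so each group can get its own preprocessing
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(numerical)    # numerical columns
print(categorical)  # categorical columns
```

Numerical columns typically go through scaling, while categorical columns go through encoding, so detecting the type first keeps the pipeline correct.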
Structured data is organized in a fixed format, usually in tables.
Examples: relational database tables, spreadsheets, CSV files
Characteristics: rows and columns, a fixed schema, easy to search, filter, and aggregate
Example:
Customer data with columns like age, gender, income
Unstructured data has no predefined format.
Examples: free text, images, audio, video
Characteristics: no fixed schema, larger storage needs, usually requires feature extraction (e.g., NLP or computer vision) before modeling
Example:
Social media posts or medical images
Good models start with good data sources.
Databases
APIs
Web Scraping
Sensors & IoT
User-Generated Data
Poor data collection leads to biased or inaccurate models.
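As a minimal sketch of the "Databases" source above (the table name and schema are hypothetical, and an in-memory database stands in for a real one), Python's standard sqlite3 module can collect rows for a dataset:

```python
import sqlite3

# In-memory database standing in for a real customer database (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, gender TEXT, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(25, "F", 30000.0), (32, "M", 52000.0), (47, "F", 81000.0)],
)

# Pull the rows that will later become a training dataset
rows = conn.execute("SELECT age, gender, income FROM customers").fetchall()
print(len(rows))
conn.close()
```

The same pattern applies to other sources (APIs, sensor logs): query, fetch, then hand the rows to the cleaning steps below.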
Data cleaning is the process of fixing or removing incorrect, incomplete, or irrelevant data.
Clean data leads to stable and reliable ML models.
Missing data is unavoidable in real-world datasets.
1. Remove Rows or Columns: drop records or features when too many values are missing
2. Mean / Median Imputation: replace missing numerical values with the column mean or median
3. Mode Imputation: replace missing categorical values with the most frequent category
4. Forward / Backward Fill: copy the previous or next observed value, common for time series
5. Model-Based Imputation: predict missing values from the other features (e.g., KNN imputation)
Choice of method impacts model bias and accuracy.
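The first four strategies above can be sketched in pandas (the toy table and its missing values are made up for illustration):

```python
import pandas as pd

# Toy dataset with missing values (None becomes NaN)
df = pd.DataFrame({
    "age": [25.0, None, 47.0, 33.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Mean imputation for a numerical column
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Mode imputation for a categorical column
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill: propagate the previous observed value
df["age_ffill"] = df["age"].ffill()

# Removing rows that have any missing value
complete = df.dropna(subset=["age", "city"])
print(df["age_mean"].tolist())
```

Note how each choice changes the data differently: the mean fill pulls the missing age toward the column average, while dropping rows shrinks the dataset, which is why the choice of method impacts bias and accuracy.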
Outliers are extreme values that differ significantly from other observations.
1. Z-Score Method: flag points whose distance from the mean exceeds a threshold, commonly 2 to 3 standard deviations
2. IQR (Interquartile Range): flag points below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
3. Visualization: box plots and scatter plots make extreme values easy to spot
Outliers should be analyzed, not blindly removed.
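The Z-score and IQR methods above can be sketched with NumPy (the sample, with one obvious extreme value, is made up):

```python
import numpy as np

# Toy sample containing one obvious extreme value
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 120.0])

# Z-score method: distance from the mean in standard-deviation units
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 2]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)
```

Both methods flag the same point here, but on skewed data they can disagree, which is one reason flagged points deserve inspection rather than automatic removal.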
Many ML models, especially distance-based and gradient-based ones, perform better when features are on similar scales.
1. Normalization (Min-Max Scaling): x' = (x - min) / (max - min), rescaling values into [0, 1]
Used when: the data does not follow a normal distribution, or the algorithm is distance-based (e.g., KNN, neural networks)
2. Standardization (Z-score Scaling): z = (x - mean) / std, giving zero mean and unit variance
Used when: the data is roughly normally distributed, or the algorithm assumes standardized features (e.g., linear models, SVM, PCA)
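Both scalings can be written directly from their formulas (the toy feature values are made up):

```python
import numpy as np

ages = np.array([20.0, 30.0, 40.0, 50.0])  # toy feature

# Min-Max normalization: rescale values into [0, 1]
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: subtract the mean, divide by the standard deviation
standard = (ages - ages.mean()) / ages.std()

print(minmax)    # values in [0, 1]
print(standard)  # zero mean, unit variance
```

In practice the same transforms are available as scikit-learn's MinMaxScaler and StandardScaler, which also remember the training-set statistics so the test set is scaled consistently.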
ML models require numerical input, but real data often contains categories.
Label Encoding assigns each category an integer code.
Example: small = 0, medium = 1, large = 2
Problem: the integer codes imply an order that may not exist, which can mislead the model
Used when: the categories are ordinal (e.g., small < medium < large)
One-Hot Encoding creates one binary (0/1) column per category.
Example:
City → Delhi, Mumbai, Chennai each become a separate 0/1 column
Used when: the categories are nominal, with no natural order
Wrong encoding can mislead models and reduce accuracy.
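Both encodings can be sketched in pandas using the City example above (the integer codes come from simply sorting the category names, which is one common convention):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# Label encoding: map each category to an integer (this implies an order!)
codes = {c: i for i, c in enumerate(sorted(df["city"].unique()))}
df["city_label"] = df["city"].map(codes)

# One-hot encoding: one 0/1 column per category, no artificial order
onehot = pd.get_dummies(df["city"], prefix="city")

print(df["city_label"].tolist())
print(list(onehot.columns))
```

A model fed the label-encoded column could conclude Mumbai > Delhi > Chennai, which is meaningless for cities; the one-hot columns avoid that trap at the cost of more features.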
EDA is the process of understanding data before modeling.
Summary Statistics: mean, median, spread, and counts for each feature
Visualization: histograms, box plots, and scatter plots
Feature Relationships: correlations and group comparisons between features
EDA helps answer: What does the data look like? Are there missing values or outliers? Which features matter for the target?
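The first and third EDA steps can be sketched in pandas (the small dataset is made up for illustration):

```python
import pandas as pd

# Toy dataset to illustrate basic EDA
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [30000, 52000, 81000, 90000, 61000],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
stats = df.describe()

# Feature relationships: correlation between two numerical columns
corr = df["age"].corr(df["income"])

print(stats.loc["mean"].tolist())
print(corr)
```

A strong positive correlation here would suggest age is informative about income in this sample; visualization (e.g., df.plot.scatter) would be the natural next check before modeling.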