
Data Collection & Data Cleaning

This module is where the actual work of a Data Scientist happens. There is a common saying in the industry: “80% of data science is cleaning data, and the other 20% is complaining about cleaning data.” Quality insights are impossible without quality data (Garbage In, Garbage Out).

1. Data Sources

Data rarely arrives perfectly formatted. You must know how to pull it from various “containers” (a short loading sketch follows this list):

  • CSV & Excel: The most common flat-file formats. Pandas makes these easy to load with read_csv() and read_excel().
  • APIs (JSON): Many modern services (like Weather or Stock data) provide data via REST APIs. You use the requests library to fetch JSON and convert it into a DataFrame.
  • Databases (SQL): In a corporate setting, data lives in SQL Server, PostgreSQL, or MySQL. You use libraries like SQLAlchemy to query these databases directly into Python.
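
Below is a minimal sketch of all three loading paths. The file name, API endpoint, and database connection string are placeholders, and the requests and SQLAlchemy packages (plus a database driver) are assumed to be installed.

Python

import pandas as pd
import requests
from sqlalchemy import create_engine

# Flat files: pandas loads CSV (and Excel, via read_excel) straight into a DataFrame.
sales = pd.read_csv("sales.csv")  # placeholder file name

# REST API: fetch JSON with requests, then flatten it into a table.
response = requests.get("https://api.example.com/weather")  # placeholder endpoint
weather = pd.json_normalize(response.json())

# SQL database: create an engine and let pandas run the query for you.
engine = create_engine("postgresql://user:password@localhost:5432/shop")  # placeholder connection string
orders = pd.read_sql("SELECT * FROM orders", engine)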

2. Web Scraping Basics

When data isn’t available via a file or an API, you “scrape” it from websites.

  • Tools: BeautifulSoup for parsing HTML and Selenium for scraping websites that require interaction (like clicking buttons).
  • Process: You download the HTML source code, find the specific tags (like <table> or <div>), and extract the text, as in the sketch below.
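
Here is a tiny BeautifulSoup sketch, assuming the requests and beautifulsoup4 packages are installed; the URL is a placeholder and the page is assumed to contain an HTML <table>.

Python

import requests
from bs4 import BeautifulSoup

# Step 1: download the raw HTML source of the (placeholder) page.
html = requests.get("https://example.com/prices").text

# Step 2: parse it and locate the tag we care about.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Step 3: extract the text from every cell in that table.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)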

3. Handling Missing Values

Real-world data is full of holes (NaNs). You have three main strategies:

  • Deletion: Remove the row or column. (Only do this if you have a massive dataset and the missing data is minimal).
  • Imputation (Mean/Median): Fill the holes with the average. This is common for numerical data.
  • Constant/Mode: Fill with a specific value like “Unknown” or the most frequent item.

Python

# Simple Imputation with Pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
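
The snippet above covers mean imputation; the other two strategies look like this on a small made-up DataFrame (the column names are illustrative):

Python

import numpy as np
import pandas as pd

# Made-up frame with gaps in a numeric and a categorical column.
df = pd.DataFrame({"Age": [25, np.nan, 40], "City": ["London", np.nan, "London"]})

# Constant/Mode: fill the categorical gap with the most frequent value.
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Deletion: drop any rows that still contain NaNs (only when the loss is small).
df = df.dropna()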

4. Handling Duplicates

Duplicate records can skew your analysis, making certain patterns look more frequent than they actually are.

  • Identification: df.duplicated() identifies rows where every column matches.
  • Removal: df.drop_duplicates() removes them. You must decide whether to keep the “first” or “last” occurrence, as sketched below.
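
A short sketch of both calls on a made-up DataFrame:

Python

import pandas as pd

# Made-up frame where the second row is an exact copy of the first.
df = pd.DataFrame({"name": ["Ana", "Ana", "Ben"], "score": [90, 90, 75]})

# Identification: a boolean Series marking rows whose values have all appeared before.
print(df.duplicated())

# Removal: keep the first occurrence and drop the rest (keep="last" would keep the final one).
deduped = df.drop_duplicates(keep="first")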

5. Outlier Detection

Outliers are data points that are significantly different from the rest of the observations (e.g., a person’s age listed as 200).

  • Z-Score: Measures how many standard deviations a point is from the mean; a common rule of thumb flags anything beyond 3.
  • IQR (Interquartile Range): The most common method. Any point below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$ is considered an outlier. Both checks are sketched below.
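
A minimal sketch of both checks on made-up ages (the 3-standard-deviation cutoff is a rule of thumb, not a fixed law):

Python

import pandas as pd

# Made-up ages with one impossible value.
ages = pd.Series([22, 25, 29, 31, 35, 200])

# IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[z.abs() > 3]

print(iqr_outliers)
print(z_outliers)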

6. Data Normalization & Scaling

Machine Learning models struggle when one feature has a range of 0–1 (like probability) and another has 0–1,000,000 (like house price). Scaling brings them to the same “level.”

  • Min-Max Scaling (Normalization): Rescales data to a fixed range, usually 0 to 1. $$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
  • Standardization (Z-score Scaling): Rescales data so it has a mean of 0 and a standard deviation of 1. $$z = \frac{x - \mu}{\sigma}$$ Both are shown in plain pandas below.
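
Both formulas translate directly into pandas; here is a sketch on two made-up columns (scikit-learn’s MinMaxScaler and StandardScaler classes provide these same transformations):

Python

import pandas as pd

# Made-up frame mixing a tiny-range and a huge-range feature.
df = pd.DataFrame({"probability": [0.1, 0.5, 0.9],
                   "price": [100_000, 400_000, 1_000_000]})

# Min-Max scaling: squeeze every column into the 0-1 range.
minmax = (df - df.min()) / (df.max() - df.min())

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (df - df.mean()) / df.std()

print(minmax)
print(standardized)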

7. Encoding Categorical Variables

Machine Learning models are mathematical; they cannot understand “Red,” “Green,” or “Blue.” We must convert text to numbers.

  • Label Encoding: Assigns a unique number to each category (e.g., Red=0, Green=1). Use this for Ordinal data (where order matters, like Small, Medium, Large).
  • One-Hot Encoding: Creates new columns for each category with 0s and 1s. Use this for Nominal data (where order doesn’t matter). Both encodings are sketched below.
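
A small sketch of both encodings on a made-up frame (the mapping dictionary for the ordinal column is chosen by hand):

Python

import pandas as pd

# Made-up frame with one ordinal and one nominal column.
df = pd.DataFrame({"size": ["Small", "Large", "Medium"],
                   "color": ["Red", "Green", "Red"]})

# Label encoding for the ordinal column: a hand-written map that preserves the order.
df["size_encoded"] = df["size"].map({"Small": 0, "Medium": 1, "Large": 2})

# One-hot encoding for the nominal column: one 0/1 column per color.
df = pd.get_dummies(df, columns=["color"])
print(df)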

8. Feature Transformation

Sometimes data needs to be mathematically transformed to reveal its true pattern or to fit the assumptions of a model (like a Normal Distribution).

  • Log Transformation: Useful for skewed data (like income) to compress high values and spread out low values.
  • Power Transformation: Squaring or taking the square root of a feature to handle non-linear relationships. A quick example of both transforms follows.
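
A short sketch of both transforms on made-up, right-skewed incomes, using NumPy:

Python

import numpy as np
import pandas as pd

# Made-up incomes with one very large value.
income = pd.Series([20_000, 35_000, 50_000, 1_200_000])

# Log transformation: log1p (log(1 + x)) compresses the huge value and is safe at zero.
log_income = np.log1p(income)

# Power transformation: the square root is a milder way to reduce the skew.
sqrt_income = np.sqrt(income)

print(log_income)
print(sqrt_income)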
