This module is where the actual work of a Data Scientist happens. There is a common saying in the industry: “80% of data science is cleaning data, and the other 20% is complaining about cleaning data.” Quality insights are impossible without quality data (Garbage In, Garbage Out).
Data rarely arrives perfectly formatted. You must know how to pull it from various “containers”:
- Files: read_csv() and read_excel() for flat files and spreadsheets.
- APIs: the requests library to fetch JSON and convert it into a DataFrame.
- Databases: SQLAlchemy to query SQL databases directly into Python.

When data isn’t available via a file or an API, you “scrape” it from websites.
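The file and API paths can be sketched as follows. For a self-contained example, the CSV is read from an in-memory string (a file path works identically), and the records mimic what requests’ .json() would return; the column names are illustrative:

```python
import io
import pandas as pd

# Reading a CSV (here from an in-memory string; a file path works the same way)
csv_text = "name,age\nAda,36\nGrace,45"
df_csv = pd.read_csv(io.StringIO(csv_text))

# An API response parsed with requests' .json() is typically a list of dicts,
# which pd.DataFrame accepts directly
api_records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]
df_api = pd.DataFrame(api_records)
```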
The main tools are BeautifulSoup for parsing HTML and Selenium for scraping websites that require interaction (like clicking buttons). The workflow is the same either way: fetch the page, locate the tags that hold your data (like <table> or <div>), and extract the text.

Real-world data is full of holes (NaNs). You have three main strategies: delete the incomplete rows, impute the missing values with a statistic, or interpolate them from neighboring values.

```python
# Simple Imputation with Pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
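A minimal sketch of three common strategies (deletion, imputation, interpolation) side by side, on a toy Age column:

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0, None, 50.0], name="Age")

dropped = s.dropna()             # 1. Deletion: discard rows with missing values
imputed = s.fillna(s.mean())     # 2. Imputation: fill gaps with a statistic (mean)
interpolated = s.interpolate()   # 3. Interpolation: estimate from neighboring values
```

Deletion is safe but loses rows; imputation keeps every row but flattens variance; interpolation works best when the data has a natural ordering (like time).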
Duplicate records can skew your analysis, making certain patterns look more frequent than they actually are.
- df.duplicated() identifies rows where every column matches.
- df.drop_duplicates() removes them. You must decide whether to keep the “first” or “last” occurrence.

Outliers are data points that are significantly different from the rest of the observations (e.g., a person’s age listed as 200).
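A sketch of both steps on a toy DataFrame. The outlier check uses the IQR (interquartile range) rule, one common convention not named in the text above; the data is made up:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Ada", "Bea", "Cal", "Dan"],
                   "age":  [30,    30,    35,    40,    200]})

# Duplicates: flag rows where every column matches, then drop them
dupes = df.duplicated()                    # True only for the second "Ada" row
deduped = df.drop_duplicates(keep="first") # keep the first occurrence

# Outliers: the IQR rule flags values far outside the middle 50% of the data
q1, q3 = deduped["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (deduped["age"] < q1 - 1.5 * iqr) | (deduped["age"] > q3 + 1.5 * iqr)
outliers = deduped["age"][mask]            # the age of 200 is flagged
```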
Machine Learning models struggle when one feature has a range of 0–1 (like probability) and another has 0–1,000,000 (like house price). Scaling brings them to the same “level.”
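The two most common scalers can be written with plain pandas arithmetic (libraries like scikit-learn offer MinMaxScaler and StandardScaler for the same job); the price values are made up:

```python
import pandas as pd

prices = pd.Series([100_000.0, 500_000.0, 1_000_000.0])

# Min-max scaling: squeeze values into the 0-1 range
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization (z-score): shift to mean 0, scale to standard deviation 1
zscores = (prices - prices.mean()) / prices.std()
```

After either transform, a house price and a probability live on comparable scales, so neither dominates a distance-based model.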
Machine Learning models are mathematical; they cannot understand “Red,” “Green,” or “Blue.” We must convert text to numbers.
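One-hot encoding is the standard conversion: each category becomes its own 0/1 column. A minimal sketch with pandas:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue"]})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])
```

For categories with a natural order (like “Small” < “Medium” < “Large”), mapping to integers directly can be preferable to one-hot columns.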
Sometimes data needs to be mathematically transformed to reveal its true pattern or to fit the assumptions of a model (like a Normal Distribution).
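A log transform is the classic example: it compresses long right tails (common in income or price data) toward something closer to a Normal shape. A sketch with made-up income values:

```python
import numpy as np
import pandas as pd

incomes = pd.Series([30_000, 60_000, 120_000, 2_000_000])

# log1p computes log(1 + x), so it also handles zeros safely
log_incomes = np.log1p(incomes)
```

The largest raw income is over 60x the smallest; after the transform the values sit within a factor of about 1.5 of each other, which is far friendlier to models that assume roughly symmetric inputs.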