In Module 15, we enter the world of Data Science. Python is a leading language in this field because its libraries let you turn large amounts of “raw” data into actionable insights and clear visualizations.
NumPy is the foundational package for scientific computing. It introduces Arrays, which are much faster and more memory-efficient than standard Python lists for mathematical operations.
Example:
Python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr * 2)  # Multiplies every element: [ 2  4  6  8 10]
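To make the difference from plain lists concrete, here is a minimal sketch comparing the two. The variable names and sample values are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4, 5]  # illustrative sample values

# With a plain list, `* 2` repeats the list instead of doing math
repeated = values * 2

# Element-wise math on a list needs an explicit loop or comprehension
doubled_list = [v * 2 for v in values]

# A NumPy array applies the operation to every element at once
arr = np.array(values)
doubled_arr = arr * 2

# Common aggregations are built in
total = arr.sum()
average = arr.mean()

print(repeated)              # the list twice in a row, ten items
print(doubled_list)          # [2, 4, 6, 8, 10]
print(doubled_arr.tolist())  # [2, 4, 6, 8, 10]
print(total, average)        # 15 3.0
```

The array version expresses the whole calculation as one operation, which is where the speed advantage on large data comes from.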
Pandas is built on top of NumPy and provides the DataFrame—a 2D table structure that looks like an Excel spreadsheet. It is the go-to tool for data manipulation.
Example:
Python
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Score': [85, 92]}
df = pd.DataFrame(data)
print(df.head())
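As a rough sketch of what "data manipulation" looks like in practice, the example below reuses the same Name/Score table to select a column, filter rows, and add a derived column. The pass mark of 90 and the 'Passed' column name are assumptions made up for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob'], 'Score': [85, 92]}
df = pd.DataFrame(data)

# Select a single column (returned as a pandas Series)
scores = df['Score']

# Filter rows with a boolean condition
high_scorers = df[df['Score'] > 90]

# Add a derived column, computed for every row at once
# (the pass mark of 90 is a hypothetical threshold)
df['Passed'] = df['Score'] >= 90

# The column's values are also available as a NumPy array
score_array = df['Score'].to_numpy()

print(df.shape)                       # (2, 3)
print(high_scorers['Name'].tolist())  # ['Bob']
print(score_array.mean())             # 88.5
```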
Real-world data is “messy.” It has missing values, duplicates, and errors. Data cleaning involves:
Missing values: df.dropna() (remove) or df.fillna() (replace).
Duplicates: df.drop_duplicates().
The next step is reshaping your data to make it useful.
Grouping: groupby().
Example:
Python
# Grouping data by 'Category' and finding the average 'Price'
# (assumes df has 'Category' and 'Price' columns)
avg_prices = df.groupby('Category')['Price'].mean()
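The cleaning calls and the grouping step can be tried together on a small table. The sketch below is one possible way to combine them; the data is invented, and filling missing prices with the column mean is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one missing price and one exact duplicate row
raw = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Dairy', 'Dairy', 'Dairy'],
    'Price':    [1.0,     2.0,     4.0,     np.nan,  4.0],
})

# Drop exact duplicate rows (the second 'Dairy', 4.0 row goes away)
deduped = raw.drop_duplicates()

# Option A: remove rows that contain missing values
dropped = deduped.dropna()

# Option B: replace missing prices with the column mean instead
filled = deduped.fillna({'Price': deduped['Price'].mean()})

# Average price per category on the cleaned data
avg_prices = dropped.groupby('Category')['Price'].mean()
print(avg_prices)
```

Whether to drop or fill depends on the data: dropping loses rows, while filling keeps them at the cost of inventing a value.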
Pandas makes it incredibly easy to move data between Python and your local files.
Reading: df = pd.read_csv('file.csv') or pd.read_excel('file.xlsx').
Writing: df.to_csv('new_file.csv', index=False).
EDA (exploratory data analysis) is the process of “getting to know” your data before applying Machine Learning. You look for patterns, outliers, and correlations using methods like df.describe() (summary statistics) and df.info() (column types and non-null counts).
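A short round trip shows the file calls and the first EDA steps working together. This is a sketch under assumptions: the table contents and the file name 'scores.csv' are made up, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cara'], 'Score': [85, 92, 78]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scores.csv')  # hypothetical file name

    # index=False keeps the row index out of the file
    df.to_csv(path, index=False)

    # Read the file back into a new DataFrame
    loaded = pd.read_csv(path)

# First look at the data
loaded.info()                # column names, dtypes, non-null counts
summary = loaded.describe()  # count, mean, std, min, quartiles, max (numeric columns)
print(summary)
```

Without index=False, to_csv also writes the row index as an extra column, which then shows up as an unnamed column when the file is read back.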
Matplotlib and Seaborn are the primary libraries for Data Visualization.
Example:
Python
import matplotlib.pyplot as plt
import seaborn as sns
# Assumes df has 'Date' and 'Sales' columns
sns.lineplot(data=df, x='Date', y='Sales')
plt.title("Sales Trends Over Time")
plt.show()
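For a version that runs on its own, the sketch below builds a small sales table and draws the same kind of line chart with Matplotlib alone. The sales figures, the output file name, and the non-interactive "Agg" backend are all assumptions chosen so the script works without a display. Seaborn's lineplot draws onto Matplotlib axes too, so the title and label calls carry over.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily sales figures
df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'Sales': [120, 135, 128, 150, 162],
})

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(df['Date'], df['Sales'], marker='o')
ax.set_title("Sales Trends Over Time")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap

# Save to a file instead of opening a window (hypothetical file name)
out_path = os.path.join(tempfile.mkdtemp(), "sales_trend.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)
print("Saved chart to", out_path)
```

Saving with savefig instead of calling plt.show() is handy in scripts and on servers, where no window can be opened.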