In this final module of Machine Learning, we move into the “Heavy Hitters.” Ensemble learning techniques are the go-to choice for winning Kaggle competitions and are widely used in the industry for credit scoring, churn prediction, and fraud detection. They work on the principle that “many weak voices are smarter than one loud one.”
Ensemble learning is a technique that combines multiple individual models (often called “base learners” or “weak learners”) to create one superior “Strong Learner.”
The two main “strategies” for building an ensemble are “Bagging” (training base learners in parallel on random bootstrap samples and averaging their votes, as in Random Forests) and “Boosting” (training learners sequentially, with each new learner focusing on the mistakes of the previous ones, as in Gradient Boosting).
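To make the distinction concrete, here is a minimal sketch using scikit-learn; the synthetic dataset and hyperparameter values are placeholders chosen only for illustration, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only to illustrate the two strategies
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: many trees trained in parallel on bootstrap samples, votes averaged
# (BaggingClassifier's default base learner is a decision tree)
bagging = BaggingClassifier(n_estimators=100, random_state=42)

# Boosting: trees trained sequentially, each one correcting the previous ones' errors
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for name, model in [("Bagging", bagging), ("Boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name} mean accuracy: {scores.mean():.3f}")
```

Roughly speaking, bagging mainly reduces variance by averaging many independent trees, while boosting also attacks bias by letting each new tree correct the errors of the ones before it.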
While standard Gradient Boosting is powerful, it is slow to train on large datasets. The industry has moved toward three highly optimized libraries that handle massive datasets at far greater speed.
XGBoost (Extreme Gradient Boosting) was the library that revolutionized competitive data science.
Developed by Microsoft, LightGBM is designed for speed and low memory usage.
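LightGBM exposes a scikit-learn-style API, so a rough sketch looks like the following; the synthetic data and hyperparameter values are stand-ins, not tuned recommendations:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice X and y come from your own dataset
X, y = make_classification(n_samples=5000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# num_leaves is the main knob for LightGBM's leaf-wise tree growth
model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, num_leaves=31)
model.fit(X_train, y_train)

print(f"LightGBM Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```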
Developed by Yandex, CatBoost is the newest of the three and solves a very specific pain point: categorical data.
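Here is a minimal sketch of that pain point being solved; the tiny DataFrame and column names below are invented purely for illustration. The key detail is passing the raw string column through cat_features instead of one-hot encoding it yourself:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Tiny invented example: one raw string column, no manual encoding anywhere
df = pd.DataFrame({
    "city":    ["Moscow", "Berlin", "Berlin", "Moscow", "Paris", "Paris"],
    "age":     [25, 40, 31, 52, 29, 47],
    "churned": [0, 1, 0, 1, 0, 1],
})
X, y = df[["city", "age"]], df["churned"]

# Tell CatBoost which columns are categorical; it encodes them internally
model = CatBoostClassifier(iterations=50, learning_rate=0.1, cat_features=["city"], verbose=0)
model.fit(X, y)

print(model.predict(X))
```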
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Developer | Community (DMLC) | Microsoft | Yandex |
| Key Strength | Accuracy & versatility | Incredible speed | Handling categorical data |
| Tree Growth | Level-wise | Leaf-wise | Symmetric trees |
| Memory Usage | Medium | Low | Medium |
| Handling Categories | Manual encoding needed | Some auto-handling | Fully automatic |
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assuming X and y are prepared
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)

# Train
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
print(f"XGBoost Accuracy: {accuracy_score(y_test, predictions):.3f}")
```
As an advanced practitioner, you now have the tools to handle any tabular data challenge. You know that: