Classified: At-Risk — AI for Employee Turnover

Employee turnover is costly, disruptive, and often preventable. This project builds a recall-focused machine learning model to help HR teams identify at-risk employees early, before burnout or disengagement leads to resignation.

Project Links

Executive Summary & Non-Technical Guide (Web)

Python Exploratory Data Analysis

GitHub Repository

Think of this project as a user’s guide to the cranky crystal ball of machine learning. I used it to put my understanding of applied ML to the test: tuning models to catch as many at-risk employees as possible, weighing false alarms against missed warning signs, and turning uncertain predictions into clear recommendations HR could use. The report is mostly tech-heavy, but includes a guided, layperson-friendly section right after the Executive Summary—part intro to machine learning, part personal field journal—for those curious how the magic (i.e., math) happens.

Project Overview

The goal of this project was not just to maximize prediction accuracy, but to provide early, trustworthy signals HR teams could use to intervene before attrition occurs. Through Python-based exploratory data analysis (EDA) and predictive modeling, I developed a data pipeline to highlight at-risk employees and enable proactive retention strategies.

The project began with a deep dive into the data using Pandas, Matplotlib, and Seaborn to explore employee satisfaction, tenure, workload, and performance. Key EDA findings, such as the U-shaped relationship between tenure and churn, shaped my feature engineering and modeling priorities.
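For readers who want to see what that kind of check looks like, here is a minimal Pandas sketch, not the project's actual code. The column names `tenure` and `left`, the derived `tenure_sq_dist` feature, and every value in the toy table are assumptions invented for illustration; the numbers were chosen so a U shape is visible and do not come from the dataset.

```python
import pandas as pd

# Invented toy records (NOT the project's data): tenure in years, left = 1 if the employee quit.
toy = pd.DataFrame({
    "tenure": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
    "left":   [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
})

# Churn rate per tenure value: the mean of a 0/1 flag is the share of leavers.
churn_by_tenure = toy.groupby("tenure")["left"].mean()
print(churn_by_tenure)

# One possible feature-engineering response (an assumption, not the author's method):
# squared distance from the average tenure, which lets a linear model such as
# logistic regression represent "risky at both ends, safe in the middle".
toy["tenure_sq_dist"] = (toy["tenure"] - toy["tenure"].mean()) ** 2
print(toy[["tenure", "tenure_sq_dist"]].drop_duplicates())
```

Tree-based models can pick up a non-monotonic pattern like this on their own, so a derived feature of this kind mostly matters for the linear models in the comparison.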

I compared interpretable models (Logistic Regression, Decision Trees) with high-performing black-box models (Random Forest, XGBoost), and the final pipeline emphasized recall so that HR teams can identify as many potentially departing employees as possible. XGBoost emerged as the best-performing model under recall constraints, particularly at predicting the “gray zone” employees: those with middling satisfaction and tenure.
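To make the recall trade-off concrete, here is a small sketch in plain Python. It assumes the model outputs a churn probability per employee and that recall is raised by lowering the cutoff for flagging someone; whether the project tuned a threshold this way is my assumption. The helper `recall_precision`, the labels, and the scores are all hypothetical.

```python
def recall_precision(y_true, scores, threshold):
    """Return (recall, precision) when every score >= threshold is flagged as at-risk."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)      # leavers caught
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)  # leavers missed
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)      # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Invented example: 1 = employee left, with made-up predicted churn probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.92, 0.71, 0.44, 0.31, 0.55, 0.38, 0.33, 0.15, 0.10, 0.05]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  precision={precision:.2f}")
```

On these toy numbers, dropping the cutoff from 0.5 to 0.3 catches all four leavers instead of two, at the cost of two extra false alarms. That is the "missed warning signs versus false alarms" exchange in miniature.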

To demystify the model’s logic, I incorporated SHAP values for model explainability, offering HR a clear view into which features most influenced predictions. Burnout signals like high hours, high evaluation scores, and heavy workloads surfaced as key churn predictors.
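The idea behind SHAP can be shown without any ML library. SHAP values are built on Shapley values: each feature's share of the gap between one employee's prediction and a reference prediction, averaged over every order in which the features could be "switched on". The sketch below computes them exactly for a tiny hand-written scoring function. Everything in it is hypothetical (the `toy_risk` function, the feature names, the baseline employee), and it uses a single baseline rather than the background data the SHAP library works with, so treat it as an illustration of the concept and not of the project's XGBoost pipeline.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction, averaged over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the reference employee
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to this employee's value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_risk(x):
    """Hypothetical risk score: low satisfaction adds risk, and long hours
    combined with a high evaluation adds a burnout bump."""
    score = 0.1 + 0.4 * (1.0 - x["satisfaction"])
    if x["monthly_hours"] > 250 and x["evaluation"] > 0.8:
        score += 0.3
    return score

baseline = {"satisfaction": 0.6, "monthly_hours": 200, "evaluation": 0.7}
employee = {"satisfaction": 0.1, "monthly_hours": 280, "evaluation": 0.9}

phi = shapley_values(toy_risk, employee, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.2f}")
```

The contributions add up to the difference between this employee's score and the baseline score, which is the property that makes per-feature explanations like the SHAP plots in the gallery readable. Enumerating every ordering only works for a handful of features; the SHAP library exists because real models need faster algorithms.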

This project demonstrates my ability to combine domain knowledge, EDA, feature engineering, machine learning, and interpretability tools to generate business-ready insights. The outcome is a recall-focused model backed by actionable analysis, designed to help Salifort keep its people and stay ahead of turnover.

Gallery

SHAP summary dot plot showing how each feature impacts XGBoost churn predictions, with satisfaction and time spent emerging as the strongest drivers.

SHAP summary plot: Highlighting which features most strongly increase predicted churn risk. Used to surface actionable burnout and disengagement signals.

Scatterplot comparing employee satisfaction and average monthly hours, forming two main clusters: overworked dissatisfied leavers, and low-work disengaged employees.

Satisfaction vs. Average Monthly Hours: Two main clusters emerge—overworked, dissatisfied leavers and underutilized, disengaged employees.

Decision tree diagram illustrating churn predictions based on satisfaction, tenure, and workload thresholds.

Decision Tree: A step-by-step breakdown of how key features like satisfaction, tenure, and workload drive attrition predictions.

Side-by-side confusion matrices displaying how multiple models classified employees who stayed versus those who left.

Confusion Matrices: Comparing model performance in correctly identifying employees who stayed or left.

Histogram showing the predicted churn probabilities for misclassified employees, highlighting low and uncertain confidence ranges.

Predicted Probabilities: Distribution of model confidence for cases it misclassified, highlighting areas of uncertainty.

Bar chart ranking features by model importance, with satisfaction, tenure, and workload as top predictors of employee attrition.

Feature Importances: Ranking the most influential factors driving employee attrition in the final model.

References

Original dataset available on Kaggle. The data has been repurposed and adapted for this project as part of the Google Advanced Data Analytics Capstone.