E-Commerce Customer Churn Prediction
Built a machine learning-based customer churn prediction system for an e-commerce platform using a historical dataset of 3,270 customers. Linear Discriminant Analysis (LDA) was selected as the best model after benchmarking 23 algorithms, addressing class imbalance with SMOTE, and tuning hyperparameters.
Detailed Insights
Business Problem & Data Understanding
E-commerce is a highly competitive industry in which customer churn causes significant revenue loss. The primary goal is to identify likely churners early. After cleaning, the dataset contains 3,270 records with a pronounced class imbalance (a churn ratio of 1:5).
Data Preprocessing & Pipeline
Missing values were imputed with the median. Categorical features were encoded with OneHotEncoder and BinaryEncoder. RobustScaler was used to scale numerical features with skewed distributions and outliers, and SelectKBest was used to keep the most informative features.
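The preprocessing steps described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual pipeline: the column names, the toy data, the value of k, and the use of f_classif as the SelectKBest scoring function are all assumptions, since the write-up does not state them. BinaryEncoder (from the separate category_encoders package) is left out so the sketch depends only on scikit-learn, pandas, and NumPy.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical column names; the real dataset's schema is not given here.
numeric_cols = ["tenure", "order_count"]
categorical_cols = ["payment_mode"]

# Numeric branch: median imputation, then scaling by median and IQR,
# which is less sensitive to outliers than mean/std scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    # k=3 and f_classif are assumptions made for this sketch.
    ("select", SelectKBest(score_func=f_classif, k=3)),
])

# Tiny made-up frame, only to show that the pipeline runs end to end.
X = pd.DataFrame({
    "tenure": [1.0, np.nan, 12.0, 30.0, 2.0, 24.0],
    "order_count": [1.0, 2.0, np.nan, 15.0, 1.0, 9.0],
    "payment_mode": ["card", "cod", "card", "wallet", "cod", "card"],
})
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = churned (made-up labels)

X_selected = pipeline.fit_transform(X, y)
print(X_selected.shape)
```

Wrapping the steps in a single Pipeline means the imputation medians, scaling statistics, and feature scores are all learned from the training data only, which avoids leaking test-set information when the pipeline is later used inside cross-validation.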
Model Benchmarking & Handling Imbalance
Twenty-three classification algorithms were evaluated across 92 combinations using GridSearchCV. To address the class imbalance, SMOTE, NCR (Neighbourhood Cleaning Rule), and penalized models were tested. Linear Discriminant Analysis (LDA) with SMOTE resampling achieved the highest FBeta Score, 0.8025.
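To make the selection metric concrete, the sketch below computes the FBeta score from precision and recall using the standard formula, with β = 5 as reported in the key results. The precision and recall values are made up for illustration and are not taken from the project.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily than precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models with mirrored precision and recall.
recall_heavy = fbeta(precision=0.5, recall=0.9, beta=5)
precision_heavy = fbeta(precision=0.9, recall=0.5, beta=5)

# With beta = 1 (the usual F1 score) the two models are indistinguishable.
f1_a = fbeta(precision=0.5, recall=0.9, beta=1)
f1_b = fbeta(precision=0.9, recall=0.5, beta=1)

print(f"beta=5: {recall_heavy:.3f} vs {precision_heavy:.3f}")
print(f"beta=1: {f1_a:.3f} vs {f1_b:.3f}")
```

With β = 5 the recall-heavy model scores much higher, which fits a churn setting where missing a churner is treated as costlier than contacting a customer who would have stayed.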
Evaluation & Cost-Benefit Analysis
The model was evaluated on a held-out test set of 654 customers, achieving a Recall of 79.8% and an FBeta Score of 0.77. The gap between train and test PR-AUC was small (0.0441), suggesting no substantial overfitting. In financial terms, the model retains Rp 118.5M in Customer Lifetime Value, yielding a net benefit of Rp 69.3M and saving Rp 217.8M compared to the 'no model' baseline.
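The financial figures above can be related to each other with simple arithmetic. The sketch below rests on two assumptions that the write-up does not state explicitly: that net benefit equals retained Customer Lifetime Value minus the total cost of acting on the model's predictions, and that the saving is the difference between the net outcome with the model and the net outcome without it. The derived values are implied by those assumptions, not reported results.

```python
# Figures quoted in the text, in rupiah.
retained_clv = 118_500_000
net_benefit = 69_300_000
savings_vs_no_model = 217_800_000

# Assumption: net_benefit = retained_clv - total cost of acting on predictions
# (for example retention offers, plus value lost to missed churners).
implied_cost = retained_clv - net_benefit

# Assumption: savings = net outcome with model - net outcome without model.
implied_no_model_outcome = net_benefit - savings_vs_no_model

print(f"Implied cost:             Rp {implied_cost:,}")
print(f"Implied no-model outcome: Rp {implied_no_model_outcome:,}")
```

Under these assumptions, the 'no model' baseline corresponds to a net loss, which is why the saving is larger than the net benefit itself.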
Key Results
- Linear Discriminant Analysis as Best Model
- Recall: 79.8% on unseen data
- FBeta Score (β=5): 0.77
- Estimated Net Benefit: Rp 69,300,000