Case Study · Credit Card Fraud Detection
We trained a fraud classifier on the ULB Credit Card Fraud dataset: 284,807 transactions, only 492 of them fraud (about 0.17%). This is a standard, widely used benchmark, which matters here: every number below can be independently checked by anyone who downloads the same public dataset.
After feature engineering — 9 engineered features added to the original 28 PCA components (rolling stats, interaction terms, log-transformed amount) for 38 features total — and a 40-trial Optuna search optimizing for PR-AUC, the final model scored:
The optimal threshold of 0.73 — selected via cost-sensitivity analysis rather than the default 0.50 — alone reduced projected cost from $1,990 to $1,850 on the test set. The default threshold is almost never the right threshold in fraud detection.
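A cost-based threshold sweep of this kind can be sketched in a few lines. This is a minimal illustration, not Sentinel's implementation: the per-error costs (`cost_fn`, `cost_fp`), the function names, and the toy scores are all hypothetical, chosen only to show why the cheapest threshold need not be 0.50.

```python
def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Cost of missed fraud (false negatives) plus false alarms (false positives)."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp


def best_threshold(y_true, scores, cost_fn, cost_fp, steps=100):
    """Return the lowest-cost threshold on an evenly spaced grid in (0, 1)."""
    candidates = [i / steps for i in range(1, steps)]
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, scores, t, cost_fn, cost_fp),
    )


# Toy data: 1 = fraud, 0 = legitimate. Costs are hypothetical placeholders.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.55, 0.60, 0.65, 0.70, 0.80, 0.90, 0.95]
COST_FN, COST_FP = 100, 10

default_cost = expected_cost(y_true, scores, 0.50, COST_FN, COST_FP)
t_best = best_threshold(y_true, scores, COST_FN, COST_FP)
best_cost = expected_cost(y_true, scores, t_best, COST_FN, COST_FP)
print(f"default 0.50 costs {default_cost}; threshold {t_best} costs {best_cost}")
```

On this toy data the default threshold pays for three false alarms that a higher threshold avoids without missing any fraud. With real scores the trade-off is rarely this clean, which is why the sweep is worth running against actual cost figures.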
A model's validation score is a snapshot. Production is not a snapshot — transaction patterns drift, fraud tactics evolve, and a model that scored 0.977 ROC-AUC on Tuesday can quietly degrade by Friday with no error, no crash, and no alert, unless something is specifically watching for it.
To test this, we ran the deployed model against five sequential slices of held-out data (~11,393 transactions each), simulating the kind of gradual population drift a live system would encounter over time, and tracked three things Sentinel monitors continuously: recall, feature-level PSI, and cost.
| Period | Recall | Precision | Mean PSI | Top drifting feature | Status |
|---|---|---|---|---|---|
| P1 | 88.9% | 84.2% | 0.0022 | V10 (PSI 0.010) | ✅ Healthy |
| P2 | 78.3% | 85.7% | 0.0047 | Amount_log (PSI 0.029) | ✅ Healthy |
| P3 | 75.0% | 100.0% | 0.0070 | Amount_log (PSI 0.049) | ✅ Healthy |
| P4 | 75.0% | 54.5% | 0.0088 | Amount_log (PSI 0.064) | ⚠️ Watch |
| P5 | 60.0% | 60.0% | 0.0104 | Amount_log (PSI 0.087) | 🔴 RECALL_ALERT |
By Period 5, recall had fallen from 88.9% to 60.0%, a 29-point drop, and mean PSI had climbed nearly 5× from where it started (0.0022 to 0.0104). The drift was led by the transaction-amount feature Amount_log, whose PSI tripled from 0.029 in Period 2 to 0.087 in Period 5. Sentinel flagged this automatically:
RECALL_ALERT: 60.0% < 70% threshold
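The recall check behind that alert can be sketched as follows. The 70% threshold and the per-period recall values come from the table above; the function names and message formatting are assumptions for illustration, and the rule that produces the "Watch" status is not stated in the case study, so it is not modeled here.

```python
RECALL_ALERT_THRESHOLD = 0.70  # taken from the alert message above


def recall(y_true, y_pred):
    """Fraction of actual fraud cases that the model flagged."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no fraud labels")
    return tp / (tp + fn)


def recall_status(value, threshold=RECALL_ALERT_THRESHOLD):
    """Return an alert string when recall falls below the threshold."""
    if value < threshold:
        return f"RECALL_ALERT: {value:.1%} < {threshold:.0%} threshold"
    return "OK"


# Per-period recall as reported in the table.
period_recall = {"P1": 0.889, "P2": 0.783, "P3": 0.750, "P4": 0.750, "P5": 0.600}
statuses = {period: recall_status(r) for period, r in period_recall.items()}

for period, status in statuses.items():
    print(period, status)
```

Only Period 5 trips the alert. The check itself is trivial; the value lies in running it on every period rather than once at validation time.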
This is the failure mode that matters most in fraud detection: the model doesn't break, it just quietly starts missing more fraud. Without a system watching recall and feature drift period-over-period, this decay is invisible until someone notices fraud losses climbing weeks later.
A model can score well in the notebook that trained it and then behave differently once serialized, stored, and reloaded from a model registry — the exact path a model takes into production. This is a less-discussed failure mode, but a real one.
We pushed the trained model and scaler to a private Hugging Face repository, then pulled them back down cold — no shared state, no cached objects — loaded them, and re-scored a held-out validation set.
Performance is within noise of training-time scores — confirming the model behaves consistently after the exact serialize → store → reload cycle a production deployment actually uses, not just inside the training notebook. This is what "production-validated" actually means.
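The shape of such a round-trip check is sketched below under loud assumptions: a temporary directory stands in for the remote repository, JSON stands in for the real model and scaler format, and the tiny logistic scorer is a hypothetical placeholder for the trained model. None of this is the actual deployment code. The point is the comparison step: score the same rows before and after reload and fail if they diverge.

```python
import json
import math
import tempfile
from pathlib import Path


def predict(params, row):
    """Logistic score using the scaler stats and weights stored in `params`."""
    z = params["bias"]
    for x, mean, std, w in zip(row, params["mean"], params["std"], params["weights"]):
        z += w * (x - mean) / std
    return 1.0 / (1.0 + math.exp(-z))


def round_trip(params, directory):
    """Write the parameters to disk, then rebuild them from the stored file only."""
    path = Path(directory) / "model.json"
    path.write_text(json.dumps(params))
    return json.loads(path.read_text())


def max_score_gap(params_a, params_b, rows):
    """Largest absolute score difference between two parameter sets."""
    return max(abs(predict(params_a, r) - predict(params_b, r)) for r in rows)


# Hypothetical model parameters and held-out rows.
params = {"bias": -1.0, "weights": [0.8, -0.5], "mean": [0.0, 1.0], "std": [1.0, 2.0]}
rows = [[0.2, 1.5], [3.0, -1.0], [-2.0, 0.0]]

with tempfile.TemporaryDirectory() as tmp:
    reloaded = round_trip(params, tmp)

gap = max_score_gap(params, reloaded, rows)
assert gap <= 1e-9, f"reloaded model diverged by {gap}"
```

A real check would also pin library versions and compare the full metric set, since version mismatches between training and serving environments are a common source of silent divergence after reload.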
This model's ROC-AUC never moved dramatically. The failure was entirely in recall and feature drift — metrics most dashboards don't surface by default. If you're only watching AUC, you're watching the wrong thing.
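One mechanism that can produce this pattern is illustrated below with toy numbers. ROC-AUC depends only on how scores rank fraud against non-fraud, while recall at a fixed threshold depends on the score values themselves. The scores, the 0.9 shrink factor, and the function names are hypothetical; this is not a claim about what happened in the periods above.

```python
def roc_auc(y_true, scores):
    """Probability that a random fraud outranks a random non-fraud (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, scores, threshold):
    """Share of fraud cases scored at or above the threshold."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(1 for s in pos if s >= threshold) / len(pos)


THRESHOLD = 0.73  # the operating threshold from the case study

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
baseline = [0.10, 0.20, 0.30, 0.76, 0.74, 0.80, 0.90, 0.95]
drifted = [s * 0.9 for s in baseline]  # same ordering, uniformly lower scores

auc_before, auc_after = roc_auc(y_true, baseline), roc_auc(y_true, drifted)
recall_before = recall_at(y_true, baseline, THRESHOLD)
recall_after = recall_at(y_true, drifted, THRESHOLD)
print(f"AUC {auc_before} -> {auc_after}; recall {recall_before} -> {recall_after}")
```

Because the ordering of scores is unchanged, AUC stays put, yet half of the fraud cases slip under the fixed threshold. A dashboard that tracks only AUC would show nothing wrong.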
PSI on Amount_log had already reached 0.049 in Period 3, two periods before recall crossed the alert threshold in Period 5. Sentinel's early signal gives you time to act (adjust the threshold, investigate, or schedule retraining) before losses compound.
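For readers unfamiliar with the metric, PSI compares the binned distribution of a feature in a baseline sample against a current sample: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current proportions in bin i. The sketch below is a minimal version of that formula. The decile binning and the epsilon floor for empty bins are common conventions assumed here; the case study does not state how Sentinel bins features.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    ordered = sorted(expected)
    # Interior bin edges placed at baseline quantiles (deciles by default).
    edges = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # The eps floor keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Toy check: identical samples give zero; a shifted sample gives a positive value.
baseline_sample = [i / 1000 for i in range(1000)]
shifted_sample = [x + 0.2 for x in baseline_sample]

psi_same = psi(baseline_sample, baseline_sample)
psi_shifted = psi(baseline_sample, shifted_sample)
print(psi_same, psi_shifted)
```

Each term in the sum is non-negative, so PSI can only grow as the two distributions move apart, which makes a rising per-feature PSI a usable early-warning signal.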
The optimal threshold at training time (0.73) was based on the training-period cost structure. As drift accumulates, that optimum shifts. Sentinel's cost-sensitivity analysis is designed to be re-run continuously, not set once and forgotten.
This evaluation uses the public ULB Credit Card Fraud dataset and simulated production-drift periods constructed from held-out data, not live customer transactions. It demonstrates Sentinel's monitoring methodology on a fully reproducible, independently verifiable benchmark. A second case study using the larger IEEE-CIS Fraud Detection dataset (590,540 transactions) is in progress.
Most teams can't answer that question without running an audit. The Sentinel audit tells you — in 5–7 days, without system access, using your existing prediction logs.