Case Study 2 · IEEE-CIS Fraud Detection
Our first case study showed Sentinel catching a real recall collapse on the ULB Credit Card Fraud dataset. A monitoring system is only useful if its alerts mean something — which means it also needs to correctly say nothing is wrong when nothing is wrong.
IEEE-CIS Fraud Detection is a good test of exactly that: larger (590,540 transactions), messier (434 raw columns including transaction metadata and device/identity data), and with a fraud rate 20× higher than the Credit Card dataset (3.499% vs 0.173%).
Case Study 1 proved Sentinel catches a real failure. Case Study 2 proves Sentinel doesn't manufacture false alarms. You need both to trust a monitoring system. A system that fires on everything is as useless as one that fires on nothing.
After dropping columns with more than 90% missing values, encoding 30 categorical fields, and engineering two amount features (a log-transformed amount and a zero-amount indicator), we had 425 columns in total. We then ran an 18-trial Optuna search optimizing PR-AUC.
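The preprocessing steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not Sentinel's actual pipeline: the column name `TransactionAmt`, the derived names `amt_log` and `amt_is_zero`, the use of `log1p` as the log transform, and the toy data are all assumptions made for the example.

```python
import math

def drop_sparse_columns(columns, max_missing=0.90):
    """Keep columns whose share of missing values (None) is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:
            kept[name] = values
    return kept

def add_amount_features(columns, amount_col="TransactionAmt"):
    """Add a log-transformed amount and a zero-amount indicator.

    log1p is an assumed choice: it stays defined when the amount is 0.
    """
    amounts = columns[amount_col]
    out = dict(columns)
    out["amt_log"] = [math.log1p(a) for a in amounts]
    out["amt_is_zero"] = [1 if a == 0 else 0 for a in amounts]
    return out

# Toy data (hypothetical): one fully missing column, one half-missing column.
toy = {
    "TransactionAmt": [0.0, 25.0, 100.0, 9.5, 0.0, 59.0, 12.0, 300.0, 45.0, 7.0],
    "almost_empty": [None] * 10,            # 100% missing, dropped
    "half_missing": [None] * 5 + [1.0] * 5,  # 50% missing, kept
}

features = add_amount_features(drop_sparse_columns(toy))
print(sorted(features))
```

In a real pipeline the same logic would more likely be written with a dataframe library; plain lists are used here only so the sketch runs anywhere.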
The Credit Card model scored 0.977 ROC-AUC; this model scores 0.888. That is not a step backward. IEEE-CIS is a harder dataset: its anonymized features are far noisier (V258, V218, and similar columns dominate importance, with no interpretable meaning), and its fraud rate is roughly 20× higher (3.5% vs. 0.17%), which changes the cost trade-off. A model that catches about half of fraud at the default 0.50 threshold is a realistic starting point, and it is exactly where threshold optimization makes its biggest difference.
As with the Credit Card model, the default 0.50 threshold was the wrong choice. Sentinel's cost-aware threshold analysis found the optimal decision boundary at 0.15. That is a larger swing than Case Study 1's correction, because a higher fraud rate shifts the balance between the cost of missed fraud and the cost of false alarms.
| Threshold | Recall | Precision | Projected cost |
|---|---|---|---|
| 0.50 (default) | 49.3% | 47.7% | $164,645 |
| 0.15 (optimal) | 70.2% | 21.1% | $132,485 |
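Moving to the cost-optimal threshold raised recall by about 21 points (49.3% to 70.2%) and reduced projected cost by $32,160 on the test set ($164,645 to $132,485). The trade-off is lower precision (47.7% to 21.1%): more legitimate transactions are flagged, but under this cost structure each missed fraud costs more than the extra reviews. The default decision boundary almost never reflects the actual fraud rate and cost structure of a production system.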
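A cost-aware threshold sweep of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not Sentinel's implementation: the per-error costs (`cost_fn`, `cost_fp`), the candidate thresholds, and the toy scores are all hypothetical, and the source does not state the cost model behind the table.

```python
def confusion_at(scores, labels, threshold):
    """Count true positives, false positives, and false negatives at a threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    return tp, fp, fn

def evaluate(scores, labels, threshold, cost_fn, cost_fp):
    """Recall, precision, and total cost of errors at one threshold."""
    tp, fp, fn = confusion_at(scores, labels, threshold)
    return {
        "threshold": threshold,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "cost": fn * cost_fn + fp * cost_fp,
    }

def best_threshold(scores, labels, thresholds, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    results = [evaluate(scores, labels, t, cost_fn, cost_fp) for t in thresholds]
    return min(results, key=lambda r: r["cost"])

# Toy predictions (hypothetical): fraud probability and true label per transaction.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

# Assumed costs: a missed fraud is 10x as expensive as a false alarm.
COST_FN, COST_FP = 100, 10

default = evaluate(scores, labels, 0.50, COST_FN, COST_FP)
best = best_threshold(scores, labels, [0.15, 0.25, 0.35, 0.50, 0.65], COST_FN, COST_FP)
print(default)
print(best)
```

On this toy data the sweep also lands on 0.15, because missed fraud is priced well above false alarms. With a different cost ratio the optimum moves, which is why the threshold has to be derived from the cost structure rather than fixed in advance.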
Using the 0.15 threshold, we ran the same five-period drift simulation used in Case Study 1. Period-to-period fraud rate ranged from 2.99% to 4.22% — genuine, non-trivial variation. The question: would Sentinel fire a false alarm?
| Period | Fraud rate | Recall | Precision | Mean PSI | Status |
|---|---|---|---|---|---|
| P1 | 2.99% | 71.8% | 20.2% | 0.0127 | ✅ Healthy |
| P2 | 3.23% | 70.3% | 19.2% | 0.0028 | ✅ Healthy |
| P3 | 3.36% | 65.3% | 18.7% | 0.0045 | ✅ Healthy |
| P4 | 3.61% | 72.5% | 22.9% | 0.0103 | ✅ Healthy |
| P5 | 4.22% | 70.8% | 24.2% | 0.0075 | ✅ Healthy |
Recall held in a tight 65.3–72.5% band across all five periods, and no period dropped below the 60% alert threshold. Mean PSI peaked at 0.0127 in P1, slightly above the 0.0104 that preceded the Credit Card model's Period 5 breakdown, but here there was no corresponding recall collapse. Zero false alarms fired.
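For readers unfamiliar with the two signals in the table, the sketch below shows one common way to compute a Population Stability Index for a single feature, plus a recall-floor check applied to the recall values reported above. The equal-width binning, the `eps` smoothing for empty bins, and the function names are assumptions for illustration; the source does not describe how Sentinel bins features or averages PSI across them.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline range (an assumed choice);
    eps keeps empty bins from producing log(0) or division by zero.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Recall per period, taken from the table above; 60% floor as stated in the text.
RECALL_FLOOR = 0.60
period_recall = {"P1": 0.718, "P2": 0.703, "P3": 0.653, "P4": 0.725, "P5": 0.708}
alerts = [p for p, r in period_recall.items() if r < RECALL_FLOOR]
print("periods below recall floor:", alerts)

# Toy feature samples (hypothetical) to exercise the PSI function.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
print("PSI, identical samples:", psi(baseline, baseline))
print("PSI, shifted sample:", psi(baseline, shifted))
```

The recall check returns an empty list for all five periods, matching the "Healthy" column. PSI is zero for identical samples and grows as the current distribution moves away from the baseline.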
Together, these two studies apply the same monitoring methodology to two different outcomes: one model degrading, correctly flagged; one model holding steady, correctly left alone. That's the minimum viable test of whether a monitoring system is actually measuring something real rather than generating noise.
This evaluation uses the public IEEE-CIS Fraud Detection dataset and simulated production-drift periods constructed from held-out test data, not live customer transactions. Alert thresholds (60% recall floor) are set per-model based on validated baseline performance, not a fixed universal number.
Is your model holding steady or quietly degrading? A Sentinel audit answers that in 5–7 days, without system access, using your existing prediction logs.