Case Study · Credit Card Fraud Detection

Detecting Silent Model Failure in a Production Fraud Detector

Dataset: ULB Credit Card Fraud (Kaggle) · 284,807 transactions · XGBoost · Optuna-tuned (40 trials) · Monitoring: Strataforge Sentinel
✓ Fully reproducible · Public dataset · Every number below is independently verifiable
Training ROC-AUC: 0.977
Recall at Period 5: 60.0% (was 88.9%)
Recall drop across 5 periods: −29 pts
Optimal threshold: 0.73 (not the 0.50 default)
The setup

A strong model. Trained, validated, deployed. Then what?

We trained a fraud classifier on the ULB Credit Card Fraud dataset: 284,807 transactions, only 492 of them fraud (about 0.17%). This is a standard, widely used benchmark, which matters here because every number below can be independently checked by anyone who downloads the same public dataset.

After feature engineering — 9 engineered features added to the original 28 PCA components (rolling stats, interaction terms, log-transformed amount) for 38 features total — and a 40-trial Optuna search optimizing for PR-AUC, the final model scored:

Training performance
ROC-AUC: 0.977
PR-AUC: 0.796
Decision threshold: 0.73
Tuning: 40-trial Optuna · PR-AUC objective
Feature engineering
Original features: 28 PCA components + Amount + Time
Engineered features: +9 (rolling stats, interaction terms, Amount_log)
Total features: 38
Threshold basis: Cost-sensitivity analysis, not default 0.50
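To make the three feature families concrete, here is a minimal sketch of one feature of each kind. Only the name Amount_log comes from this study. The log1p transform, the five-row window, the Amount_roll_mean name and the V1 × V2 interaction are illustrative assumptions, not the nine features actually used.

```python
import math
from collections import deque

def engineer_features(rows, window=5):
    """Add a log amount, a rolling mean of Amount, and one interaction term.

    `rows` is a list of dicts ordered by Time. The window size and the
    choice of V1 * V2 as the interaction are assumptions for illustration.
    """
    recent = deque(maxlen=window)  # amounts of the previous `window` rows
    out = []
    for row in rows:
        new = dict(row)
        # log1p handles zero-amount transactions without a special case.
        new["Amount_log"] = math.log1p(row["Amount"])
        # Rolling mean over earlier rows only, so no row sees its own amount
        # (the first row falls back to its own amount).
        new["Amount_roll_mean"] = sum(recent) / len(recent) if recent else row["Amount"]
        new["V1_x_V2"] = row["V1"] * row["V2"]
        out.append(new)
        recent.append(row["Amount"])
    return out

rows = [
    {"Time": 0, "Amount": 0.0, "V1": 1.0, "V2": 2.0},
    {"Time": 1, "Amount": 100.0, "V1": -1.0, "V2": 0.5},
    {"Time": 2, "Amount": 50.0, "V1": 0.0, "V2": 3.0},
]
features = engineer_features(rows, window=2)
```

Computing rolling statistics from earlier rows only is a deliberate choice: it keeps the feature available at scoring time in production, when later transactions have not happened yet.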
◈ Note on threshold

The optimal threshold of 0.73 — selected via cost-sensitivity analysis rather than the default 0.50 — alone reduced projected cost from $1,990 to $1,850 on the test set. The default threshold is almost never the right threshold in fraud detection.
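A cost-sensitivity sweep of this kind can be sketched in a few lines. The per-error costs, the toy labels and the toy scores below are hypothetical, and Sentinel's actual cost model is not reproduced here. The sketch only shows the mechanics: score every candidate threshold by total cost and keep the cheapest.

```python
def total_cost(y_true, scores, threshold, fn_cost, fp_cost):
    """Cost of flagging every transaction whose score is >= threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += fn_cost  # missed fraud
        elif label == 0 and flagged:
            cost += fp_cost  # false alarm sent to manual review
    return cost

def best_threshold(y_true, scores, fn_cost, fp_cost, steps=100):
    """Sweep thresholds 0.00, 0.01, ..., 1.00 and return (threshold, cost)
    for the cheapest one. Ties go to the lowest threshold."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        ((t, total_cost(y_true, scores, t, fn_cost, fp_cost)) for t in candidates),
        key=lambda pair: pair[1],
    )

# Toy data: 1 = fraud, 0 = legitimate. Costs are assumed, in dollars.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.20, 0.55, 0.65, 0.60, 0.80, 0.30, 0.95]
default_cost = total_cost(y_true, scores, 0.50, fn_cost=100, fp_cost=10)
threshold, cost = best_threshold(y_true, scores, fn_cost=100, fp_cost=10)
```

On this toy data the sweep picks a threshold above 0.50 because it removes one false alarm without missing any fraud. With a different cost ratio the optimum would move, which is why the threshold has to come from the cost structure rather than from a default.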


The real question

Does it stay good?

A model's validation score is a snapshot. Production is not a snapshot — transaction patterns drift, fraud tactics evolve, and a model that scored 0.977 ROC-AUC on Tuesday can quietly degrade by Friday with no error, no crash, and no alert, unless something is specifically watching for it.

To test this, we ran the deployed model against five sequential slices of held-out data (~11,393 transactions each), simulating the kind of gradual population drift a live system would encounter over time, and tracked three things Sentinel monitors continuously: recall, feature-level PSI, and cost.
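The per-period evaluation can be sketched as two small helpers: one that cuts time-ordered held-out records into contiguous slices, and one that computes recall and precision at a fixed decision threshold. The function names and the (label, score) record layout are assumptions made for this sketch.

```python
def split_into_periods(records, n_periods=5):
    """Split time-ordered records into n_periods contiguous, near-equal slices."""
    size, extra = divmod(len(records), n_periods)
    periods, start = [], 0
    for i in range(n_periods):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        periods.append(records[start:end])
        start = end
    return periods

def recall_precision(records, threshold):
    """records: iterable of (label, score) pairs, label 1 = fraud.

    Returns (recall, precision); either is None when its denominator is zero.
    """
    records = list(records)
    tp = sum(1 for y, s in records if y == 1 and s >= threshold)
    fn = sum(1 for y, s in records if y == 1 and s < threshold)
    fp = sum(1 for y, s in records if y == 0 and s >= threshold)
    recall = tp / (tp + fn) if tp + fn else None
    precision = tp / (tp + fp) if tp + fp else None
    return recall, precision

# Assuming a held-out set of 56,961 rows, five slices come out at
# 11,392 or 11,393 rows each.
sizes = [len(p) for p in split_into_periods(list(range(56_961)))]
```

Keeping the slices contiguous and in time order is what makes them behave like successive production periods. Shuffling before splitting would average the drift away.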

Period | Recall | Precision | Mean PSI | Top drifting feature | Status
P1 | 88.9% | 84.2% | 0.0022 | V10 (PSI 0.010) | ✅ Healthy
P2 | 78.3% | 85.7% | 0.0047 | Amount_log (PSI 0.029) | ✅ Healthy
P3 | 75.0% | 100.0% | 0.0070 | Amount_log (PSI 0.049) | ✅ Healthy
P4 | 75.0% | 54.5% | 0.0088 | Amount_log (PSI 0.064) | ⚠️ Watch
P5 | 60.0% | 60.0% | 0.0104 | Amount_log (PSI 0.087) | 🔴 RECALL_ALERT
Figure: Recall Decay, 5 Production Periods. P1 baseline 88.9% → P5 final 60.0% · Alert threshold: 70% · Mean PSI up nearly 5× (0.0022 → 0.0104)
Figure: Amount_log PSI, Feature Drift Signal. Amount_log PSI rose from 0.029 at P2 to 0.087 at P5 (3×), still below the 0.10 Watch threshold · Status moved to Watch at P4 and RECALL_ALERT at P5 · PSI thresholds: Watch 0.10, Alert 0.25
🔴 Critical finding

By Period 5, recall had fallen from 88.9% to 60.0%, a 29-point drop, and mean PSI had climbed nearly 5× from where it started (0.0022 to 0.0104), led by the transaction-amount feature Amount_log (PSI 0.087 at P5). Sentinel flagged this automatically: RECALL_ALERT: 60.0% < 70% threshold
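For readers who want to recompute the drift numbers, PSI compares the binned distribution of a feature in a production period against its baseline: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current shares of bin i. The sketch below is one common way to compute it. The ten quantile bins and the epsilon floor are assumptions, and Sentinel's own binning may differ, so exact values may not match the table.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    ref = sorted(expected)
    # Interior bin edges at baseline quantiles, so baseline bins are
    # roughly equal-sized.
    edges = [ref[len(ref) * k // n_bins] for k in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Number of edges at or below v is the index of v's bin.
            counts[sum(v >= e for e in edges)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

baseline = [i / 100 for i in range(1000)]   # toy feature values
shifted = [v + 2.0 for v in baseline]       # same shape, shifted upward

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
```

An unchanged distribution scores zero, and the shifted one lands far above the 0.25 alert level quoted above. Real drift is usually much subtler, which is why the trend across periods matters as much as any single value.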

◈ Why this matters

This is the failure mode that matters most in fraud detection: the model doesn't break, it just quietly starts missing more fraud. Without a system watching recall and feature drift period-over-period, this decay is invisible until someone notices fraud losses climbing weeks later.
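The period-over-period recall check itself is simple, and that is the point: nothing about it is hard except remembering to run it. A minimal sketch using the recall figures reported for the five periods and the 70% alert threshold follows. The function and variable names are illustrative, not Sentinel's API.

```python
RECALL_ALERT_THRESHOLD = 70.0  # percent, the alert level used in this study

# Recall per period, in percent, as reported for P1 through P5.
recall_by_period = {"P1": 88.9, "P2": 78.3, "P3": 75.0, "P4": 75.0, "P5": 60.0}

def recall_alerts(recalls, threshold=RECALL_ALERT_THRESHOLD):
    """Return {period: alert message} for every period below the threshold."""
    return {
        period: f"RECALL_ALERT: {recall:.1f}% < {threshold:.0f}% threshold"
        for period, recall in recalls.items()
        if recall < threshold
    }

alerts = recall_alerts(recall_by_period)
drop = recall_by_period["P1"] - recall_by_period["P5"]  # points lost since baseline
```

Only P5 trips the alert, and the drop from baseline comes to roughly 29 points.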


Proving production-readiness

Does the model survive the path into production?

A model can score well in the notebook that trained it and then behave differently once serialized, stored, and reloaded from a model registry — the exact path a model takes into production. This is a less-discussed failure mode, but a real one.

We pushed the trained model and scaler to a private Hugging Face repository, then pulled them back down cold — no shared state, no cached objects — loaded them, and re-scored a held-out validation set.

After Hugging Face reload
Rows scored: 56,961
ROC-AUC: 0.968
PR-AUC: 0.788
Recall (fraud class): 77.2%
Precision (fraud class): 62.9%
vs. Training-time scores
ROC-AUC delta: −0.009 (within noise)
PR-AUC delta: −0.008 (within noise)
Reload path: serialize → Hugging Face → cold pull → load → score
Result: Consistent ✓
✓ Result

Performance is within noise of training-time scores — confirming the model behaves consistently after the exact serialize → store → reload cycle a production deployment actually uses, not just inside the training notebook. This is what "production-validated" actually means.
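The shape of that round-trip check can be shown with a small stand-in. This sketch replaces the Hugging Face repository with a local temporary directory and the XGBoost model with a toy logistic scorer stored as JSON, so it illustrates the serialize → store → reload → re-score pattern rather than the exact pipeline used here.

```python
import json
import math
import os
import tempfile

def score(model, row):
    """Logistic score from weights plus a bias (toy stand-in for the real model)."""
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], row))
    return 1.0 / (1.0 + math.exp(-z))

def round_trip(model, path):
    """Write the model to disk, then load it back from the file alone."""
    with open(path, "w") as f:
        json.dump(model, f)
    with open(path) as f:
        return json.load(f)

def max_score_delta(model_a, model_b, rows):
    """Largest absolute score difference between two models on the same rows."""
    return max(abs(score(model_a, r) - score(model_b, r)) for r in rows)

model = {"weights": [0.8, -1.2, 0.05], "bias": -2.0}   # hypothetical parameters
rows = [[1.0, 0.5, 10.0], [-0.3, 2.0, 250.0], [0.0, 0.0, 0.0]]

with tempfile.TemporaryDirectory() as tmp:
    reloaded = round_trip(model, os.path.join(tmp, "model.json"))

delta = max_score_delta(model, reloaded, rows)
```

Comparing scores row by row on identical inputs is a stricter test than comparing aggregate metrics, and it is a useful first check: any nonzero delta there points to a serialization or preprocessing mismatch rather than to data differences.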


What this demonstrates

Three things Sentinel proved here

01

A strong validation score is not the same as a reliable production system

This model's ROC-AUC never moved dramatically. The failure was entirely in recall and feature drift — metrics most dashboards don't surface by default. If you're only watching AUC, you're watching the wrong thing.

02

Silent degradation is detectable before it becomes a loss event

PSI on Amount_log had been rising since P2 and reached 0.049 at P3, two periods before recall crossed the 70% alert threshold at P5. Sentinel's early signal gives you time to act (adjust the threshold, investigate, or schedule retraining) before losses compound.

03

Threshold and cost tuning is not a one-time step

The optimal threshold at training time (0.73) was based on the training-period cost structure. As drift accumulates, that optimum shifts. Sentinel's cost-sensitivity analysis is designed to be re-run continuously, not set once and forgotten.

Methodology note

This evaluation uses the public ULB Credit Card Fraud dataset and simulated production-drift periods constructed from held-out data, not live customer transactions. It demonstrates Sentinel's monitoring methodology on a fully reproducible, independently verifiable benchmark. A second case study using the larger IEEE-CIS Fraud Detection dataset (590,540 transactions) is in progress.

Does your production model have a recall you can verify?

Most teams can't answer that question without running an audit. The Sentinel audit tells you — in 5–7 days, without system access, using your existing prediction logs.