Case Study · Credit Card Fraud Detection

Detecting Silent Model Failure in a Production Fraud Detector

◈ Dataset: ULB Credit Card Fraud (Kaggle) ◈ 284,807 transactions ◈ XGBoost · Optuna-tuned (40 trials) ◈ Monitoring: Strataforge Sentinel

✓ Fully reproducible · Public dataset · Every number below is independently verifiable

Training ROC-AUC: 0.977

Recall at Period 5: 60.0% (was 88.9%)

Recall drop across 5 periods: −29 pts

Optimal threshold: 0.73 (not the 0.50 default)

The setup

A strong model. Trained, validated, deployed. Then what?

We trained a fraud classifier on the ULB Credit Card Fraud dataset — 284,807 transactions, only 492 of them fraud. This is a standard, widely-used benchmark, which matters here: it means every number below can be independently checked by anyone who downloads the same public dataset.

After feature engineering, which added 9 engineered features (rolling stats, interaction terms, log-transformed amount) to the dataset's original features for 38 features total, and a 40-trial Optuna search optimizing for PR-AUC, the final model scored:

Training performance

ROC-AUC: 0.977

PR-AUC: 0.796

Decision threshold: 0.73

Tuning: 40-trial Optuna · PR-AUC objective

Feature engineering

Original features: 28 PCA components + Amount + Time

Engineered features: +9 (rolling stats, interaction terms, Amount_log)

Total features: 38

Threshold basis: Cost-sensitivity analysis, not default 0.50

◈ Note on threshold

Moving from the default 0.50 to the cost-optimal threshold of 0.73, selected via cost-sensitivity analysis, reduced projected cost on the test set from $1,990 to $1,850, a saving of $140 (about 7%). With fraud at roughly 0.17% of transactions (492 of 284,807), the default threshold is rarely the cost-optimal one.
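A cost-sensitivity analysis of this kind can be sketched as a sweep over candidate thresholds that keeps the cheapest one. The case study does not publish its cost model, so the two unit costs below are hypothetical placeholders, and the function names are illustrative rather than part of Sentinel.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical unit costs (assumptions, not the case study's figures).
COST_FALSE_NEGATIVE = 100.0  # a fraud the model let through
COST_FALSE_POSITIVE = 5.0    # a legitimate transaction flagged for review


def total_cost(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Cost of the decisions made by flagging every score >= threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += COST_FALSE_NEGATIVE
        elif label == 0 and flagged:
            cost += COST_FALSE_POSITIVE
    return cost


def best_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    candidates: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, cost) with the lowest cost; ties go to the lowest threshold."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return min(((t, total_cost(y_true, scores, t)) for t in candidates), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy labels and scores, for illustration only.
    labels = [0, 0, 0, 0, 1, 1]
    probs = [0.10, 0.55, 0.60, 0.20, 0.80, 0.90]
    print("cost at 0.50:", total_cost(labels, probs, 0.50))
    print("best (threshold, cost):", best_threshold(labels, probs))
```

Because the result depends entirely on the ratio of the two unit costs, the same sweep gives a different threshold whenever the cost structure or the score distribution changes.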

The real question

Does it stay good?

A model's validation score is a snapshot. Production is not a snapshot — transaction patterns drift, fraud tactics evolve, and a model that scored 0.977 ROC-AUC on Tuesday can quietly degrade by Friday with no error, no crash, and no alert, unless something is specifically watching for it.

To test this, we ran the deployed model against five sequential slices of held-out data (~11,393 transactions each), simulating the kind of gradual population drift a live system would encounter over time, and tracked three things Sentinel monitors continuously: recall, feature-level PSI, and cost.

Period	Recall	Precision	Mean PSI	Top drifting feature	Status
P1	88.9%	84.2%	0.0022	V10 (PSI 0.010)	✅ Healthy
P2	78.3%	85.7%	0.0047	Amount_log (PSI 0.029)	✅ Healthy
P3	75.0%	100.0%	0.0070	Amount_log (PSI 0.049)	✅ Healthy
P4	75.0%	54.5%	0.0088	Amount_log (PSI 0.064)	⚠️ Watch
P5	60.0%	60.0%	0.0104	Amount_log (PSI 0.087)	🔴 RECALL_ALERT
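For readers unfamiliar with PSI, the sketch below shows one common way to compute it for a single feature: bin the baseline sample, compare each bin's share in the new period, and sum (actual − expected) × ln(actual / expected). The case study does not say how Sentinel bins features, so the quantile binning on the baseline, the bin count of 10, and the small floor for empty bins are all assumptions.

```python
import bisect
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` measured against the `expected` baseline.

    Assumptions: bin edges are quantiles of the baseline sample, and empty
    bins are floored at `eps` so the logarithm stays defined.
    """
    ref = sorted(expected)
    # Interior cut points at the baseline's 1/n_bins, 2/n_bins, ... quantiles.
    edges = [ref[min(len(ref) - 1, (len(ref) * k) // n_bins)] for k in range(1, n_bins)]

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Number of cut points <= v gives the bin index (0 .. n_bins - 1).
            counts[bisect.bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


if __name__ == "__main__":
    baseline = [float(v) for v in range(1000)]
    shifted = [v + 300.0 for v in baseline]  # synthetic drift, for illustration only
    print("PSI, identical samples:", psi(baseline, baseline))
    print("PSI, shifted sample:   ", psi(baseline, shifted))
```

Absolute PSI values depend on the binning choice, so figures from this sketch should not be expected to match the table unless the binning matches as well.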

Recall Decay: 5 Production Periods

P1 baseline 88.9% → P5 final 60.0% · Alert threshold: 70% · Mean PSI up about 4.7× (0.0022 → 0.0104)

Amount_log PSI: Feature Drift Signal (0.029 at P2 → 0.087 at P5)

Amount_log PSI rose every period from P2 (0.049 at P3, 0.064 at P4, 0.087 at P5) while staying below the 0.10 PSI Watch threshold; the RECALL_ALERT at P5 fired on recall · PSI Watch threshold: 0.10 · Alert: 0.25

🔴 Critical finding

By Period 5, recall had fallen from 88.9% to 60.0%, a 29-point drop, while mean PSI had climbed nearly 5× (0.0022 → 0.0104) and PSI on Amount_log, the log-transformed transaction amount, had tripled since P2 (0.029 → 0.087). Sentinel flagged the recall breach automatically: RECALL_ALERT: 60.0% < 70% threshold
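The alert logic can be re-derived from the published table with a few lines of code. The sketch below is an illustration, not Sentinel's implementation: the recall floor (70%) and PSI thresholds (0.10 Watch, 0.25 Alert) come from the text, the per-period values are transcribed from the table, and the function names and the choice of inclusive comparisons are assumptions. The rule behind the Watch status at P4 is not stated in the source, so it is not modeled here.

```python
# (period, recall, mean PSI, top-feature PSI), transcribed from the table above.
PERIODS = [
    ("P1", 0.889, 0.0022, 0.010),
    ("P2", 0.783, 0.0047, 0.029),
    ("P3", 0.750, 0.0070, 0.049),
    ("P4", 0.750, 0.0088, 0.064),
    ("P5", 0.600, 0.0104, 0.087),
]

RECALL_FLOOR = 0.70  # alert threshold quoted in the case study
PSI_WATCH = 0.10     # PSI watch threshold quoted in the case study
PSI_ALERT = 0.25     # PSI alert threshold quoted in the case study


def recall_alerts(periods, floor=RECALL_FLOOR):
    """Map each period whose recall is below the floor to an alert message."""
    return {
        name: f"RECALL_ALERT: {recall:.1%} < {floor:.0%} threshold"
        for name, recall, _mean_psi, _top_psi in periods
        if recall < floor
    }


def psi_level(value, watch=PSI_WATCH, alert=PSI_ALERT):
    """Classify one PSI value (inclusive comparisons are an assumption)."""
    if value >= alert:
        return "alert"
    if value >= watch:
        return "watch"
    return "ok"


def recall_drop_points(periods):
    """Recall lost between the first and last period, in percentage points."""
    return round((periods[0][1] - periods[-1][1]) * 100, 1)


def mean_psi_ratio(periods):
    """How many times larger mean PSI is in the last period than in the first."""
    return round(periods[-1][2] / periods[0][2], 1)


if __name__ == "__main__":
    print(recall_alerts(PERIODS))
    print({name: psi_level(top) for name, _r, _m, top in PERIODS})
    print(recall_drop_points(PERIODS), mean_psi_ratio(PERIODS))
```

Run against the table, only P5 breaches the recall floor, and no top-feature PSI value reaches the 0.10 watch level, which is why the recall check rather than a PSI threshold is what raises the alert in this dataset.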

◈ Why this matters

This is the failure mode that matters most in fraud detection: the model doesn't break, it just quietly starts missing more fraud. Without a system watching recall and feature drift period-over-period, this decay is invisible until someone notices fraud losses climbing weeks later.

Proving production-readiness

Does the model survive the path into production?

A model can score well in the notebook that trained it and then behave differently once serialized, stored, and reloaded from a model registry — the exact path a model takes into production. This is a less-discussed failure mode, but a real one.

We pushed the trained model and scaler to a private Hugging Face repository, then pulled them back down cold — no shared state, no cached objects — loaded them, and re-scored a held-out validation set.

After Hugging Face reload

Rows scored56,961

ROC-AUC0.968

PR-AUC0.788

Recall (fraud class)77.2%

Precision (fraud class)62.9%

vs. Training-time scores

ROC-AUC delta−0.009 (within noise)

PR-AUC delta−0.008 (within noise)

Reload pathserialize → Hugging Face → cold pull → load → score

ResultConsistent ✓

✓ Result

Performance is within noise of training-time scores — confirming the model behaves consistently after the exact serialize → store → reload cycle a production deployment actually uses, not just inside the training notebook. This is what "production-validated" actually means.
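The shape of that check can be sketched without any external service. In the illustration below a temporary local directory stands in for the private Hugging Face repository, a tiny JSON-serialized logistic scorer stands in for the XGBoost model and scaler, and the 0.01 tolerance is a hypothetical definition of "within noise" (the case study does not give one). Only the metric values are taken from the text.

```python
import json
import math
import tempfile
from pathlib import Path

# Metric values reported in the case study.
TRAINING = {"roc_auc": 0.977, "pr_auc": 0.796}
RELOADED = {"roc_auc": 0.968, "pr_auc": 0.788}

NOISE_TOLERANCE = 0.01  # hypothetical; the source does not define "within noise"


def score(params: dict, row) -> float:
    """Stand-in model: logistic score from stored weights and bias."""
    z = params["bias"] + sum(w * x for w, x in zip(params["weights"], row))
    return 1.0 / (1.0 + math.exp(-z))


def round_trip(params: dict, store_dir: Path) -> dict:
    """Serialize to store_dir, then rebuild the parameters from the stored file only."""
    path = store_dir / "model.json"
    path.write_text(json.dumps(params))
    return json.loads(path.read_text())


def metric_deltas(before: dict, after: dict) -> dict:
    """Per-metric change after reload, rounded to three decimals."""
    return {name: round(after[name] - before[name], 3) for name in before}


def is_consistent(deltas: dict, tolerance: float) -> bool:
    """True when every metric moved by no more than the tolerance."""
    return all(abs(delta) <= tolerance for delta in deltas.values())


if __name__ == "__main__":
    params = {"weights": [0.4, -1.3, 2.0], "bias": -0.7}  # toy values
    rows = [[1.0, 0.5, 0.2], [0.0, 2.0, -1.0]]
    with tempfile.TemporaryDirectory() as tmp:
        reloaded = round_trip(params, Path(tmp))
    print("scores identical:", [score(params, r) for r in rows] == [score(reloaded, r) for r in rows])
    deltas = metric_deltas(TRAINING, RELOADED)
    print(deltas, "consistent:", is_consistent(deltas, NOISE_TOLERANCE))
```

In a real pipeline the two halves answer different questions: the score comparison checks that the reloaded artifact is the same model, while the metric comparison checks that a different evaluation set did not move the headline numbers beyond the chosen tolerance.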

What this demonstrates

Three things Sentinel proved here

A strong validation score is not the same as a reliable production system

This model's ROC-AUC never moved dramatically. The failure showed up in recall (88.9% → 60.0%), precision (84.2% → 60.0%), and feature drift, metrics many dashboards don't surface by default. If you're only watching AUC, you can miss it entirely.

Silent degradation is detectable before it becomes a loss event

PSI on Amount_log had been rising since P2 and reached 0.049 by P3, two periods before recall crossed the 70% alert threshold at P5, even though it was still below the 0.10 Watch threshold. That rising trend is the early signal: it gives you time to adjust the threshold, investigate, or schedule retraining before losses compound.

Threshold and cost tuning is not a one-time step

The optimal threshold at training time (0.73) was based on the training-period cost structure. As drift accumulates, that optimum shifts. Sentinel's cost-sensitivity analysis is designed to be re-run continuously, not set once and forgotten.

Methodology note

This evaluation uses the public ULB Credit Card Fraud dataset and simulated production-drift periods constructed from held-out data, not live customer transactions. It demonstrates Sentinel's monitoring methodology on a fully reproducible, independently verifiable benchmark. A second case study using the larger IEEE-CIS Fraud Detection dataset (590,540 transactions) is in progress.

Does your production model have a recall you can verify?

Most teams can't answer that question without running an audit. The Sentinel audit tells you — in 5–7 days, without system access, using your existing prediction logs.