
Case Study 2 · IEEE-CIS Fraud Detection

Validating Model Stability on a Harder, Higher-Volume Fraud Dataset

◈ IEEE-CIS Fraud Detection (Kaggle) ◈ 590,540 transactions ◈ 434 raw columns ◈ XGBoost · Optuna 18 trials

✓ Public dataset · Reproducible · Independently verifiable

0.888

ROC-AUC (harder dataset)

0 alerts

False alarms across 5 periods

65.3–72.5%

Recall band — stable throughout

+21 pts

Recall recovered via threshold (0.15)

Why a second dataset

A monitoring system must also know when to stay silent.

Our first case study showed Sentinel catching a real recall collapse on the ULB Credit Card Fraud dataset. A monitoring system is only useful if its alerts mean something — which means it also needs to correctly say nothing is wrong when nothing is wrong.

IEEE-CIS Fraud Detection is a good test of exactly that: larger (590,540 transactions), messier (434 raw columns including transaction metadata and device/identity data), and with a fraud rate 20× higher than the Credit Card dataset (3.499% vs 0.173%).

◈ The two-sided test

Case Study 1 proved Sentinel catches a real failure. Case Study 2 proves Sentinel doesn't manufacture false alarms. You need both to trust a monitoring system. A system that fires on everything is as useless as one that fires on nothing.

The model

Harder dataset, harder problem — expected and handled.

After dropping columns with more than 90% missing values, encoding 30 categorical fields, and engineering two amount features (a log-transformed amount and a zero-amount indicator), we were left with 425 columns. We then ran an 18-trial Optuna search optimizing PR-AUC.
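The column filter and amount features described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the pipeline used in the study: the function names, the `amount_log` and `amount_is_zero` feature names, and the use of `log1p` for the log transform are all assumptions, and categorical encoding is omitted because the encoding method is not stated.

```python
import math

def drop_sparse_columns(rows, max_missing=0.90):
    """Drop columns whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / n <= max_missing
    ]
    return [{col: row[col] for col in keep} for row in rows]

def add_amount_features(rows, amount_col="TransactionAmt"):
    """Add a log-transformed amount and a zero-amount indicator to each row."""
    return [
        {
            **row,
            # log(1 + amount) is an assumed transform; it stays defined at 0
            "amount_log": math.log1p(row[amount_col]),
            "amount_is_zero": int(row[amount_col] == 0),
        }
        for row in rows
    ]

# Toy data: sparse_col is missing in 19 of 20 rows (95%), so it is dropped.
rows = [
    {"TransactionAmt": float(i), "sparse_col": 1.0 if i == 0 else None, "dense_col": i % 3}
    for i in range(20)
]
cleaned = add_amount_features(drop_sparse_columns(rows))
print(sorted(cleaned[0]))
```

On a real dataset the same two steps would normally be done on a dataframe; the list-of-dicts form here only keeps the sketch dependency-free.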

Test performance (default threshold 0.50)

ROC-AUC: 0.888

PR-AUC: 0.509

Recall: 49.3%

Precision: 47.7%

Model configuration

Architecture: XGBoost

max_depth / learning rate: 9 / 0.135

n_estimators: 303

Top features: V258, V218, V70, V257, V294

Why lower ROC-AUC is expected here

The Credit Card model scored 0.977 ROC-AUC. This model scores 0.888. That's not a step backward. IEEE-CIS is a different problem: a 3.5% fraud rate (20× higher than Credit Card's 0.17%) and far noisier anonymized features (V258, V218, and similar columns dominate importance, with no interpretable meaning). A model catching roughly half of fraud at the default 0.50 threshold is a realistic starting point, and it is exactly where threshold optimization makes its biggest difference.

Threshold optimization

Default threshold left 21 points of recall on the table.

As with the Credit Card model, the default 0.50 threshold was the wrong threshold. Sentinel's cost-aware threshold analysis found the optimal decision boundary at 0.15 — a larger swing than Case Study 1's correction, because higher fraud rates change the optimal cost structure significantly.

Threshold	Recall	Precision	Projected cost
0.50 (default)	49.3%	47.7%	$164,645
0.15 (optimal)	70.2%	21.1%	$132,485

◈ Result

Moving to the cost-optimal threshold recovered 21 points of recall and reduced projected cost by $32,160 on the test set. The default decision boundary almost never reflects the actual fraud-rate and cost structure of a production system.
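The idea behind a cost-aware threshold search can be illustrated with a small sweep. This is a sketch under stated assumptions, not Sentinel's implementation: the cost model (a fixed cost per missed fraud and per false alarm), the dollar values, the threshold grid, and the toy scores are all hypothetical, since the study does not publish its cost parameters.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when flagging every score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def projected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost = missed fraud * cost_fn + false alarms * cost_fp."""
    _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    return fn * cost_fn + fp * cost_fp

def best_threshold(y_true, scores, thresholds, cost_fn, cost_fp):
    """Return the threshold on the grid with the lowest projected cost."""
    return min(
        thresholds,
        key=lambda t: projected_cost(y_true, scores, t, cost_fn, cost_fp),
    )

# Toy labels and model scores (hypothetical, not from the IEEE-CIS model).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.18, 0.4, 0.1, 0.1, 0.05, 0.05, 0.02]
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# Assumed costs: a missed fraud is 10x as expensive as a false alarm.
best = best_threshold(y_true, scores, grid, cost_fn=100, cost_fp=10)
cost_default = projected_cost(y_true, scores, 0.50, cost_fn=100, cost_fp=10)
cost_best = projected_cost(y_true, scores, best, cost_fn=100, cost_fp=10)
print(best, cost_default, cost_best)
```

When missed fraud costs far more than a false alarm, the cheapest threshold tends to sit well below 0.50, which matches the direction of the shift reported in the table.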

Production drift monitoring

Five periods. No false alarms. Sentinel stayed silent.

Using the 0.15 threshold, we ran the same five-period drift simulation used in Case Study 1. Period-to-period fraud rate ranged from 2.99% to 4.22% — genuine, non-trivial variation. The question: would Sentinel fire a false alarm?

Period	Fraud rate	Recall	Precision	Mean PSI	Status
P1	2.99%	71.8%	20.2%	0.0127	✅ Healthy
P2	3.23%	70.3%	19.2%	0.0028	✅ Healthy
P3	3.36%	65.3%	18.7%	0.0045	✅ Healthy
P4	3.61%	72.5%	22.9%	0.0103	✅ Healthy
P5	4.22%	70.8%	24.2%	0.0075	✅ Healthy

Figure: Recall band by period, all periods within safe zone. Alert floor: 60% · Band: 65.3–72.5% · All 5 periods within safe zone

◈ Key result

Recall held in a tight 65.3–72.5% band across all five periods, and no period dropped below the 60% alert floor. Mean PSI stayed low throughout, peaking at 0.0127 in P1. That peak is comparable to the 0.0104 that preceded the Credit Card model's Period 5 breakdown, yet here it came without a corresponding recall collapse. Zero false alarms fired.
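For readers unfamiliar with PSI (Population Stability Index), the standard formula sums (actual share − expected share) × ln(actual share / expected share) over the bins of a feature. The sketch below applies that formula to toy data. The bin edges, the `eps` floor for empty bins, and the reading of "Mean PSI" as a simple average across features are assumptions; the study does not describe its binning.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges strictly below the value
            counts[sum(v > edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def mean_psi(baseline_features, current_features, edges_by_feature):
    """Average PSI across features (assumed meaning of 'Mean PSI')."""
    values = [
        psi(baseline_features[name], current_features[name], edges_by_feature[name])
        for name in baseline_features
    ]
    return sum(values) / len(values)

# Toy example: one stable feature and one shifted feature.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
edges = [0.25, 0.5, 0.75]

psi_stable = psi(baseline, baseline, edges)
psi_shifted = psi(baseline, shifted, edges)
overall = mean_psi(
    {"f1": baseline, "f2": baseline},
    {"f1": baseline, "f2": shifted},
    {"f1": edges, "f2": edges},
)
print(psi_stable, psi_shifted, overall)
```

An identical distribution yields a PSI of exactly zero, and any shift in bin shares pushes it above zero, so a small mean PSI indicates that feature distributions in a period stayed close to the baseline.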

What the two case studies together prove

One model failed. One held. Sentinel told the difference.

🔴 Case Study 1 — ULB Credit Card

✕ Recall: 88.9% → 60.0% (−29 pts)

✕ PSI climbed 5× on Amount_log

✕ RECALL_ALERT fired at Period 5

✓ Sentinel caught it correctly

✅ Case Study 2 — IEEE-CIS

✓ Recall: 65.3–72.5% (stable band)

✓ PSI stayed within safe range

✓ No alerts — 5/5 periods healthy

✓ Sentinel stayed silent correctly

Together, these two studies apply the same monitoring methodology to two different outcomes: one model degrading, correctly flagged; one model holding steady, correctly left alone. That's the minimum viable test of whether a monitoring system is actually measuring something real rather than generating noise.

Methodology note

This evaluation uses the public IEEE-CIS Fraud Detection dataset and simulated production-drift periods constructed from held-out test data, not live customer transactions. Alert thresholds (60% recall floor) are set per-model based on validated baseline performance, not a fixed universal number.
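The recall-floor check itself is simple to reproduce from the per-period recall values reported in the drift table. The sketch below is illustrative only: `recall_alerts` is a hypothetical helper, not Sentinel's API, and the degraded value in the last example is invented to show what a breach would look like.

```python
RECALL_FLOOR = 0.60  # per-model alert floor stated for this case study

# Recall per period, copied from the drift monitoring table.
recall_by_period = {"P1": 0.718, "P2": 0.703, "P3": 0.653, "P4": 0.725, "P5": 0.708}

def recall_alerts(recalls, floor):
    """Return the periods whose recall fell below the alert floor."""
    return [period for period, recall in recalls.items() if recall < floor]

alerts = recall_alerts(recall_by_period, RECALL_FLOOR)
band = (min(recall_by_period.values()), max(recall_by_period.values()))
print(alerts, band)

# Hypothetical degraded period, included only to show a breach being flagged.
degraded = {**recall_by_period, "P6": 0.55}
print(recall_alerts(degraded, RECALL_FLOOR))
```

Because the floor is set per model, the same function would be called with a different `floor` value for a model with a different validated baseline.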

Can your production model tell you if it's still working?

A Sentinel audit answers that in 5–7 days — without system access, using your existing prediction logs.