When a Series B fintech client asked Vextrosys to replace a batch fraud system that cleared transactions in 15-45 minutes, the target was blunt: sub-200ms p99 scoring at 50,000 transactions per second during peak settlement windows, without blocking legitimate high-value transfers. This is how we designed, shipped, and operated that pipeline in production for nine months.
The problem framing: latency vs. recall
Fraud teams care about two curves that move in opposite directions. Tightening rules reduces fraud loss but increases false positives - each false positive is a support ticket, a churn risk, and often a regulatory complaint. Our north-star metric was expected fraud loss per $1M GMV, with two guardrails: false-positive rate (FPR) below 0.08% for card-not-present transactions and manual review queue depth under 2% of volume. Every scored transaction lands in one of three decision tiers:
- Hard decline: model score above 0.92 with two corroborating rule signals
- Soft decline + step-up auth: score 0.75-0.92 or velocity anomaly without device match
- Allow with async enrichment: score below 0.75; shadow scoring for model drift monitoring
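The three tiers can be read as a small decision function. The sketch below is illustrative only: the enum, function name, and argument names are hypothetical, and it assumes that a score above 0.92 without two corroborating rule signals degrades to step-up auth, which the tiers above do not spell out.

```python
from enum import Enum


class Decision(Enum):
    HARD_DECLINE = "hard_decline"
    STEP_UP = "soft_decline_step_up"
    ALLOW = "allow_async_enrichment"


HARD_THRESHOLD = 0.92  # thresholds taken from the tiers above
SOFT_THRESHOLD = 0.75


def decide(score: float, rule_signals: int,
           velocity_anomaly: bool, device_match: bool) -> Decision:
    """Map a calibrated score plus rule context onto the three tiers."""
    if score > HARD_THRESHOLD and rule_signals >= 2:
        return Decision.HARD_DECLINE
    # Assumption: a high score with fewer than two corroborating signals
    # falls through to step-up auth instead of a hard decline.
    if score >= SOFT_THRESHOLD or (velocity_anomaly and not device_match):
        return Decision.STEP_UP
    return Decision.ALLOW
```

For example, `decide(0.95, 1, False, True)` returns `Decision.STEP_UP` under that assumption, while the same score with two rule signals is a hard decline.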
Vextrosys principle
Never ship a single-threshold classifier to production. Layer deterministic rules for known fraud patterns (stolen BIN lists, geo-velocity impossibilities) under a probabilistic model so you can explain every decline to compliance.
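As one example of a deterministic rule that is easy to explain to compliance, a geo-velocity check can be sketched as below. This is not the production rule: the `GeoPoint` type, the 900 km/h ceiling, and the 1 km tolerance for simultaneous events are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumption: roughly airliner cruising speed


@dataclass(frozen=True)
class GeoPoint:
    lat: float      # degrees
    lon: float      # degrees
    epoch_s: float  # event time in seconds


def haversine_km(a: GeoPoint, b: GeoPoint) -> float:
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geo_velocity_impossible(prev: GeoPoint, curr: GeoPoint,
                            max_kmh: float = MAX_PLAUSIBLE_KMH) -> bool:
    """True when the implied travel speed between two events is implausible."""
    distance = haversine_km(prev, curr)
    hours = (curr.epoch_s - prev.epoch_s) / 3600.0
    if hours <= 0:
        # Simultaneous or out-of-order events: flag only if clearly far apart.
        return distance > 1.0
    return distance / hours > max_kmh
```

A rule like this yields a reason code ("two authorizations 5,500 km apart within one hour") that holds regardless of what the model scored.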
Architecture overview
We chose an event-driven topology: payment gateways emit canonical TransactionAuthorized events to Kafka; a Flink job maintains rolling feature windows; an online feature store serves point-in-time correct vectors; a Triton ensemble returns scores; decisions are written to DynamoDB and emitted to a case-management topic for analyst review.
Payment GW → Kafka (txn.authorized.v2)
↓
Flink (5m / 1h / 24h windows)
↓
Feast online store (Redis cluster)
↓
Triton: GBDT + shallow NN ensemble
↓
Decision service → DynamoDB + SNS (cases)

Why Flink over Spark Streaming
Spark Structured Streaming would have worked for 10k TPS, but keyed state and custom window triggers for device fingerprint cardinality were simpler in Flink. We run 32 TaskManagers with RocksDB state on NVMe, checkpointing every 60s to S3 with exactly-once semantics aligned to Kafka transactional IDs.
- Ingest: Avro schemas in Schema Registry; reject unknown fields at the edge
- Normalize: map 14 gateway payload shapes to internal Transaction v3
- Enrich: IP reputation, device graph lookup, merchant risk tier from Postgres read replica
- Score: batch requests of 64 transactions to Triton to amortize gRPC overhead
- Decide: rules engine (Drools) + model score fusion
- Audit: immutable decision log to S3 via Firehose for 7-year retention
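The Score stage batches 64 transactions per request. A batcher of that kind usually also needs a deadline so a partial batch is not held during quiet periods. The sketch below shows the idea with a hypothetical `MicroBatcher` class; the 5 ms maximum wait is an assumption, and the clock is passed in explicitly to keep the example deterministic.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class MicroBatcher(Generic[T]):
    """Collect items and release them when the batch is full or too old."""

    def __init__(self, max_size: int = 64, max_wait_s: float = 0.005):
        self.max_size = max_size      # 64 matches the Score stage above
        self.max_wait_s = max_wait_s  # assumption: 5 ms latency budget
        self._items: List[T] = []
        self._oldest: Optional[float] = None

    def add(self, item: T, now: float) -> Optional[List[T]]:
        """Queue an item; return a batch if one is ready to send."""
        if not self._items:
            self._oldest = now
        self._items.append(item)
        return self.poll(now)

    def poll(self, now: float) -> Optional[List[T]]:
        """Return the pending batch if it is full or has waited long enough."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_size
        expired = now - self._oldest >= self.max_wait_s
        if full or expired:
            batch, self._items, self._oldest = self._items, [], None
            return batch
        return None
```

In a real service `poll` would be driven by a timer and `now` would come from a monotonic clock; the returned list is what gets sent to the inference server in one request.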
Feature engineering that actually moved AUC
The biggest lift came from graph-derived features, not from swapping XGBoost for a deep network. We built a device-user-merchant bipartite graph updated in Flink and materialized counts: unique devices per user in 24h, failed auth attempts per device, amount velocity vs. 30-day P95 for that merchant category.
- Point-in-time joins via Feast: each training row is joined to feature values as of its event timestamp, so training sees what online serving saw
- Embeddings for merchant MCC + country pairs (32-dim) from a weekly Word2Vec job
- Recency-weighted aggregates: exponential decay half-life 6h for velocity features
- Cold-start bucket: new users routed to a conservative rules-heavy policy for 72h
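The recency-weighted aggregates can be maintained incrementally: with a half-life of 6h, an amount seen `dt` seconds ago contributes `amount * 0.5 ** (dt / 21600)`. The sketch below is a plain-Python illustration of that update rule, not the Flink implementation; the `DecayedSum` class is hypothetical, and late-arriving events are assumed to be added at full weight.

```python
from dataclasses import dataclass

HALF_LIFE_S = 6 * 3600  # 6h half-life from the list above


@dataclass
class DecayedSum:
    """Exponentially decayed running sum, updated one event at a time."""
    value: float = 0.0
    last_ts: float = 0.0

    def current(self, ts: float) -> float:
        """Decayed value as of time ts, without mutating state."""
        dt = max(0.0, ts - self.last_ts)  # assumption: ignore negative gaps
        return self.value * 0.5 ** (dt / HALF_LIFE_S)

    def update(self, amount: float, ts: float) -> float:
        """Decay the stored sum to ts, then add the new amount."""
        self.value = self.current(ts) + amount
        self.last_ts = max(ts, self.last_ts)
        return self.value
```

Only two numbers are stored per key, which is what makes this form attractive for keyed state compared with keeping every event in the window.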
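# Feast feature view (simplified)
from datetime import timedelta

from feast import FeatureView, Field
from feast.types import Float32, Int32

# `transaction` is the Entity defined elsewhere in the feature repo.
txn_velocity_5m = FeatureView(
    name="txn_velocity_5m",
    entities=[transaction],
    ttl=timedelta(hours=25),
    schema=[Field(name="amount_sum_5m", dtype=Float32), Field(name="count_5m", dtype=Int32)],
    online=True,
    source=KafkaFlinkSource(...),  # placeholder: stream source definition elided
)

Model stack and calibration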
Production ensemble: calibrated XGBoost (600 trees, max_depth=8) for tabular features plus a 3-layer MLP on embeddings. We retrain weekly on rolling 90-day windows with stratified sampling to preserve rare fraud classes. Platt scaling on a holdout week prevents score drift from breaking rule thresholds.
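Platt scaling fits a sigmoid `p = 1 / (1 + exp(-(a * s + b)))` to holdout scores and labels. The dependency-free sketch below fits `a` and `b` by plain gradient descent on log loss; it omits the smoothed targets from Platt's original method, and the function names, learning rate, and epoch count are illustrative assumptions. In practice a library routine such as scikit-learn's `CalibratedClassifierCV(method="sigmoid")` is the more likely choice.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.5, epochs: int = 2000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) on holdout data by gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Turn a raw model score into a calibrated probability."""
    return _sigmoid(a * score + b)
```

Because the mapping is monotonic, ranking is unchanged; what changes is that a calibrated 0.92 means roughly the same thing after each weekly retrain.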
The model that wins offline is rarely the model you can explain to a regulator. We kept the GBDT as the primary narrative layer and used the NN only where AUC gain exceeded 0.4 points on the fraud slice.
Champion/challenger and shadow mode
Every new model version runs in shadow for 72 hours minimum: scores logged, no customer impact. We promote when fraud capture rate improves ≥1.5% at fixed FPR on shadow traffic, and p99 latency stays under 180ms. Rollback is a feature flag on the decision service, not a redeploy.
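The promotion gate can be expressed as a check over shadow-traffic metrics. The sketch below is one reading of it, with hypothetical function names: it assumes the 1.5% improvement is a relative lift in capture rate (it may be meant as absolute points) and computes capture at a fixed FPR by thresholding on the legitimate-score distribution.

```python
from typing import Sequence


def capture_at_fpr(scores: Sequence[float], labels: Sequence[int],
                   max_fpr: float) -> float:
    """Fraud capture rate at the lowest threshold that keeps FPR <= max_fpr."""
    legit = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    fraud = [s for s, y in zip(scores, labels) if y == 1]
    if not legit or not fraud:
        raise ValueError("need both legitimate and fraud examples")
    allowed_fp = int(max_fpr * len(legit))  # legit txns we may flag
    # Flag only scores strictly above the (allowed_fp + 1)-th highest legit score.
    threshold = legit[allowed_fp] if allowed_fp < len(legit) else float("-inf")
    return sum(s > threshold for s in fraud) / len(fraud)


def should_promote(champion_capture: float, challenger_capture: float,
                   challenger_p99_ms: float, shadow_hours: float,
                   min_lift: float = 0.015, max_p99_ms: float = 180.0,
                   min_shadow_hours: float = 72.0) -> bool:
    """Apply the shadow-duration, capture-lift, and latency criteria."""
    if shadow_hours < min_shadow_hours or champion_capture <= 0:
        return False
    # Assumption: ">= 1.5%" is interpreted as relative lift over the champion.
    lift = (challenger_capture - champion_capture) / champion_capture
    return lift >= min_lift and challenger_p99_ms < max_p99_ms
```

Both models should be evaluated with the same `max_fpr` on the same shadow window so the capture rates are comparable.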
Production incident lesson
A Redis cluster failover during AZ maintenance caused 400ms feature lookup spikes. We added local LRU caches for the top 10k merchant/device keys and circuit breakers that fall back to rules-only mode - better to trigger step-up auth than to time out the payment path.
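The cache-plus-breaker pattern from this incident can be sketched as follows. All class and function names are hypothetical, the failure threshold and reset interval are assumptions, and the remote fetch is assumed to raise on timeout. Returning `None` is the signal for the caller to switch to rules-only mode.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUCache:
    """Bounded in-process cache that evicts the least recently used key."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry


class CircuitBreaker:
    """Open after repeated failures; allow a probe after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        return now - self._opened_at >= self.reset_after_s  # half-open probe

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self._failures, self._opened_at = 0, None
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now


def get_features(key: Hashable, fetch_remote: Callable[[Hashable], Any],
                 cache: LRUCache, breaker: CircuitBreaker,
                 now: float) -> Optional[Any]:
    """Return features, or None to tell the caller to use rules-only mode."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    if not breaker.allow(now):
        return None  # breaker open: skip the remote call entirely
    try:
        features = fetch_remote(key)
    except Exception:
        breaker.record(False, now)
        return None
    breaker.record(True, now)
    cache.put(key, features)
    return features
```

The sketch leaves out cache TTLs, so cached features can go stale; a production version would bound their age as well as their count.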
MLOps, monitoring, and governance
- Evidently AI dashboards: PSI on top 40 features, score distribution, segment FPR
- PagerDuty on p99 latency, Kafka consumer lag, and fraud $/hour vs. 7-day baseline
- Model cards + bias review for country and merchant-size segments quarterly
- Analyst feedback loop: confirmed fraud labels backfill training within 24h via Airflow
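The PSI tracked on those dashboards compares a baseline feature distribution with a recent one: PSI = sum over bins of (actual_i - expected_i) * ln(actual_i / expected_i). The sketch below is a minimal stand-alone version for intuition, not how Evidently computes it; equal-width bins derived from the baseline and the `eps` floor for empty bins are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though alert thresholds are best tuned per feature.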
Results and tradeoffs
At steady state: p99 scoring 162ms, 50.3k peak TPS sustained, fraud capture +23% vs. legacy rules at the same FPR, infrastructure cost ~$47k/month on AWS (Flink + Redis + Triton on g5.xlarge). The tradeoff we accept: operational complexity. This is not a "deploy sklearn to Lambda" problem - it requires a platform team and on-call rotation.
- Start with rules + batch scores if you are under 2k TPS - validate economics first
- Invest in canonical events and feature store before model sophistication
- Plan for explainability and audit from day one, not after the first regulatory exam
- Budget 20% engineering time for analyst tooling; labels are your real bottleneck
If you are scaling past 10k TPS or entering a new market with different fraud typologies, we typically run a two-week architecture assessment before writing Flink jobs. The patterns above are battle-tested; the feature definitions are always domain-specific.