How We Built a Real-Time Fraud Detection System That Processes 50,000 TPS

When a Series B fintech client asked Vextrosys to replace a batch fraud system that cleared transactions in 15-45 minutes, the target was blunt: sub-200ms p99 scoring at 50,000 transactions per second during peak settlement windows, without blocking legitimate high-value transfers. This is how we designed, shipped, and operated that pipeline in production for nine months.

The problem framing: latency vs. recall

Fraud teams care about two curves that move in opposite directions. Tightening rules reduces fraud loss but increases false positives - each false positive is a support ticket, a churn risk, and often a regulatory complaint. Our north-star metric was expected fraud loss per $1M GMV, with two guardrails: false-positive rate (FPR) below 0.08% for card-not-present transactions and manual review queue depth under 2% of volume. Those guardrails translated into three decision tiers:

Hard decline: model score above 0.92 with two corroborating rule signals
Soft decline + step-up auth: score 0.75-0.92 or velocity anomaly without device match
Allow with async enrichment: score below 0.75; shadow scoring for model drift monitoring
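
The three tiers above can be sketched as a small pure function. This is an illustrative sketch, not the production decision service: the `Signals` fields and the `decide` name are hypothetical, and the handling of a score above 0.92 with fewer than two corroborating rules (routed to step-up auth here) is an assumption the tiers do not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    HARD_DECLINE = "hard_decline"
    SOFT_DECLINE = "soft_decline_step_up"
    ALLOW = "allow_async_enrichment"


@dataclass(frozen=True)
class Signals:
    """Hypothetical inputs to the decision step."""
    score: float               # calibrated model score in [0, 1]
    corroborating_rules: int   # number of deterministic rule signals that fired
    velocity_anomaly: bool
    device_match: bool


HARD_THRESHOLD = 0.92
SOFT_THRESHOLD = 0.75


def decide(s: Signals) -> Decision:
    # Hard decline needs both a very high score and two rule signals.
    if s.score > HARD_THRESHOLD and s.corroborating_rules >= 2:
        return Decision.HARD_DECLINE
    # Assumption: a high score without enough corroboration falls through
    # to step-up auth rather than being allowed.
    if s.score >= SOFT_THRESHOLD or (s.velocity_anomaly and not s.device_match):
        return Decision.SOFT_DECLINE
    return Decision.ALLOW
```

Keeping the thresholds as named constants makes it easier to quote the exact rule that fired when explaining a decline.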

Vextrosys principle

Never ship a single-threshold classifier to production. Layer deterministic rules for known fraud patterns (stolen BIN lists, geo-velocity impossibilities) under a probabilistic model so you can explain every decline to compliance.

Architecture overview

We chose an event-driven topology: payment gateways emit canonical TransactionAuthorized events to Kafka; a Flink job maintains rolling feature windows; an online feature store serves point-in-time correct vectors; a Triton ensemble returns scores; decisions are written to DynamoDB and emitted to a case-management topic for analyst review.

Payment GW → Kafka (txn.authorized.v2)
              ↓
         Flink (5m / 1h / 24h windows)
              ↓
    Feast online store (Redis cluster)
              ↓
   Triton: GBDT + shallow NN ensemble
              ↓
 Decision service → DynamoDB + SNS (cases)

Why Flink over Spark Streaming

Spark Structured Streaming would have worked for 10k TPS, but keyed state and custom window triggers for device fingerprint cardinality were simpler in Flink. We run 32 TaskManagers with RocksDB state on NVMe, checkpointing every 60s to S3 with exactly-once semantics aligned to Kafka transactional IDs.

Ingest: Avro schemas in Schema Registry; reject unknown fields at the edge
Normalize: map 14 gateway payload shapes to internal Transaction v3
Enrich: IP reputation, device graph lookup, merchant risk tier from Postgres read replica
Score: batch requests of 64 transactions to Triton to amortize gRPC overhead
Decide: rules engine (Drools) + model score fusion
Audit: immutable decision log to S3 via Firehose for 7-year retention
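
The Score stage batches 64 transactions per Triton request. A minimal sketch of that micro-batching idea follows, using only the standard library. The `MicroBatcher` class and the 5 ms flush timeout are assumptions for illustration; the article only states the batch size, and the actual Triton client call is left out.

```python
import time
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class MicroBatcher(Generic[T]):
    """Collects items and releases them in batches by size or by age."""

    def __init__(self, max_size: int = 64, max_wait_s: float = 0.005,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.max_wait_s = max_wait_s   # assumed timeout, not from the article
        self._clock = clock
        self._items: List[T] = []
        self._first_at: Optional[float] = None

    def add(self, item: T) -> Optional[List[T]]:
        """Buffer one item; return a full batch once max_size is reached."""
        if not self._items:
            self._first_at = self._clock()
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._drain()
        return None

    def poll(self) -> Optional[List[T]]:
        """Return a partial batch if the oldest item has waited too long."""
        if self._items and self._clock() - self._first_at >= self.max_wait_s:
            return self._drain()
        return None

    def _drain(self) -> List[T]:
        batch, self._items, self._first_at = self._items, [], None
        return batch
```

The age-based flush matters at off-peak volume: without it, a quiet period would hold a partial batch indefinitely and blow the latency budget.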

Feature engineering that actually moved AUC

The biggest lift came from graph-derived features, not from swapping XGBoost for a deep network. We built a device-user-merchant graph, updated in Flink, and materialized counts from it: unique devices per user in 24h, failed auth attempts per device, and amount velocity vs. the 30-day P95 for that merchant category.

Point-in-time joins via Feast: training rows are joined to feature values as of each transaction's event timestamp, matching what online serving would have returned
Embeddings for merchant MCC + country pairs (32-dim) from a weekly Word2Vec job
Recency-weighted aggregates: exponential decay half-life 6h for velocity features
Cold-start bucket: new users routed to a conservative rules-heavy policy for 72h
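
The recency-weighted aggregate in the list above can be kept as a single running value per key. The sketch below assumes a simple decayed sum with the 6h half-life; the `DecayedSum` class is hypothetical, and late events are added at full weight as a simplification.

```python
from dataclasses import dataclass

HALF_LIFE_S = 6 * 3600  # 6h half-life, as stated in the list above


@dataclass
class DecayedSum:
    """Running sum where each contribution halves every HALF_LIFE_S seconds."""
    value: float = 0.0
    last_ts: float = 0.0

    def read(self, ts: float) -> float:
        """Value of the aggregate as of ts, without mutating state."""
        dt = max(0.0, ts - self.last_ts)
        return self.value * 0.5 ** (dt / HALF_LIFE_S)

    def update(self, amount: float, ts: float) -> float:
        """Decay the stored sum to ts, then add the new amount."""
        self.value = self.read(ts) + amount
        self.last_ts = max(self.last_ts, ts)
        return self.value


agg = DecayedSum()
agg.update(100.0, ts=0.0)
print(agg.read(6 * 3600))  # half of the original amount after one half-life
```

Storing only the value and last timestamp keeps the per-key state constant in size, which suits keyed state in a stream processor.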

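# Feast feature view (simplified)
from datetime import timedelta

from feast import FeatureView, Field
from feast.types import Float32, Int32

# `transaction` (an Entity) and `KafkaFlinkSource` are not defined in this excerpt.
txn_velocity_5m = FeatureView(
    name="txn_velocity_5m",
    entities=[transaction],
    ttl=timedelta(hours=25),
    schema=[
        Field(name="amount_sum_5m", dtype=Float32),
        Field(name="count_5m", dtype=Int32),
    ],
    online=True,
    source=KafkaFlinkSource(...),
)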

Model stack and calibration

Production ensemble: calibrated XGBoost (600 trees, max_depth=8) for tabular features plus a 3-layer MLP on embeddings. We retrain weekly on rolling 90-day windows with stratified sampling to preserve rare fraud classes. Platt scaling on a holdout week prevents score drift from breaking rule thresholds.
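
Platt scaling fits a sigmoid over raw model scores on held-out data so that the output reads as a probability. The sketch below is a minimal pure-Python version using plain gradient descent on log loss. It is an assumption-laden illustration (Platt's original method uses smoothed targets and a second-order solver), and `fit_platt` and `calibrate` are hypothetical names.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.5, epochs: int = 5000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by minimizing mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y   # d(log loss)/d(logit)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability."""
    return _sigmoid(a * score + b)
```

Because the mapping is monotonic, it does not change ranking metrics such as AUC; it only changes where fixed thresholds like 0.75 and 0.92 fall relative to the raw score.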

The model that wins offline is rarely the model you can explain to a regulator. We kept the GBDT as the primary narrative layer and used the NN only where AUC gain exceeded 0.4 points on the fraud slice.

Champion/challenger and shadow mode

Every new model version runs in shadow for 72 hours minimum: scores logged, no customer impact. We promote when fraud capture rate improves ≥1.5% at fixed FPR on shadow traffic, and p99 latency stays under 180ms. Rollback is a feature flag on the decision service, not a redeploy.
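
The promotion criteria can be written down as an explicit gate. This is a sketch under one stated assumption: the 1.5% improvement is read as relative lift in capture rate at the fixed FPR (the text does not say whether it is relative or absolute). `ShadowReport` and `should_promote` are hypothetical names.

```python
from dataclasses import dataclass

MIN_SHADOW_HOURS = 72
MIN_CAPTURE_LIFT = 0.015   # assumption: relative improvement, not percentage points
MAX_P99_MS = 180.0


@dataclass(frozen=True)
class ShadowReport:
    shadow_hours: float
    champion_capture: float     # fraud capture rate at the fixed FPR
    challenger_capture: float   # same metric for the shadow model
    challenger_p99_ms: float


def should_promote(r: ShadowReport) -> bool:
    """Apply the shadow-duration, capture-lift, and latency checks."""
    if r.shadow_hours < MIN_SHADOW_HOURS:
        return False
    if r.champion_capture <= 0:
        return False  # no baseline to compare against
    lift = (r.challenger_capture - r.champion_capture) / r.champion_capture
    return lift >= MIN_CAPTURE_LIFT and r.challenger_p99_ms < MAX_P99_MS
```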

Production incident lesson

A Redis cluster failover during AZ maintenance caused 400ms feature lookup spikes. We added local LRU caches for top 10k merchant/device keys and circuit-breakers that fall back to rules-only mode - better to step-up auth than time out the payment path.
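
A minimal sketch of the cache-plus-circuit-breaker pattern described above. Everything here is illustrative: `FeatureClient`, the failure threshold, and the reset interval are assumptions, the `fetch` callable stands in for the real feature-store lookup, and cache expiry is omitted for brevity. Returning `None` is the signal for the caller to switch to rules-only mode.

```python
import time
from collections import OrderedDict
from typing import Callable, Dict, Optional

Features = Dict[str, float]


class FeatureClient:
    def __init__(self, fetch: Callable[[str], Features], cache_size: int = 10_000,
                 failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._cache: "OrderedDict[str, Features]" = OrderedDict()
        self._cache_size = cache_size
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def _breaker_open(self) -> bool:
        return (self._opened_at is not None
                and self._clock() - self._opened_at < self._reset_after_s)

    def get(self, key: str) -> Optional[Features]:
        """Return features, or None to tell the caller to use rules only."""
        if key in self._cache:
            self._cache.move_to_end(key)       # mark as most recently used
            return self._cache[key]
        if self._breaker_open():
            return None                        # skip the remote call entirely
        try:
            value = self._fetch(key)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return None
        self._failures = 0
        self._opened_at = None
        self._cache[key] = value
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)    # evict least recently used
        return value
```

Once the reset interval passes, the next cache miss acts as a trial call: a success closes the breaker, and a failure re-opens it immediately.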

MLOps, monitoring, and governance

Evidently AI dashboards: PSI on top 40 features, score distribution, segment FPR
PagerDuty on p99 latency, Kafka consumer lag, and fraud $/hour vs. 7-day baseline
Model cards + bias review for country and merchant-size segments quarterly
Analyst feedback loop: confirmed fraud labels backfill training within 24h via Airflow
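
The PSI (population stability index) tracked on those dashboards compares a feature's binned distribution in a baseline window against a recent window. Below is a minimal standard-library sketch of the usual formula, the sum over bins of (actual - expected) * ln(actual / expected). The bin edges, the epsilon floor for empty bins, and the function names are assumptions; this is not how Evidently computes it internally.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin; edges must be sorted ascending."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    # Floor at eps so empty bins do not produce log(0) or division by zero.
    return [max(c / total, eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index between a baseline and a recent sample."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, though alert thresholds are usually tuned per feature.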

Results and tradeoffs

At steady state: p99 scoring 162ms, 50.3k peak TPS sustained, fraud capture +23% vs. legacy rules at the same FPR, infrastructure cost ~$47k/month on AWS (Flink + Redis + Triton on g5.xlarge). The tradeoff we accept: operational complexity. This is not a "deploy sklearn to Lambda" problem - it requires a platform team and on-call rotation.

Start with rules + batch scores if you are under 2k TPS - validate economics first
Invest in canonical events and feature store before model sophistication
Plan for explainability and audit from day one, not after the first regulatory exam
Budget 20% engineering time for analyst tooling; labels are your real bottleneck

If you are scaling past 10k TPS or entering a new market with different fraud typologies, we typically run a two-week architecture assessment before writing Flink jobs. The patterns above are battle-tested; the feature definitions are always domain-specific.
