How to scale your casino platform
1) What exactly should scale
Traffic and sessions: bursts from SEO/streams/tournaments (tens of thousands of RPS for reads, thousands of RPS for writes).
Money loop: bets/settlements/deposits/cashouts - integrity and latency come first.
Payments: PSP routing, cascades, multiple geos and merchant accounts.
Content: hundreds of providers, tens of thousands of gaming sessions in parallel.
Data: real-time KPI data marts and reporting without blocking OLTP.
Compliance: Real-time RG/AML/KYC.
2) Architectural foundations of scalability
2.1 Layers and division of responsibilities
Edge: API-gateway, WAF/bot-protection, rate-limit, geo-filters.
Domain services: Wallet/Ledger, Cashier, Game Gateway, Bonus, RG, Risk/AML, PAM, Affiliates, CRM.
Data layer: event bus (Kafka/Pulsar), queues (SQS/Rabbit), caches (Redis), OLTP (Postgres/Oracle), OLAP (ClickHouse/BigQuery).
Observability/SecOps: metrics/traces/logs, SIEM/SOAR, Vault/HSM.
2.2 Event model + CQRS
Commands (writes) - strictly through domain APIs;
Reads - through projections (indexed views/caches/OLAP).
Outbox/CDC: each event is published atomically with the transaction; analytics "listens" to the bus, not the production database.
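The outbox idea above can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the OLTP store; the table names (`bets`, `outbox`), the `publish` callback and the function names are all hypothetical. The point it shows is that the business row and the event row are written in one transaction, and a separate relay later ships unpublished rows to the bus.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bets (id TEXT PRIMARY KEY, player_id TEXT, amount_cents INTEGER);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def place_bet(conn: sqlite3.Connection, bet_id: str, player_id: str, amount_cents: int) -> None:
    """Write the bet and its event in ONE transaction: both commit or both roll back."""
    event = {"bet_id": bet_id, "player_id": player_id, "amount_cents": amount_cents}
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO bets VALUES (?, ?, ?)", (bet_id, player_id, amount_cents))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("bet.placed", json.dumps(event)),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Ship unpublished outbox rows to the bus in order; returns how many were sent.

    The row is marked only after publish() returns, so delivery is at-least-once
    and consumers are assumed to deduplicate.
    """
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In a real deployment the relay would typically be a CDC connector or a background worker publishing to Kafka/Pulsar; the callback here only marks where that integration would sit.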
2.3 Sagas and idempotency
Long processes (deposit, cashout, bonus, tournament awards) - orchestrated by sagas.
All cash and bonus commands are idempotent (Idempotency-Key + deduplication).
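A minimal sketch of the Idempotency-Key + deduplication rule, assuming an in-memory store; the `Wallet` class and its field names are hypothetical, and a production service would keep the key-to-result map in the same transactional store as the balance. A repeated command with the same key returns the stored result instead of applying the change again.

```python
from dataclasses import dataclass, field


@dataclass
class Wallet:
    balance_cents: int = 0
    # Idempotency-Key -> result of the first successful execution (illustrative store)
    _results: dict = field(default_factory=dict)

    def credit(self, idempotency_key: str, amount_cents: int) -> dict:
        """Apply a credit once per key; replays return the original result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance_cents += amount_cents
        result = {"status": "ok", "balance_cents": self.balance_cents}
        self._results[idempotency_key] = result
        return result
```

With this shape, a client or saga step that times out can safely retry with the same key: the balance changes once, and both calls see the same response.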
3) Scaling the money flows (priority No. 1)
3.1 Ledger as a standalone service
ACID-DB with double entry (debit/credit), immutable transactions, WORM audit log.
Targets: wallet p95 < 150 ms, "lost/duplicated settlements" = 0.
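To make the double-entry rule concrete, here is a small in-memory sketch. The `Entry` and `Ledger` names, the account naming scheme and the credit-normal balance convention for player accounts are assumptions for illustration; a real ledger would sit on an ACID database with an append-only audit log. It shows two properties: a transaction is rejected unless debits equal credits, and balances are derived from the immutable journal rather than stored and edited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # entries are immutable once created
class Entry:
    account: str
    debit_cents: int = 0
    credit_cents: int = 0


class Ledger:
    def __init__(self) -> None:
        self._journal = []  # append-only list of (tx_id, entries)

    def post(self, tx_id: str, entries: list) -> None:
        """Append a transaction only if total debits equal total credits."""
        debits = sum(e.debit_cents for e in entries)
        credits = sum(e.credit_cents for e in entries)
        if debits != credits or debits == 0:
            raise ValueError(f"unbalanced transaction {tx_id}: {debits} != {credits}")
        self._journal.append((tx_id, tuple(entries)))

    def balance(self, account: str) -> int:
        """Credits minus debits for one account, computed from the journal."""
        total = 0
        for _, entries in self._journal:
            for e in entries:
                if e.account == account:
                    total += e.credit_cents - e.debit_cents
        return total
```

For example, a deposit debits a PSP clearing account and credits the player, and a bet debits the player and credits a house account; the sum over all accounts stays zero at every step.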
3.2 The cache helps, but it is not the source of truth
Redis for limits, balance projections, short-lived locks on critical sections; the wallet remains the source of truth.
Protection against cache stampede (TTL + jitter, single-flight).
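The two stampede defenses named above can be sketched in-process. This is a simplified illustration using a dictionary and per-key locks; the class name, the default TTL and jitter values are assumptions, and with Redis the same idea would need a distributed lock or a request-coalescing layer. Single-flight means only one caller per key runs the loader while the others wait and reuse its result; jitter spreads expirations so keys written together do not expire together.

```python
import random
import threading
import time


class SingleFlightCache:
    def __init__(self, ttl_s: float = 30.0, jitter_s: float = 5.0) -> None:
        self._ttl_s = ttl_s
        self._jitter_s = jitter_s
        self._data = {}   # key -> (value, expires_at)
        self._locks = {}  # key -> lock serializing loads for that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        hit = self._data.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit
        return None

    def get(self, key, loader):
        hit = self._fresh(key)
        if hit is not None:
            return hit[0]
        with self._lock_for(key):
            # Re-check: another thread may have loaded the value while we waited.
            hit = self._fresh(key)
            if hit is not None:
                return hit[0]
            value = loader(key)
            # TTL + random jitter so entries do not all expire at the same instant.
            expires_at = time.monotonic() + self._ttl_s + random.uniform(0, self._jitter_s)
            self._data[key] = (value, expires_at)
            return value
```

Because the loader runs under the per-key lock, a burst of concurrent reads for one player's balance projection results in a single call to the wallet rather than one call per request.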
3.3 Horizontal scaling
Sharding by player_id/brand_id/region; hot shards are moved to dedicated nodes.
Read-replicas for projections/back office; OLTP ↔ OLAP are separated.
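A minimal routing sketch for the sharding rule above, assuming hash-by-player_id with an explicit override table for hot players; the function name, the shard naming and the override map are hypothetical. Note that plain modulo hashing remaps most keys when the shard count changes, so a real system might prefer consistent hashing or a directory service.

```python
import hashlib
from typing import Dict, Optional


def shard_for(player_id: str, shard_count: int,
              hot_overrides: Optional[Dict[str, str]] = None) -> str:
    """Return the shard name for a player.

    Hot players listed in hot_overrides are pinned to a dedicated node;
    everyone else is placed by a stable hash of player_id.
    """
    if hot_overrides and player_id in hot_overrides:
        return hot_overrides[player_id]
    # sha256 is stable across processes, unlike Python's built-in hash() for str.
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % shard_count
    return f"shard-{index}"
```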
4) Payments and PSP orchestration as you grow
Routing: by BIN/geo/risk score/amount; channels are re-evaluated dynamically.
Cascading: if PSP1 fails, fall back to PSP2 without losing the cart (idempotent tokens).
3-DS/AVS/velocity rules at the entry point; anti-fraud with graph links between cards/devices.
Reconciliations: daily auto-matching of PSP reports against ledger records; discrepancy alerts.
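The cascading rule can be sketched as a loop over PSP adapters. Everything named here is hypothetical: the `PspUnavailable` and `AllPspsFailed` exceptions, the adapter signature and the result shape. The sketch also assumes that only technical failures (timeouts, outages) trigger a fallback, which the text above does not specify. The same payment token is passed to every PSP so that a retried attempt can be deduplicated downstream.

```python
from typing import Callable, List, Tuple


class PspUnavailable(Exception):
    """Technical failure (timeout, outage) that justifies trying the next PSP."""


class AllPspsFailed(Exception):
    """Raised when every PSP in the cascade failed; carries the attempt log."""


# Adapter signature: (payment_token, amount_cents) -> PSP reference string
Charge = Callable[[str, int], str]


def charge_with_cascade(psps: List[Tuple[str, Charge]],
                        payment_token: str, amount_cents: int) -> dict:
    """Try PSPs in priority order, reusing one idempotent payment token."""
    attempts = []
    for name, charge in psps:
        try:
            reference = charge(payment_token, amount_cents)
        except PspUnavailable as exc:
            # Record the failure for telemetry and move on to the next PSP.
            attempts.append((name, f"unavailable: {exc}"))
            continue
        attempts.append((name, "ok"))
        return {"psp": name, "reference": reference, "attempts": attempts}
    raise AllPspsFailed(attempts)
```

The attempt log is returned on success and attached to the exception on failure, so the Cashier can feed per-PSP failure telemetry back into routing.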
5) Game Gateway and load spikes
A single gateway to providers (token handshake, health checks, a "no new sessions" degradation mode).
Back-pressure and queues for settlements, so that provider peaks do not take down the wallet.
Rate limits at the player/table/provider level; protection against in-game abuse.
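One way to picture back-pressure between providers and the wallet is a bounded buffer. This sketch uses the standard library `queue.Queue`; the `SettlementBuffer` class, the 80% threshold for refusing new sessions and the method names are assumptions. A rejected offer is not a lost settlement: the caller is expected to retry later with the same idempotency key.

```python
import queue


class SettlementBuffer:
    def __init__(self, capacity: int, session_threshold: float = 0.8) -> None:
        self._q = queue.Queue(maxsize=capacity)
        self._session_threshold = session_threshold  # illustrative value

    def offer(self, settlement) -> bool:
        """Enqueue without blocking; False tells the caller to back off and retry."""
        try:
            self._q.put_nowait(settlement)
            return True
        except queue.Full:
            return False

    def drain(self, wallet_apply, max_items: int) -> int:
        """Apply up to max_items settlements at the wallet's own pace."""
        done = 0
        while done < max_items:
            try:
                item = self._q.get_nowait()
            except queue.Empty:
                break
            wallet_apply(item)
            done += 1
        return done

    @property
    def accepting_new_sessions(self) -> bool:
        """Degradation signal: stop opening sessions when the buffer is nearly full."""
        return self._q.qsize() < self._session_threshold * self._q.maxsize
```

The wallet consumer controls the drain rate, so a provider spike fills the buffer and trips the "no new sessions" signal instead of overloading the ledger.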
6) Data and analytics without choking production
Outbox/CDC → stream to DWH; data mart delivery SLA ≤ 5 min.
KPI projections (FTD, NGR, ARPPU, Retention, LTV, Wager Progress, Risk flags) live in OLAP.
RLS/PII masking in storage; PII is held regionally (data residency).
7) Multi-region / Multi-brand
7.1 Geographic resilience
Active-active/active-passive by region, RPO ≤ 5 min, RTO ≤ 30 min.
Geo-sharding of PII/money (EU/UK/BR/...); cross-region requests to PII are prohibited.
7.2 Multi-brand
Shared integrations (Game Gateway, Bonus, Affiliates) + isolated Ledger/Cashier/PII per license/region.
Mandatory keys `tenant_id`/`brand_id`/`license` on every event-bus message.
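The mandatory-keys rule is easiest to enforce at the point where events are built. A minimal sketch, assuming a dictionary envelope; the `make_event` helper, the envelope layout and the `event_id` field are hypothetical, and in practice the same constraint would also be expressed in the Schema Registry contract.

```python
import uuid

REQUIRED_KEYS = ("tenant_id", "brand_id", "license")


def make_event(event_type: str, payload: dict, **routing: str) -> dict:
    """Build an event envelope, refusing to emit it without the tenancy keys."""
    missing = [k for k in REQUIRED_KEYS if not routing.get(k)]
    if missing:
        raise ValueError(f"event {event_type!r} is missing required keys: {missing}")
    envelope = {"event_id": str(uuid.uuid4()), "type": event_type}
    envelope.update({k: routing[k] for k in REQUIRED_KEYS})
    envelope["payload"] = payload
    return envelope
```

Rejecting at construction time means an untagged event never reaches the bus, so consumers can filter and isolate by brand or license without defensive checks.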
8) Observability, reliability, chaos engineering
Metrics: p95/p99 latency per service, error rate, saturation, business metrics (bets/min, settle lag, deposit success).
Tracing: a single `trace_id` through edge → domains → bus → consumers.
SLO-based alerting: error budgets and controlled degradation (bonus freeze, stop-new-sessions).
Chaos practices: regular injected PSP/provider/network failures to verify cascades and sagas.
9) RG/AML/KYC at scale
Synchronous stop signals at bet time (deposit/loss/time limits, self-exclusion).
Behavioral signal streams (long sessions, bet escalation), proactive notifications.
AML: sanctions lists/PEP, source of funds, SAR/STR - automated pipelines.
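A synchronous stop signal is a check that runs in the bet path and can veto the bet before the wallet is touched. The sketch below is illustrative only: the `PlayerLimits` fields, the reason codes and the conservative rule (treat the full stake as a potential loss) are assumptions, not regulatory guidance.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PlayerLimits:
    self_excluded: bool = False
    daily_loss_limit_cents: Optional[int] = None  # None means no limit set


def check_bet(limits: PlayerLimits, loss_today_cents: int,
              stake_cents: int) -> Tuple[bool, str]:
    """Return (allowed, reason). Called synchronously before debiting the wallet."""
    if limits.self_excluded:
        return False, "self_excluded"
    cap = limits.daily_loss_limit_cents
    # Conservative assumption: the whole stake could be lost.
    if cap is not None and loss_today_cents + stake_cents > cap:
        return False, "loss_limit"
    return True, "ok"
```

Keeping the function pure (limits and today's loss passed in) makes it cheap to run on every bet from a cached projection, while the reason code feeds notifications and audit logs.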
10) Security by default
Zero-trust: mTLS, short-lived tokens, RBAC/ABAC, principle of least privilege.
Secrets - Vault/HSM; KMS encryption at-rest, PAN tokenization (PCI DSS).
WAF/bot protection, device-fingerprinting, DLP; immutable audit (WORM).
11) FinOps: scaling without overspending
Autoscale by business metrics (bets/min, settle lag), not just CPU.
Spot/interruptible instances - for asynchronous consumers and ETL.
Quotas and budget alerts; cost tagging by service and brand.
Profiling queries/indexes; TTL policies for logs/metrics.
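Autoscaling on a business metric can use the same proportional rule commonly applied to CPU: scale the replica count by the ratio of the observed metric to its target. The sketch below applies it to settle lag; the function name, parameters and bounds are assumptions, and a production autoscaler would add smoothing and cooldowns to avoid flapping.

```python
import math


def desired_replicas(current: int, settle_lag_s: float, target_lag_s: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling: replicas grow with the ratio of observed lag to target lag."""
    if target_lag_s <= 0:
        raise ValueError("target_lag_s must be positive")
    raw = math.ceil(current * settle_lag_s / target_lag_s)
    # Clamp to the configured bounds so a metric glitch cannot scale to zero or infinity.
    return max(min_replicas, min(max_replicas, raw))
```

For instance, with 4 consumers, a 30 s settle lag and a 10 s target, the rule asks for 12 replicas; when lag falls to half the target it scales back toward the minimum.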
12) Evolution roadmap (if starting from a monolith)
1. Introduce the event bus and a shared event vocabulary (`bet.placed`, `bet.settled`, `wallet.debit/credit`, `deposit.succeeded`).
2. Move Ledger to a separate service/database with outbox and idempotency.
3. Separate Cashier (PSP orchestration, cascades, reconciliations).
4. Put a Game Gateway in front of providers, with a "no new sessions" degradation mode.
5. Move Bonus/RG to event subscriptions; prohibit manual edits.
6. Separate OLTP from OLAP and set up CDC streams into the DWH.
7. Enable observability (SLO-dashboards, tracing) and chaos exercises.
8. Prepare multi-region (data/keys/measures/PII), in order of geo priority.
13) Minimum SLOs for a mature platform
Core uptime (Wallet/Cashier/Game Gateway) ≥ 99.95%.
p95 Ledger <150 ms; Cashier authorization <3 s; deposit success ≥ 85% in target geo.
"Lost/Duplicated Settlements" = 0.
Delivery of events to BI ≤ 5 min.
MTTR of core incidents <30 min.
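As a worked example of what the uptime target implies, the availability percentage can be converted into a downtime budget. The 30-day window is an assumption (the targets above do not state a measurement window), and the function names are hypothetical.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, incident_minutes: list, window_days: int = 30) -> float:
    """Budget left after the listed incidents; negative means the SLO is breached."""
    return downtime_budget_minutes(slo, window_days) - sum(incident_minutes)
```

Under that assumption, 99.95% leaves about 21.6 minutes of downtime per 30 days. A single core incident that takes the full 30 minutes to resolve would therefore exceed the monthly budget on its own, which is why the MTTR target should be read as an upper bound rather than a typical value.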
14) Scalability Architect Checklist
- Domains are separated; money is a separate Ledger with outbox/CDC.
- Commands are idempotent; deduplication keys are everywhere.
- Game Gateway with back-pressure and degradation mode.
- Cashier: PSP cascades, retries, reconciliations, failure telemetry.
- CQRS/projections; OLTP and OLAP are physically separate.
- Event bus with Schema Registry; contract versioning.
- RG/AML - synchronous stop signals; logs and reports are automated.
- Observability: metrics/traces/logs with `trace_id` and brand/tenant tags.
- DR plan: active-active/active-passive, RPO ≤ 5 min, RTO ≤ 30 min; regular exercises.
- Security: mTLS, Vault/HSM, PCI/GDPR, WAF/bot protection, WORM audit.
- FinOps: autoscale by business metrics, budget alerts, cost tags.
15) Anti-patterns (red flags)
A single database "for everything"; BI queries hit production tables.
Manual edits of balances/bonuses in the database.
Publishing events outside the transaction (no outbox).
No degradation modes: "all or nothing."
Payment failures without cascades and telemetry.
No idempotency; retries create duplicate settlements.
No geo-isolation of PII and merchant keys.
Zero tracing: investigations last for hours.
Scaling a casino platform is not about "more servers" but about the right boundaries and an event-driven operating model: an isolated and fast money loop, a stable payment layer, a gateway to games with controlled degradation, OLTP/OLAP separation, observability and SRE/FinOps discipline. On such a foundation, the platform will ride out tournament peaks, new geos and dozens of brands without risk to players' money or its reputation.