Failover, replication and DR plans for casinos
1) Business objectives: RTO/RPO and critical flows
RTO (how long the service may be unavailable): login/bet/deposit - seconds to minutes; reports - hours.
RPO (how much data can be lost): wallet/transactions - ~ 0-30 seconds; telemetry - minutes.
Critical flows: login, deposit/withdrawal, bet/settlement, KYC/AML callbacks, PSP/game provider webhooks.
2) Architectural fault tolerance patterns
Active-Active (multi-region): both regions handle traffic; low RTO/RPO, complex consistency.
Active-Standby: one region serves traffic, the second is a hot standby; state is simpler to manage, RTO is in minutes.
Cell-based: isolation by "cells" (market/brand), local incidents do not bring everything down.
Edge layering: Anycast CDN/WAF → regional gateways → app clusters → DB/caches with replication.
3) Traffic management and network failover
Anycast + CDN/WAF: absorbs L3/4/7 attacks, runs health checks against the origin.
DNS failover (low TTL, multi-value records), Traffic Manager/GSLB driven by health metrics.
BGP announcement via anti-DDoS provider for fast path change.
Health check (example of logic):
if p95_latency > threshold or rate_5xx > threshold or synthetic_login_fail:
    drain(region_A); shift(traffic -> region_B, ramp=5min)
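The rule above can be sketched in Python. This is a minimal illustration, not a production controller: the threshold values and the names RegionHealth, should_fail_over and ramp_schedule are assumptions introduced here, and the actual drain/shift calls depend on your GSLB or traffic manager.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    p95_latency_ms: float
    error_5xx_rate: float      # fraction of responses, 0.0-1.0
    synthetic_login_ok: bool


# Hypothetical thresholds; derive real ones from your SLOs.
P95_LIMIT_MS = 800.0
ERROR_RATE_LIMIT = 0.02


def should_fail_over(h: RegionHealth) -> bool:
    """Return True when any of the three signals breaches its limit."""
    return (
        h.p95_latency_ms > P95_LIMIT_MS
        or h.error_5xx_rate > ERROR_RATE_LIMIT
        or not h.synthetic_login_ok
    )


def ramp_schedule(total_seconds: int, steps: int) -> list[tuple[int, int]]:
    """Split a traffic shift into equal steps: (seconds_from_start, percent_shifted)."""
    return [
        (total_seconds * i // steps, 100 * i // steps)
        for i in range(1, steps + 1)
    ]
```

For the 5-minute ramp in the example, ramp_schedule(300, 5) yields five steps of 20% each, one per minute.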
4) Data: wallet, orders, bets
The source of truth is the ledger: append-only, with idempotency keyed by 'operation_id'.
Reconciliation: periodic reconciliation jobs between ledger, PSP and game providers.
Duplicate protection: idempotency keys for deposits/callbacks/payouts; deduplication via outbox/inbox.
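A minimal sketch of an append-only ledger with idempotency by operation_id, using SQLite in memory purely for illustration. The table layout and the helpers post_entry and balance are assumptions, not a prescribed schema; a real wallet would use the production database and add permissions that forbid UPDATE/DELETE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger (
        operation_id TEXT PRIMARY KEY,   -- idempotency key
        account_id   TEXT NOT NULL,
        amount_cents INTEGER NOT NULL    -- signed: credit > 0, debit < 0
    )
    """
)


def post_entry(operation_id: str, account_id: str, amount_cents: int) -> bool:
    """Append an entry once; return False if operation_id was already applied."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO ledger VALUES (?, ?, ?)",
            (operation_id, account_id, amount_cents),
        )
    return cur.rowcount == 1


def balance(account_id: str) -> int:
    """The balance is derived from the ledger, never stored as the source of truth."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```

A PSP retry that re-sends the same deposit hits the primary key and is ignored, so the balance is credited once.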
5) Database Replication - Options and Tradeoffs
Physical synchronous (semi-sync): minimal RPO, at the cost of added write latency - apply selectively (wallet).
Asynchronous: higher performance and simplicity, RPO of seconds to minutes - for game metadata and reference data.
Logical (CDC → stream to another region): flexible selectivity, convenient for cross-engine replication and analytics.
Caches (Redis/Memcached): never a source of truth; use replicas/snapshots and warm starts.
PITR: continuous logs (WAL/redo) to offsite storage, recovery window ≥ 7-30 days.
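Whichever replication mode is chosen, measured lag has to stay inside the RPO set for that data class. A minimal sketch of such a check follows; the wallet value comes from the RPO targets in section 1, the telemetry value (5 minutes) and the function name rpo_breaches are assumptions.

```python
# RPO targets in seconds per data class.
# wallet: upper bound from section 1; telemetry: assumed value for "minutes".
RPO_SECONDS = {"wallet": 30.0, "telemetry": 300.0}


def rpo_breaches(lag_seconds: dict[str, float]) -> list[str]:
    """Return the data classes whose replication lag exceeds their RPO.

    A class with no configured RPO is treated as RPO = 0, so any lag is reported.
    """
    return sorted(
        name
        for name, lag in lag_seconds.items()
        if lag > RPO_SECONDS.get(name, 0.0)
    )
```

Feeding this from replication-lag metrics gives an alert that is phrased in business terms (RPO at risk) rather than raw seconds.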
6) Consistency and reconciliation patterns
Saga + Outbox: business transactions as a chain of steps, publishing events atomically with writing to the database.
Exactly-once "in effect": idempotent operations plus balance version checks (optimistic locking).
Eventual consistency in non-critical flows (leaderboards, analytics); strong consistency for money.
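Optimistic locking and the outbox can be combined in one local transaction. The sketch below uses SQLite in memory for illustration only; the table layout, the event payload and the debit helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallet (account_id TEXT PRIMARY KEY, "
    "balance_cents INTEGER NOT NULL, version INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
)
conn.execute("INSERT INTO wallet VALUES ('u1', 1000, 1)")
conn.commit()


def debit(account_id: str, amount_cents: int, expected_version: int) -> bool:
    """Debit only if the row still has expected_version and enough funds.

    The outbox event is written in the same transaction as the balance change,
    so either both are committed or neither is.
    """
    with conn:
        cur = conn.execute(
            "UPDATE wallet SET balance_cents = balance_cents - ?, version = version + 1 "
            "WHERE account_id = ? AND version = ? AND balance_cents >= ?",
            (amount_cents, account_id, expected_version, amount_cents),
        )
        if cur.rowcount != 1:
            return False  # stale version or insufficient funds: re-read and retry
        event = {"type": "debited", "account_id": account_id, "amount_cents": amount_cents}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return True
```

A separate relay process would read the outbox table and publish events to the queue; that part is omitted here.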
7) Components and their failover
API/backend
Stateless containers, autoscaling, blue-green/canary deployments; configs served from a versioned config store.
Queues/Streams
Quorum clusters (N = 3/5), cross-AZ replicas; retry policies and dead-letter queues.
Wallet DB
Primary in Region A, sync replica in A (another AZ), async replica in Region B; automatic promotion under split-brain risk is prohibited - promotion is manual or scripted, with a checklist.
Files/KYC artifacts
Object storage with versioning, cross-region replication (CRR), keys in KMS.
WebSocket/Real-time
Sharding by key (table/game/market), sticky routing; on failover, clients resubscribe with a rejoin token.
8) Payments and game providers: multiple sources of truth
PSP failover: at least 2 providers for each method (cards, wallets, crypto).
Percentage-based routing by SLA/cost/BIN blocklists; a degraded PSP is switched off automatically by a circuit breaker.
Game providers: backup channels/ASN allow-lists, separate keys per region, isolated timeouts.
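A minimal sketch of percentage routing with a circuit breaker per PSP. The class and function names, the failure threshold and the cooldown are assumptions; the half-open state is simplified to "available again after the cooldown".

```python
import time
from typing import Callable, Dict, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows traffic again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s


def routing_shares(weights: Dict[str, float],
                   breakers: Dict[str, CircuitBreaker]) -> Dict[str, float]:
    """Renormalize percentage weights over PSPs whose breaker is not open."""
    live = {psp: w for psp, w in weights.items() if breakers[psp].available()}
    total = sum(live.values())
    return {psp: w / total for psp, w in live.items()} if total else {}
```

With weights of 70/30, tripping the breaker of the first PSP moves its whole share to the second until the cooldown expires.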
9) Webhooks and callbacks: resilient reception and replay
Inbox pattern: accept the webhook → verify the signature/HMAC → write it to an immutable inbox → a worker processes it idempotently.
Provider retries: backoff + dedup by 'event_id'/'signature'.
In DR: replay from the inbox with ordering enforced (txn → settlement).
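The reception half of the inbox pattern can be sketched as follows, with SQLite in memory standing in for the real store. The shared secret, table layout and the receive helper are assumptions; actual signature schemes differ per PSP and game provider, so follow the provider's documentation for header names and canonicalization.

```python
import hashlib
import hmac
import sqlite3

SECRET = b"demo-shared-secret"  # hypothetical; real secrets belong in KMS

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inbox (event_id TEXT PRIMARY KEY, body BLOB NOT NULL, "
    "processed INTEGER NOT NULL DEFAULT 0)"
)


def sign(body: bytes) -> str:
    """HMAC-SHA256 of the raw body, hex-encoded (assumed scheme)."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def receive(event_id: str, body: bytes, signature: str) -> str:
    """Verify, then store once. Returns 'rejected', 'stored' or 'duplicate'."""
    if not hmac.compare_digest(sign(body), signature):
        return "rejected"
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, body) VALUES (?, ?)",
            (event_id, body),
        )
    return "stored" if cur.rowcount == 1 else "duplicate"
```

Both 'stored' and 'duplicate' should be acknowledged with a success response so the provider stops retrying; a worker then processes unprocessed rows in order.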
10) Backups: 3-2-1 strategy and recovery checks
3 copies/2 media/1 offsite (and 1 offline/WORM for critical journals).
Schedules: daily snapshots + continuous logs; weekly test restore to a "dark" environment.
Recovery catalogs: "how to restore the wallet to time t - Δ."
11) DR plan: roles, scenarios, communications
Roles: Incident Commander, Comms, DB Lead, App Lead, Payments/Game PM, SRE On-call.
Channels: war-room, status page, message templates for support/partners/affiliates.
Scenarios (minimum): loss of an AZ, loss of a region, PSP unavailability, database cluster failure, game provider degradation, key leak, mass 5xx.
12) Example of DR scenario matrix
13) Runbooks and automation
"DR-cutover" button: sequence of steps with validation (freeze writes → promote → warm caches → ramp traffic).
Integrity check scripts: reconciliation of ledger/wallet amounts, balance consistency.
Feature flags: quickly disable reports/exports/heavy dashboards during an incident.
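An integrity check of the kind listed above can be as simple as comparing ledger sums with stored wallet balances. A minimal sketch, assuming ledger entries arrive as (account_id, signed amount in cents) pairs; the function name find_mismatches is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def find_mismatches(
    ledger_entries: Iterable[Tuple[str, int]],
    wallet_balances: Dict[str, int],
) -> Dict[str, Tuple[int, int]]:
    """Return {account_id: (ledger_total, wallet_balance)} for every disagreement.

    Accounts present on only one side are compared against zero.
    """
    totals: Dict[str, int] = defaultdict(int)
    for account_id, amount_cents in ledger_entries:
        totals[account_id] += amount_cents

    mismatches = {}
    for account_id in sorted(set(totals) | set(wallet_balances)):
        ledger_total = totals.get(account_id, 0)
        wallet_balance = wallet_balances.get(account_id, 0)
        if ledger_total != wallet_balance:
            mismatches[account_id] = (ledger_total, wallet_balance)
    return mismatches
```

Running this after promote and before ramping traffic gives a go/no-go signal: an empty result means the promoted wallet agrees with the ledger.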
14) Observability for failover
SLO metrics as triggers: login, deposit, bet, game launch.
Technical: replication lag, WAL shipping, queue lag, 5xx, p95, SYN backlog, WebSocket disconnects.
Synthetic scenarios from other regions: login/deposit/bet every minute.
End-to-end traces with 'region', 'psp', 'game_provider' tags.
15) Chaos/DR exercises
GameDay quarterly: disconnection of AZ, degradation of PSP, "loss" of the database node, queue stop.
Retrospective: decision time, missing alerts, noise, bottlenecks.
Adjust RTO/RPO and automation based on facts, not gut feeling.
16) Security and compliance
Keys/secrets in KMS/HSM (cross-region), rotation and dual control.
WORM/immutability for audit and transaction logs.
DPA/PSP/provider contracts covering SLA/DR commitments and 24×7 contact points.
17) Example of a minimal failover policy (pseudocode)
on Incident(type="REGION_DOWN"):
    freeze_non_critical_writes()
    promote_db(region=B)
    verify_ledger_consistency()
    warm_caches(region=B)
    route_traffic(region=B, ramp=10%)
    for step in [25%, 50%, 100%]:
        if SLO_green(): ramp(step) else rollback()
    announce_statuspage()
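The ramp loop of this policy can be expressed as runnable Python. This is a simplified sketch: ramp_traffic, slo_green and set_percent are hypothetical names, the SLO is checked before every step including the first, and rollback is modeled as setting the new region's share back to 0.

```python
from typing import Callable, Sequence


def ramp_traffic(
    steps: Sequence[int],
    slo_green: Callable[[], bool],
    set_percent: Callable[[int], None],
) -> bool:
    """Shift traffic step by step; roll back to 0% as soon as the SLO check fails.

    Returns True if the final step was reached, False if a rollback happened.
    """
    for percent in steps:
        if not slo_green():
            set_percent(0)  # rollback: send traffic back to the previous region
            return False
        set_percent(percent)
    return True
```

In practice slo_green would query the SLO metrics after a soak period at each step, and set_percent would call the GSLB or traffic manager API.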
18) Prod-ready checklist
- Defined RTO/RPO per flow; accepted by business.
- Multi-AZ minimum; Multi-region for wallet, login and payments.
- Ledger + idempotency (keys) + outbox/inbox; reconciliation on a schedule.
- Database replication: sync locally, async in DR; PITR enabled, restore checked.
- Two PSPs per method, routing policy and test keys; alternative game providers available.
- DNS/GSLB/Anycast, health checks and synthetics, low TTL.
- Runbook and DR-cutover button, feature-flags for degradation.
- SLO/alerts/tracing; DR Status panel.
- Quarterly DR exercises + retro; updated contacts 24 × 7.
Summary
A reliable iGaming platform is built around the money flow: a ledger of postings with idempotency, predictable failover, verifiable replication and regular DR exercises. Divide the system into cells and regions, automate cutover, keep two PSPs and spare game providers, monitor SLOs and ledger integrity - and even a major incident becomes a manageable event without loss of trust or money.