Failover, replication and DR plans for casinos

1) Business objectives: RTO/RPO and critical flows

RTO (how long the service may be unavailable): login/bet/deposit - seconds to minutes; reports - hours.

RPO (how much data may be lost): wallet/transactions - ~0-30 seconds; telemetry - minutes.

Critical flows: login, deposit/withdrawal, bet/settlement, KYC/AML callbacks, PSP/game provider webhooks.
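The objectives above can be kept as data so that alerts and runbooks read the same numbers. The following is a minimal sketch, not a prescribed format: the `Objective` class, the `OBJECTIVES` table and the concrete values are illustrative assumptions that only follow the orders of magnitude given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rto_seconds: float  # maximum tolerated downtime
    rpo_seconds: float  # maximum tolerated data-loss window


# Illustrative targets only (assumption); real values must be agreed with the business.
OBJECTIVES = {
    "wallet": Objective(rto_seconds=300, rpo_seconds=30),
    "login": Objective(rto_seconds=300, rpo_seconds=60),
    "telemetry": Objective(rto_seconds=3600, rpo_seconds=600),
    "reports": Objective(rto_seconds=4 * 3600, rpo_seconds=3600),
}


def rpo_at_risk(flow: str, replication_lag_seconds: float) -> bool:
    """Return True when the current replica lag already exceeds the flow's RPO."""
    return replication_lag_seconds > OBJECTIVES[flow].rpo_seconds
```

With these sample values, a replica that is 45 seconds behind puts the wallet RPO at risk but is still acceptable for telemetry.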

2) Architectural fault tolerance patterns

Active-Active (multi-region): both regions handle traffic; low RTO/RPO, complex consistency.

Active-Standby: one region serves traffic, the second is kept hot; simpler state management, RTO in minutes.

Cell-based: isolation by "cells" (market/brand), local incidents do not bring everything down.

Edge layering: Anycast CDN/WAF → regional gateways → app clusters → DB/caches with replication.

3) Traffic management and network failover

Anycast + CDN/WAF: absorption of L3/4/7 attacks, health checks for the origin.

DNS failover (low TTL, multi-value records), Traffic Manager/GSLB driven by health metrics.

BGP announcement via an anti-DDoS provider for a fast path change.

Health check (example of logic):


if p95_latency > threshold or 5xx_rate > threshold or synthetic_login_fail:
    drain(region_A); shift(traffic->region_B, ramp=5min)
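The health-check logic above could be made runnable roughly as follows. This is a sketch under assumptions: the thresholds, the `RegionHealth` fields and the "N unhealthy samples in a row" debounce are not part of the original logic and are added only to avoid failing over on a single bad sample.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    p95_latency_ms: float
    error_5xx_rate: float      # fraction of responses, 0.0-1.0
    synthetic_login_ok: bool


# Hypothetical thresholds (assumption); tune them against your own SLOs.
P95_LIMIT_MS = 800.0
ERROR_RATE_LIMIT = 0.005


def is_unhealthy(h: RegionHealth) -> bool:
    """Any single failing signal marks the sample as unhealthy."""
    return (
        h.p95_latency_ms > P95_LIMIT_MS
        or h.error_5xx_rate > ERROR_RATE_LIMIT
        or not h.synthetic_login_ok
    )


def should_failover(samples: list[RegionHealth], consecutive: int = 3) -> bool:
    """Trigger only after `consecutive` unhealthy samples in a row (debounce)."""
    if len(samples) < consecutive:
        return False
    return all(is_unhealthy(s) for s in samples[-consecutive:])
```

The caller would then drain region A and ramp traffic to region B only when `should_failover` returns True.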

4) Data: wallet, orders, bets

The source of truth is the ledger: append-only, idempotent by 'operation_id'.

Reconciliation: periodic jobs that compare the ledger with PSP and game provider records.

Anti-duplication: idempotency keys for deposits/callbacks/payouts; deduplication via outbox/inbox.
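To make the ledger idea concrete, here is a minimal in-memory sketch of an append-only ledger that is idempotent by `operation_id`. The `Entry` and `Ledger` names are hypothetical; a production ledger would live in a database with a unique constraint on the operation ID.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    operation_id: str
    account: str
    amount: Decimal  # positive = credit, negative = debit


class Ledger:
    """In-memory append-only ledger; a real one would live in a database."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []
        self._seen: dict[str, Entry] = {}

    def append(self, entry: Entry) -> bool:
        """Append once per operation_id. Returns False for a duplicate."""
        previous = self._seen.get(entry.operation_id)
        if previous is not None:
            if previous != entry:
                raise ValueError("operation_id reused with a different payload")
            return False
        self._entries.append(entry)
        self._seen[entry.operation_id] = entry
        return True

    def balance(self, account: str) -> Decimal:
        """The balance is derived from entries, never stored as mutable state."""
        return sum(
            (e.amount for e in self._entries if e.account == account),
            Decimal("0"),
        )
```

A retried deposit with the same `operation_id` is acknowledged but not applied twice, which is what makes provider and client retries safe.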

5) Database replication: options and trade-offs

Physical synchronous (semi-sync): minimal RPO at the cost of added write latency - apply selectively (wallet).

Asynchronous: higher performance and simplicity, RPO of seconds to minutes - for game metadata and reference data.

Logical (CDC → stream to another region): flexible selectivity, convenient for cross-engine replication and analytics.

Caches (Redis/Memcached): never a source of truth; replicas/snapshots for warm starts.

PITR: continuous logs (WAL/redo) to offsite storage, recovery window ≥ 7-30 days.

6) Consistency and reconciliation patterns

Saga + Outbox: business transactions as a chain of steps, publishing events atomically with writing to the database.

Effectively exactly-once: idempotent operations plus balance version control (optimistic locking).

Eventual consistency in non-critical flows (leaderboards, analytics); strong consistency for money.
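The outbox and optimistic-locking ideas from this section can be sketched with SQLite from the Python standard library. The table names, columns and the `debit` helper are assumptions made for illustration; amounts are integers in minor units.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE wallet (
        player_id TEXT PRIMARY KEY,
        balance   INTEGER NOT NULL,
        version   INTEGER NOT NULL
    );
    CREATE TABLE outbox (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL
    );
    INSERT INTO wallet VALUES ('player-42', 1000, 1);
    """
)


def debit(conn, player_id, amount, expected_version):
    """Debit the wallet and enqueue an event in ONE transaction.

    Returns False if another writer changed the row first (version mismatch)
    or if funds are insufficient; nothing is written in that case.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE wallet SET balance = balance - ?, version = version + 1 "
            "WHERE player_id = ? AND version = ? AND balance >= ?",
            (amount, player_id, expected_version, amount),
        )
        if cur.rowcount != 1:
            return False
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("wallet.debited", json.dumps({"player_id": player_id, "amount": amount})),
        )
        return True
```

A separate relay process would read the `outbox` table and publish events; because the event row commits together with the balance change, an event can never exist without its write or the other way round.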

7) Components and their failover

API/backend

Stateless containers, autoscaling, blue-green/canary releases; configs delivered from a versioned config store.

Queues/Streams

Quorum clusters (N = 3/5), cross-AZ replicas; retry policies and dead-letter queues (DLQ).

Wallet DB

Primary in Region A, sync replica in A (another AZ), async replica in Region B; automatic promotion is prohibited because of the split-brain risk - promote only manually or by script, following a checklist.
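A scripted promotion checklist for the wallet DB might start from a guard like the one below. This is a sketch: the three checks, the `PromotionCheck` fields and the 5-second default lag budget are assumptions for illustration, not a complete checklist.

```python
from dataclasses import dataclass


@dataclass
class PromotionCheck:
    old_primary_fenced: bool     # old primary confirmed stopped or isolated
    replica_lag_seconds: float   # last measured lag of the candidate replica
    operator_approved: bool      # a human (or a reviewed script) signed off


def promotion_blockers(check: PromotionCheck, max_lag_seconds: float = 5.0) -> list[str]:
    """Return the reasons promotion must NOT proceed; an empty list means go."""
    blockers = []
    if not check.old_primary_fenced:
        blockers.append("old primary is not fenced: split-brain risk")
    if check.replica_lag_seconds > max_lag_seconds:
        blockers.append("replica lag exceeds the RPO budget")
    if not check.operator_approved:
        blockers.append("missing operator approval")
    return blockers
```

Returning the full list of blockers rather than a single boolean lets the cutover tool show the operator exactly which checklist items are still open.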

Files/KYC artifacts

Object storage with versioning, cross-region replication (CRR), keys in KMS.

WebSocket/Real-time

Sharding by key (table/game/market), sticky routing; on failover, clients resubscribe with a rejoin token.
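For the WebSocket layer, key-based sharding and a rejoin token could look roughly like this. It is a minimal sketch: the token layout, `SHARD_COUNT` and the hard-coded secret are assumptions, IDs are assumed not to contain ":", and a real token would also carry an expiry.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # assumption: in production this comes from KMS
SHARD_COUNT = 8               # assumption: fixed number of real-time shards


def shard_for(key: str) -> int:
    """Stable shard index; SHA-256 is used because built-in hash() varies per process."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


def issue_rejoin_token(session_id: str, table_id: str, last_seq: int) -> str:
    """Token a client presents after failover to resume from last_seq."""
    body = f"{session_id}:{table_id}:{last_seq}"
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}:{sig}"


def verify_rejoin_token(token: str):
    """Return (session_id, table_id, last_seq), or None if the token is invalid."""
    try:
        session_id, table_id, seq, sig = token.split(":")
    except ValueError:
        return None
    body = f"{session_id}:{table_id}:{seq}"
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return session_id, table_id, int(seq)
```

After a failover the new region verifies the token, routes the client to `shard_for(table_id)` and replays events after `last_seq`.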

8) Payments and game providers: many sources of truth

PSP failover: at least 2 providers for each method (cards, wallets, crypto).

Percentage-based routing by SLA/cost/BIN ban lists; an automatic circuit breaker takes a degraded PSP out of rotation.

Game providers: backup channels/ASN allow-list, separate keys per region, timeout isolation.
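Percentage routing combined with a circuit breaker can be sketched as below. The window size, failure ratio and PSP names are illustrative assumptions, and the breaker is deliberately simplified: it has no half-open probing state, which a production breaker would normally need.

```python
import random
from collections import deque


class CircuitBreaker:
    """Opens when the failure ratio over the last `window` calls is too high."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.5) -> None:
        self.results = deque(maxlen=window)
        self.max_failure_ratio = max_failure_ratio

    def record(self, success: bool) -> None:
        self.results.append(success)

    @property
    def is_open(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_ratio


def choose_psp(weights: dict, breakers: dict, rng=random):
    """Weighted random choice among PSPs whose breaker is closed."""
    healthy = {name: w for name, w in weights.items() if not breakers[name].is_open}
    if not healthy:
        return None  # caller should queue the payment or show a fallback
    names = list(healthy)
    return rng.choices(names, weights=[healthy[n] for n in names], k=1)[0]
```

With weights such as `{"psp-1": 70, "psp-2": 30}`, traffic splits roughly 70/30 while both are healthy and moves entirely to the remaining PSP once one breaker opens.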

9) Webhooks and callbacks: reliable reception and replay

Inbox pattern: accept the webhook → verify the signature/HMAC → write it to an immutable inbox → a worker processes it idempotently.

Provider retries: backoff + dedup by 'event_id'/'signature'.

In DR: replay from inbox with order control (txn → settlement).
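The inbox steps above (verify, store once, process idempotently, replay in order) can be sketched in memory as follows. The `Inbox` class and its return values are hypothetical; HMAC-SHA256 over the raw body is an assumption, since the actual signature scheme is defined by each provider.

```python
import hashlib
import hmac


class Inbox:
    """In-memory stand-in for an immutable inbox table keyed by event_id."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._events: dict[str, bytes] = {}  # dicts preserve insertion order
        self._processed: set[str] = set()

    def receive(self, event_id: str, body: bytes, signature_hex: str) -> str:
        """Verify the HMAC, then store the raw body exactly once."""
        expected = hmac.new(self._secret, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return "rejected"
        if event_id in self._events:
            return "duplicate"  # provider retry: acknowledge, do not store again
        self._events[event_id] = body
        return "accepted"

    def process_pending(self, handler) -> int:
        """Run handler once per stored event, in arrival order; return the count."""
        done = 0
        for event_id, body in self._events.items():
            if event_id not in self._processed:
                handler(event_id, body)
                self._processed.add(event_id)
                done += 1
        return done
```

In a DR replay the same `process_pending` loop is run against the restored inbox; events that were already handled are skipped, so replaying is safe.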

10) Backups: 3-2-1 strategy and recovery checks

3 copies/2 media/1 offsite (plus 1 offline/WORM copy for critical logs).

Schedules: daily snapshots + continuous log archiving; weekly test restore to an isolated "dark" environment.

Recovery runbooks: "how to restore the wallet to time t - Δ."

11) DR plan: roles, scenarios, communications

Roles: Incident Commander, Comms, DB Lead, App Lead, Payments/Game PM, SRE Oncall.

Channels: war-room, status page, message templates for support/partners/affiliates.

Scenarios (minimum):

Loss of an AZ, loss of a region, PSP unavailability, database cluster failure, game provider degradation, key leak, mass 5xx errors.

12) Example of DR scenario matrix

Scenario	Detection	Actions	RTO	RPO	Exit criterion
Region A unavailable	Synthetics + GSLB	Shift traffic to B, promote database, disable heavy features	10-20 min	≤30 sec	p95 OK, 5xx < 0.5%
PSP-1 degradation	3DS errors/timeouts	Switch routing to PSP-2, enable limits	2-5 min	0	Success rate > 99%
Wallet database failure	Heartbeat/replication lag	Promote standby, verify ledger, put withdrawals on hold	5-10 min	≤5 sec	Ledger = OK
Game provider lag	RTT/start-up time	Switch traffic to alternative tables/provider	1-3 min	0	TTFS < 800 ms

13) Runbooks and automation

"DR-cutover" button: sequence of steps with validation (freeze writes → promote → warm caches → ramp traffic).

Integrity check scripts: reconciliation of ledger/wallet amounts, balance consistency.
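An integrity check of the kind mentioned above can be as simple as recomputing balances from the ledger and comparing them with the stored wallet balances. The function name and input shapes below are assumptions for the sketch.

```python
from collections import defaultdict
from decimal import Decimal


def find_balance_mismatches(ledger_entries, wallet_balances):
    """Compare balances derived from the ledger with stored wallet balances.

    ledger_entries: iterable of (account, Decimal amount) pairs.
    wallet_balances: dict mapping account -> Decimal stored balance.
    Returns {account: (ledger_total, stored_balance)} for every disagreement.
    """
    totals = defaultdict(lambda: Decimal("0"))
    for account, amount in ledger_entries:
        totals[account] += amount

    mismatches = {}
    for account in set(totals) | set(wallet_balances):
        derived = totals.get(account, Decimal("0"))
        stored = wallet_balances.get(account, Decimal("0"))
        if derived != stored:
            mismatches[account] = (derived, stored)
    return mismatches
```

An empty result is a reasonable gate before ramping traffic after a promotion; any mismatch should stop the cutover and go to manual review.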

Feature flags: quickly disable reports/exports/heavy dashboards during an incident.

14) Observability for failover

SLO metrics as triggers: login, deposit, bet, game launch.

Technical: replication lag, WAL shipping, queue lag, 5xx, p95, SYN backlog, WebSocket disconnects.

Synthetic scenarios from other regions: login/deposit/bet every minute.

End-to-end traces with 'region', 'psp', 'game_provider' tags.

15) Chaos/DR exercises

GameDay quarterly: disconnection of AZ, degradation of PSP, "loss" of the database node, queue stop.

Retrospective: decision time, missing alerts, noise, bottlenecks.

Adjust RTO/RPO and automation based on facts, not gut feeling.

16) Security and compliance

Keys/secrets in KMS/HSM (cross-regional), rotation and dual-control.

WORM/immutability for audit and transaction logs.

DPA/PSP/provider contracts for SLA/DR commitments and 24 × 7 contact points.

17) Example of a minimal failover policy (pseudocode)


on Incident(type="REGION_DOWN"):
    freeze_non_critical_writes()
    promote_db(region=B)
    verify_ledger_consistency()
    warm_caches(region=B)
    route_traffic(region=B, ramp=10%)
    for step in [25%, 50%, 100%]:
        if SLO_green(): ramp(step) else rollback()
    announce_statuspage()
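The policy above can be turned into a runnable sketch. Everything here is an assumption for illustration: `ops` is a hypothetical object exposing the steps, `RecordingOps` is a stub that only records calls, and this variant checks the SLO after every ramp step (including the last) and stops ramping after a rollback.

```python
def run_region_failover(ops, ramp_steps=(10, 25, 50, 100)):
    """Execute the cutover; returns the final traffic percentage (0 after rollback)."""
    ops.freeze_non_critical_writes()
    ops.promote_db()
    if not ops.verify_ledger_consistency():
        raise RuntimeError("ledger check failed; do not route traffic")
    ops.warm_caches()

    current = 0
    for step in ramp_steps:
        ops.route_traffic(percent=step)
        current = step
        if not ops.slo_green():
            ops.rollback()
            current = 0
            break
    ops.announce_statuspage(traffic_percent=current)
    return current


class RecordingOps:
    """Stub that only records calls; real hooks would drive GSLB, DB and caches."""

    def __init__(self, fail_at_percent=None):
        self.calls = []
        self.percent = 0
        self.fail_at_percent = fail_at_percent

    def freeze_non_critical_writes(self):
        self.calls.append("freeze")

    def promote_db(self):
        self.calls.append("promote")

    def verify_ledger_consistency(self):
        self.calls.append("verify")
        return True

    def warm_caches(self):
        self.calls.append("warm")

    def route_traffic(self, percent):
        self.percent = percent
        self.calls.append(f"route:{percent}")

    def slo_green(self):
        return self.percent != self.fail_at_percent

    def rollback(self):
        self.percent = 0
        self.calls.append("rollback")

    def announce_statuspage(self, traffic_percent):
        self.calls.append(f"announce:{traffic_percent}")
```

Running `run_region_failover(RecordingOps(fail_at_percent=50))` exercises the rollback path, which is worth rehearsing in a GameDay as often as the happy path.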

18) Prod-ready checklist

Defined RTO/RPO per flow; accepted by business.
Multi-AZ minimum; Multi-region for wallet, login and payments.
Ledger + idempotency (keys) + outbox/inbox; reconciliation on a schedule.
Database replication: sync locally, async in DR; PITR enabled, restore checked.
Two PSPs per method, routing policy and test keys; alternative game providers in place.
DNS/GSLB/Anycast, health checks and synthetics, low TTL.
Runbook and DR-cutover button, feature-flags for degradation.
SLO/alerts/tracing; DR Status panel.
Quarterly DR exercises + retro; updated contacts 24 × 7.

Summary

A reliable iGaming platform is built around the money path: an idempotent ledger of postings, predictable failover, verifiable replication and regular DR exercises. Divide the system into cells and regions, automate cutover, keep two PSPs and spare game providers, monitor SLOs and ledger integrity - and even a major incident becomes a manageable event that costs neither trust nor money.