
How failover and backup work in iGaming

Why iGaming needs a special DR/BCP discipline

A casino platform is real-time money (wallet/ledger), live rounds (RNG/Live), payments, affiliates, and strict compliance. Any availability gap quickly turns into financial and legal risk. The architecture is therefore built around predictable recovery: known objectives, known scenarios, rehearsed procedures.


Basic objectives and terms

RTO (Recovery Time Objective) - acceptable recovery time.

For wallet/ledger: ≤ 60-300 s (intra-region failover), ≤ 15 min (inter-region DR).

Recovery Point Objective (RPO) - acceptable data loss.

For the ledger: 0-5 seconds (synchronous/quasi-synchronous replication), for reporting: ≤ 15 minutes.

SLA and error budget: formalize the trade-off between rate of change and stability.


Fault tolerance layers

1) Infrastructure: Multi-AZ/Multi-Region

Multi-AZ (minimum 3 zones): all critical services are distributed by zones, automatic database/bus failover.

Multi-Region DR: "hot" (Active-Active) or "warm" (Active-Passive) second region with isolation by jurisdiction (data residency).

When to use which mode:
  • Active-Active: low latency for players in two regions, cross-region ledger via event synchronization + a single strict source of truth for settlement.
  • Active-Passive (warm): simpler and cheaper; the passive holds warm instances + database replicas, but does not serve traffic.

2) Network and perimeter

Duplicated ingress/WAF, Anycast or DNS failover with health checks.

Separate egress gateways for the cashier and providers, IP allowlists in both regions.

3) Data and queues

Relational databases (Postgres): Patroni/managed HA, synchronous replicas across AZs, an asynchronous replica in the DR region (with lag monitoring). PITR via snapshots every N minutes + a WAL archive.

OLAP (ClickHouse/BigQuery): replication/sharding; a higher RPO is acceptable here (up to 15-30 min).

Cache (Redis): a cluster with failover, but never a source of truth; warm it up after a switchover.

Event bus (Kafka/NATS): mirror clusters and/or cross-cluster mirroring, at-least-once delivery guarantee, idempotency enforced on consumers.
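With at-least-once delivery, consumers must tolerate redelivered events. A minimal sketch of consumer-side deduplication by hashing the event's business-identity fields (the field names `player_id`, `round_id`, `type` and the in-memory seen-set are illustrative assumptions; in production the seen-set would live in a durable store, ideally in the same transaction as the side effect):

```python
import hashlib
import json

class IdempotentConsumer:
    """De-duplicates at-least-once deliveries by hashing key fields.

    NOTE: the in-memory set is only for this sketch; a real consumer
    persists processed keys alongside the side effect.
    """

    def __init__(self):
        self._seen = set()
        self.applied = []

    @staticmethod
    def event_key(event: dict) -> str:
        # Hash only business-identity fields, not volatile metadata
        # such as delivery timestamps or broker offsets.
        identity = {k: event[k] for k in ("player_id", "round_id", "type")}
        blob = json.dumps(identity, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def handle(self, event: dict) -> bool:
        key = self.event_key(event)
        if key in self._seen:
            return False          # duplicate delivery: skip the side effect
        self._seen.add(key)
        self.applied.append(event)
        return True

consumer = IdempotentConsumer()
bet = {"player_id": 7, "round_id": "r-100", "type": "bet.place", "amount": 10}
assert consumer.handle(bet) is True         # first delivery applied
assert consumer.handle(dict(bet)) is False  # redelivery ignored
```

Hashing only key fields matters: a redelivered event may carry a different broker offset or timestamp, and must still deduplicate.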

4) Applications and domains

Wallet/ledger: a stateful core with strict consistency, one master writer per region; for inter-region DR, an "elected writer" procedure with fencing against double writes.

Game bridge/API: stateless, horizontal failover driven by health checks; an idempotencyKey on all financial paths.

Bonuses/notifications/ETL: tolerate delayed processing and can be replayed from queues.

Cashier (PSP/crypto): multi-provider strategy (at least two rails per country), fast switching of merchants/endpoints.
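The "elected writer" idea above is commonly implemented with fencing tokens: each promotion issues a strictly increasing token, so a stale former writer that wakes up after DR failover is rejected. A minimal sketch (the `FencedLedger` class and its token source are illustrative assumptions, not a specific product's API):

```python
class FencedLedger:
    """Accepts writes only from the holder of the newest fencing token.

    When DR promotion elects a new writer it receives a higher token;
    a former writer coming back with a stale token is fenced off,
    which prevents split-brain double writes to the ledger.
    """

    def __init__(self):
        self.highest_token = 0
        self.entries = []

    def write(self, token: int, entry: str) -> bool:
        if token < self.highest_token:
            return False              # stale writer: rejected
        self.highest_token = token    # equal token = same writer continuing
        self.entries.append(entry)
        return True

ledger = FencedLedger()
assert ledger.write(1, "bet r-1")       # region A holds token 1
assert ledger.write(2, "bet r-2")       # region B promoted with token 2
assert not ledger.write(1, "late bet")  # region A comes back: fenced off
```

In practice the token comes from the coordination layer that performs the election (e.g. a consensus store), and the storage tier enforces the comparison.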

5) Live streams

WebRTC/LL-HLS gateways with regional edge nodes; fallback to LL-HLS when WebRTC degrades.

Keep the betting logic outside the video player, so that a stream restart does not affect settlement.


Failover-patterns

Active-Active (bi-regional)

Pros: Minimal RTO/RPO, proximity to players.

Cons: ledger complexity and write conflicts, costly networking.

Practice: "one writer per domain" + event sourcing to reproduce states in a neighboring region.

Active-Passive (warm)

Pros: good price/complexity balance.

Cons: higher RTO; requires a rehearsed plan to promote the passive region.

Practice: automation plus manual confirmation (four-eyes principle) when switching the wallet.

Intraregional (Multi-AZ)

Automatic failover of database/cache/ingress.

No DNS/Anycast changes needed; RTO is seconds to minutes.


Backup by Data Class

| Class | Examples | Method | Frequency | Storage | Verification |
| --- | --- | --- | --- | --- | --- |
| Money transactions/ledger | Postgres (wallet, ledger) | Snapshots + WAL archive (PITR), logical replica | WAL every 5-15 min, snapshots every 1-4 h | Object storage with Object Lock (WORM), cross-region | Weekly cold DR restore + checksum comparison |
| Events | Kafka topics | Tiered storage + mirror to DR | Continuous | Object storage | Replay of test windows |
| OLAP/Reporting | ClickHouse/BigQuery | Snapshots/batch exports | 1-6 hours | Object storage | Reading test samples |
| Static artifacts | tickets, logs, exports | Versioned S3, Glacier | Daily | WORM/versioning | Periodic restore |
| Secrets/Keys | KMS/HSM metadata | Wrapped export, dual control | On schedule | HSM backups | Test decryption |
Principles:
  • Backups are encrypted at rest and in transit; keys are held in KMS/HSM.
  • Immutable mode (WORM) for critical backups (erasure protection/ransomware).
  • Catalog of backups with metadata (schema version, WAL window, checksums).
  • PITR is mandatory for the ledger.

Data and idempotency: how to avoid "holes" during failover

An idempotencyKey on `bet.place`, `payout.request`, and `cashier.webhook` requests.

The ledger is append-only: a repeated settle creates a correction entry, not a rewrite.

Transactional locks/balance versioning protect against races when writer roles switch.

Event deduplication (consumer-side, hash by key fields).
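The interplay of the idempotencyKey and the append-only rule can be sketched in a few lines (the `AppendOnlyLedger` class and its key format are illustrative assumptions): a retried settle is a no-op, and a mistake is fixed by posting a correction entry rather than editing history.

```python
class AppendOnlyLedger:
    """Append-only ledger sketch: retries are no-ops via idempotency
    keys, and corrections are new entries rather than rewrites."""

    def __init__(self):
        self.entries = []          # (idempotency_key, kind, amount)
        self._keys = set()

    def post(self, idempotency_key: str, kind: str, amount: int) -> bool:
        if idempotency_key in self._keys:
            return False           # retry of the same operation: ignored
        self._keys.add(idempotency_key)
        self.entries.append((idempotency_key, kind, amount))
        return True

    def balance(self) -> int:
        # Balance is always derivable by replaying the entries.
        return sum(amount for _, _, amount in self.entries)

ledger = AppendOnlyLedger()
ledger.post("settle:r-1", "settle", +50)
ledger.post("settle:r-1", "settle", +50)       # duplicate settle: no-op
ledger.post("correct:r-1", "correction", -20)  # fix via a new entry
assert ledger.balance() == 30
assert len(ledger.entries) == 2                # history is never rewritten
```

Because the history is immutable, the same replay logic that computes the balance also lets a DR region rebuild state from the event stream.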


Cashier, PSP and crypto: plan B is always ready

At least two providers per payment method (cards/APM), with pre-established merchant accounts in both regions.

For stablecoins - two networks (for example, TRC-20 and ERC-20) and two on/off-ramp providers.

Payout router: on PSP failure it instantly switches to the backup and keeps a log of the reasons.

KYT/AML streams are duplicated; if the external service is unavailable, a "degraded mode" with manual escalation applies.
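The payout-router behaviour described above can be sketched as an ordered failover chain that records why each switch happened (the `PayoutRouter` class and provider callables are illustrative assumptions, not a real PSP SDK):

```python
import time

class PayoutRouter:
    """Routes a payout through an ordered list of PSPs, failing over
    to the next provider and logging the reason for each switch."""

    def __init__(self, providers):
        self.providers = providers    # list of (name, send_callable)
        self.switch_log = []          # (timestamp, provider, reason)

    def pay(self, amount: int) -> str:
        last_error = None
        for name, send in self.providers:
            try:
                send(amount)
                return name           # first provider that succeeds
            except Exception as exc:
                last_error = exc
                self.switch_log.append((time.time(), name, str(exc)))
        raise RuntimeError(f"all providers failed: {last_error}")

def psp_down(amount):
    raise ConnectionError("PSP timeout")   # simulated primary outage

def psp_ok(amount):
    return "ok"

router = PayoutRouter([("primary", psp_down), ("backup", psp_ok)])
assert router.pay(100) == "backup"
assert router.switch_log[0][1] == "primary"   # reason for the switch is kept
```

The switch log is what later feeds incident reviews and the regulator-facing communication templates mentioned in the runbooks.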


Operational Procedures (Runbooks)

Automatic

Health check chain ingress → API → wallet → database → provider.

Automatic disabling of "heavy" functions (tournaments/missions) when the wallet is degraded.

Timeouts/retries with exponential backoff and strict deadlines.
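The retry rule above matters for money paths: backoff without a hard deadline can hold a wallet request open indefinitely. A minimal sketch (function name and parameters are illustrative assumptions):

```python
import time

def retry_with_deadline(op, attempts=5, base_delay=0.01, deadline=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Retries op with exponential backoff, but never past a hard
    deadline: the last error is re-raised instead of sleeping again."""
    start = clock()
    delay = base_delay
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            # Give up if out of attempts or the next sleep would
            # overrun the deadline budget.
            if attempt == attempts - 1 or clock() - start + delay > deadline:
                raise
            sleep(delay)
            delay *= 2            # exponential backoff

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("provider slow")
    return "settled"

assert retry_with_deadline(flaky) == "settled"
assert calls["n"] == 3            # two failures, then success
```

Injecting `sleep` and `clock` keeps the helper testable; paired with an idempotencyKey, retries like this cannot double a debit.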

Manual (with confirmation)

Promotion of the DR region to active: step-by-step checklists, logging, communication templates (support/partners/regulator).

Compensation/VOID of rounds: reason codes, links to the round video, sign-off by those responsible.

Unfreezing payouts with dual control.


Exercises and readiness checks

Game Day/Chaos drill monthly: AZ shutdown, database degradation, provider outage.

Full DR rehearsal quarterly: bring up the DR region at full scale and run real bet/payout scenarios.

Restore tests: restore the ledger to time T, verify against the control P&L and hash slices.

Table-top exercises with compliance: who notifies whom, and which reports are generated (regulator, PSP, affiliates).
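The hash-slice verification used in restore tests can be sketched simply: hash the ledger in fixed-size slices so a restored copy can be compared to the control copy, and a divergence pinpoints where the restore went wrong (the function and slice size are illustrative assumptions):

```python
import hashlib

def slice_hashes(entries, slice_size=2):
    """Hashes the ledger in fixed-size slices; equal hash lists mean
    the restored copy matches the control copy slice by slice."""
    hashes = []
    for i in range(0, len(entries), slice_size):
        chunk = "|".join(map(str, entries[i:i + slice_size])).encode()
        hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

control = [("r-1", 50), ("r-2", -20), ("r-3", 10)]
restored_ok = list(control)
restored_bad = [("r-1", 50), ("r-2", -25), ("r-3", 10)]  # corrupt amount

assert slice_hashes(restored_ok) == slice_hashes(control)

# The diverging slice index localizes the corruption:
diff = [i for i, (a, b)
        in enumerate(zip(slice_hashes(control), slice_hashes(restored_bad)))
        if a != b]
assert diff == [0]
```

Comparing slice hashes instead of row-by-row values keeps the check cheap enough to run after every scheduled restore drill.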


Observability and failover signals

SLO metrics: wallet p95 latency, `bet.rejected` share, round settle time, payout SLA, database replication lag, Kafka consumer lag.

Switching events: alerts for "role change," "replication lag > X," "object-lock violation."

DR dashboards: current node roles, RPO estimate (minutes of WAL), PITR window status.
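The "RPO estimate in minutes of WAL" gauge above reduces to simple arithmetic: the window since the last WAL segment was shipped to DR is roughly the data at risk if the primary died right now. A sketch (function names and the 5-minute budget are illustrative assumptions):

```python
def rpo_minutes(last_wal_archived_at: float, now: float) -> float:
    """Estimates current RPO as minutes of WAL not yet shipped to the
    DR region, given two timestamps in seconds."""
    return max(0.0, (now - last_wal_archived_at) / 60.0)

def rpo_alert(last_wal_archived_at: float, now: float,
              budget_minutes: float = 5.0) -> bool:
    """True when the unshipped-WAL window exceeds the RPO budget."""
    return rpo_minutes(last_wal_archived_at, now) > budget_minutes

assert rpo_minutes(1000.0, 1120.0) == 2.0  # 120 s of unshipped WAL
assert rpo_alert(1000.0, 1600.0)           # 10 min > 5 min budget: alert
assert not rpo_alert(1000.0, 1120.0)       # within budget
```

The same number, plotted over time, shows whether the declared RPO for the ledger (0-5 s) is actually being met or only promised.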


Safety and compliance

Data isolation by jurisdiction (EU/UK/CA/...): replication within legal limits.

Immutable logs (S3 Object Lock/WORM), retention per regulatory deadlines.

Secrets: key rotation, dual-control for DR.

Audit trail of all switchovers and restores.


Anti-patterns that break DR

One PSP/one stablecoin network per country - no backup rail.

OLTP and OLAP on the same database - recovery blocks live operations.

No idempotencyKey - duplicated debits/payouts on retries.

Backups without a regular restore test are "Schrödinger backup."

Lack of WORM/immutability - vulnerability to insider/malicious deletion.

DNS failover without short TTLs and pre-warmed endpoints.

A ledger writer active in two regions at the same time - split-brain.


Emergency preparedness checklist

Architecture

  • Multi-AZ for all critical services, documented topology.
  • DR-region with described role (Active-Active/Passive) and budget.

Data

  • Postgres: PITR, snapshots, lag monitoring, regular recovery tests.
  • Kafka/NATS: mirroring/archive, replay plan.
  • ClickHouse/OLAP: batch backups, restoring samples.
  • S3: Object Lock (WORM), versions, cross-region.

Applications

  • Idempotency in money, append-only ledger, balance versioning.
  • Automatic feature degradation during incidents (tournaments/missions off).
  • Canary checks before switching region.

Cashier and crypto

  • Two providers per method and two networks for stablecoins.
  • Payout routing and a log of switch reasons.
  • KYT/AML in degrade mode with escalation.

Operations

  • Runbooks with RACI and on-call contacts.
  • Monthly Chaos days and quarterly Full-DR drills.
  • Communication templates (support, partners, regulator).

Observability

  • RTO/RPO dashboards, DB role alerts, lags, bid/pay failures.
  • Audit log of switches and restores.

Reliability in iGaming is not a "failover button" but a system of habits: geographic isolation, predictable RTO/RPO, idempotent money, a multi-rail cashier, immutable backups, regular drills, and transparent communication. This discipline lets you live through failures without ledger losses, without "stuck" rounds, and without damaging the trust of players and regulators.
