
How to build fail-safe processing of millions of transactions per day

💡 Technical material for product and platform teams in fintech, gaming, and related industries. Not an invitation to gamble. By "transactions" we mean monetary/accounting transactions (debit, credit, transfer, settlement, refund).

1) What does fail-safe mean for transactions

Fail-safe means that any failure leads either to a safe stop or to a compensated state, with no loss of money or data. Objectives:
  • Double debits/credits = 0.
  • Lost transactions/events = 0.
  • Predictable SLOs for latency and delivery, clear degradation modes, and DR.

Foundation: monetary invariants (the authoritative balance lives in one place), idempotency, and guaranteed event delivery.


2) Architectural principles (short)

1. Single source of truth: balances and accounting live in the Ledger/Wallet. Surrounding services hold process state, not money.

2. Idempotency everywhere: every write operation takes an 'Idempotency-Key'; a repeat returns the same result.

3. Events with a delivery guarantee: outbox/CDC, queues, DLQ, dedup.

4. Sagas and compensations, not "manual edits."

5. Back-pressure and priorities: the system slows down but does not collapse.

6. Observability by default: structured logs, tracing, metrics.

7. Multi-region and DR: active-active/active-passive, regular drills.


3) Reference topology


Edge/API GW ── Command API ── App Service (Sagas)
│                              │
│                          (Outbox TX)
│                              │
RateLimit                  Outbox Table ── Publisher ── Kafka/Pulsar ── Consumers
│                                                                       │
WAF                                                                     └─ DLQ/Replay
│
└─ Ledger/Wallet (ACID, idempotent debit/credit)
   │
   └─ CDC/Changefeed ── DWH/BI/Recon

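Key components: Outbox (atomically records the command together with a "draft" of the event), Publisher (delivers each outbox entry; the exactly-once effect comes from deduplication downstream, see section 6), Consumers (idempotent, with a dedup key), DLQ/Replay (controlled retries).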


4) Monetary invariants and consistency

Source of truth for balances: the Ledger (ACID, serializable transactions or strict per-account ordering).

Money commands 'debit', 'credit', 'hold', 'commit', and 'rollback' are idempotent.

Composite processes are built as sagas:
  • 'authorize → settle → credit', 'request → submit → settled/failed', 'refund/void'.
  • No direct balance edits that bypass the Ledger.
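
The saga idea above can be sketched in a few lines. This is a minimal illustration, not a production orchestrator: the names `run_saga` and `SagaFailed` and the `(name, action, compensation)` step shape are assumptions made for the example. It runs steps in order and, on failure, runs the compensations of already completed steps in reverse.

```python
class SagaFailed(Exception):
    """Raised after a step failed and completed steps were compensated."""


def run_saga(steps, context: dict) -> dict:
    """Run (name, action, compensation) steps; undo completed ones in reverse on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(context)
            completed.append((name, compensate))
        except Exception as exc:
            # Compensations must themselves be idempotent, because a crash
            # during rollback means they may be executed again.
            for _done_name, undo in reversed(completed):
                undo(context)
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc
    return context
```

In a real system the list of completed steps would be persisted, so that a restarted process can finish the compensation instead of relying on in-memory state.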

5) Idempotency: key design

The key must uniquely identify the business operation:
  • `bet_id+amount+currency`, `payment_intent+capture_id`, `payout_id`, `chain_txid`.
  • Store the result by key (response cache). A repeat with the same key returns the same body/status.
  • Watch for mismatches: the same key with a different amount → 'IDEMPOTENCY_MISMATCH'.
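
A minimal sketch of these rules, assuming an in-memory dictionary in place of a real store. The names `IdempotencyStore`, `fingerprint`, and `IdempotencyMismatch` are hypothetical. The first call stores the response under the key, a repeat returns the stored response without running the handler again, and the same key with a different payload is rejected.

```python
import hashlib
import json
from dataclasses import dataclass, field


class IdempotencyMismatch(Exception):
    """The same key was reused with a different request payload."""


def fingerprint(payload: dict) -> str:
    # Canonical JSON, so that field order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class IdempotencyStore:
    # key -> (request fingerprint, stored response); in-memory for illustration only.
    _entries: dict = field(default_factory=dict)

    def execute(self, key: str, payload: dict, handler):
        fp = fingerprint(payload)
        if key in self._entries:
            stored_fp, stored_response = self._entries[key]
            if stored_fp != fp:
                raise IdempotencyMismatch(f"IDEMPOTENCY_MISMATCH for key {key!r}")
            return stored_response  # replay: the handler is not called again
        response = handler(payload)
        self._entries[key] = (fp, response)
        return response
```

This sketch is not safe under concurrency: two simultaneous requests with the same key could both reach the handler. A real implementation would typically rely on an atomic insert (for example, a unique constraint on the key) inside the same transaction as the money movement.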

6) Queues, ordering, and dedup

Exactly-once effects are achieved not by the transport but by idempotent consumers plus a dedup store (LRU/Redis/DB with TTL).

Preserve per-key ordering (partition key = 'account_id/round_id/player_id').

For "heterogeneous" keys, use a state machine per entity.

DLQ is mandatory: after N attempts, the message goes to an isolated topic with a human-readable cause.
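
The consumer side can be sketched as follows, under stated assumptions: `DedupStore` is an in-memory stand-in for Redis or a database table with TTL, the DLQ is a plain list, and the function name `consume` is hypothetical. It skips events whose `event_id` was already processed and parks an event in the DLQ with a readable cause after `max_attempts` failures.

```python
import time


class DedupStore:
    """Remembers processed event IDs for ttl_seconds (in-memory stand-in for Redis/DB)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # event_id -> expiry timestamp

    def seen(self, event_id: str) -> bool:
        expiry = self._seen.get(event_id)
        if expiry is not None and expiry <= self.clock():
            del self._seen[event_id]  # entry expired
            return False
        return expiry is not None

    def mark(self, event_id: str) -> None:
        self._seen[event_id] = self.clock() + self.ttl


def consume(event: dict, handler, dedup: DedupStore, dlq: list, max_attempts: int = 3) -> str:
    """Apply handler once per event_id; dead-letter the event after max_attempts failures."""
    if dedup.seen(event["event_id"]):
        return "duplicate"
    last_error = None
    for _attempt in range(max_attempts):
        try:
            handler(event)
            dedup.mark(event["event_id"])
            return "processed"
        except Exception as exc:  # sketch: real code would retry only transient errors
            last_error = exc
    dlq.append({"event": event, "attempts": max_attempts, "cause": repr(last_error)})
    return "dead-lettered"
```

Note that `handler` and `dedup.mark` are two separate steps here, so a crash between them would lead to a second application of the event. To get a true exactly-once effect, the dedup record is usually written in the same database transaction as the effect itself, and the TTL must be longer than the longest possible redelivery window.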


7) Outbox/CDC: Why events "don't get lost"

Within a single transaction, we write both the business change and an outbox entry to the service database.

A separate publisher reads the outbox and publishes it to the bus with acknowledgement.

Alternatively, use CDC (Change Data Capture) at the database level (Debezium/replication log).

Never emit events outside the transaction: that is a source of loss.
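
A minimal sketch of the outbox pattern using Python's built-in `sqlite3` module. The table layout and the function names `init_db`, `record_debit`, and `publish_pending` are assumptions made for the example, and `send` stands in for a real bus client. The business row and the outbox row are committed together or not at all, and the publisher marks a row as published only after `send` returns.

```python
import json
import sqlite3
import uuid


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE entries (entry_id TEXT PRIMARY KEY, account_id TEXT, amount_minor INTEGER);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT UNIQUE, payload TEXT, published INTEGER DEFAULT 0
        );
        """
    )


def record_debit(conn, account_id: str, amount_minor: int, fail_before_commit: bool = False) -> str:
    entry_id = str(uuid.uuid4())
    event = {"event_id": str(uuid.uuid4()), "event_type": "wallet.debit.committed",
             "account_id": account_id, "amount_minor": amount_minor}
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("INSERT INTO entries VALUES (?, ?, ?)", (entry_id, account_id, amount_minor))
        conn.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                     (event["event_id"], json.dumps(event)))
        if fail_before_commit:
            raise RuntimeError("simulated crash")  # hook for demonstrating rollback
    return entry_id


def publish_pending(conn, send) -> int:
    """Send unpublished outbox rows in order; returns how many rows were attempted."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq").fetchall()
    for seq, payload in rows:
        send(json.loads(payload))  # may raise; the row then stays unpublished and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the process dies after `send` succeeds but before the row is marked, the event is sent again on the next run. That is why this scheme gives at-least-once delivery and still needs idempotent consumers.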


8) Back-pressure and priorities

Token buckets and input quotas (per tenant/brand/region).

Priority queues: money paths above promo/telemetry.

When overloaded: 'no new sessions/requests' modes, freeze secondary features, preserve the core.

Auto-degradation: reduce the frequency of background tasks, dynamically scale up critical workers.
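
The token bucket mentioned at the start of this section can be sketched like this. The class name and parameters are illustrative, and a per-tenant setup would keep one bucket per tenant/brand/region. The bucket allows bursts up to `capacity` and refills at `rate` tokens per second.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should shed or delay the request
```

Giving money paths a larger bucket (or a lower `cost`) than promo or telemetry traffic is one simple way to express the priorities described above.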


9) Multi-region resilience

Active-active for API and queues, local Ledger (or a global one sharded by region/currency).

Data residency: money/PII/logs do not cross regional boundaries without explicit rules.

Cross-region event replication is asynchronous, tagged with 'region'.

RPO/RTO: aim for RPO ≤ 5 minutes and RTO ≤ 30 minutes; verify regularly.


10) SLO/SLI and dashboards

Targets (example):
  • p95 'authorize/debit/credit' < 150-300 ms (internal path).
  • p95 end-to-end "command → event" via the bus < 1-2 s.
  • Delivery of webhooks/external events: p99 < 5 min.
  • Lost/duplicated transactions = 0 (contract checks).

Metrics: latency p50/p95/p99, error rate (4xx/5xx/business), consumer/queue lag, retry storms, settle lag, webhook lag, DLQ size, 'IDEMPOTENCY_MISMATCH' frequency.
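
As a small illustration of checking latency targets like those above, here is a sketch that computes nearest-rank percentiles from raw samples and compares them to budgets. The function names are hypothetical, and production systems usually compute percentiles from histograms in the metrics backend rather than from raw lists.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for integer p in 1..100; samples must be non-empty."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need samples and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def slo_report(latencies_ms, targets_ms: dict) -> dict:
    """targets_ms maps a percentile (e.g. 95) to its latency budget in milliseconds."""
    report = {}
    for p, budget in targets_ms.items():
        observed = percentile(latencies_ms, p)
        report[f"p{p}"] = {"observed_ms": observed, "budget_ms": budget,
                           "ok": observed <= budget}
    return report
```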


11) Observability and audit

Structured JSON logs with 'trace_id', 'idempotency_key', business IDs, error codes.

OpenTelemetry: HTTP/gRPC/DB/bus tracing, saga spans.

WORM audit: immutable logs of critical changes (limits, keys, promo/jackpot configs).

PII/secret masking, regional buckets, RBAC/ABAC for log access.


12) Reliability testing

Contract tests: repetition/duplicates, out-of-order, idempotency, dedup.

Load: peak profile (x10), stability of queues and DB.

Chaos cases: Ledger/Wallet outage, queue/region failure, CDC delays, retry "storms."

Game Days: regular DR drills and incident simulations, with MTTR measured.


13) Storage and data

OLTP for money: transactional database (RPO≈0), strict indexes, serializable isolation for critical entities.

Cache (Redis) - only for acceleration, not for "truth." TTL + jitter, cache stampede protection.

OLAP/DWH - for reports/analytics. Flows from CDC/bus, no load on OLTP.

Data schemas are versioned; migrations run without downtime (expand/contract).


14) Retry orchestration

Exponential backoff + jitter, deadlines/timeouts on RPC.

Idempotent retries at each layer (client → service → consumer).

Retry quotas; protect against "storms" (circuit breaker, hedged requests where appropriate).

Replay from the DLQ only in "safe" windows, with a rate limit.
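
A minimal sketch of exponential backoff with jitter under an overall deadline. The function name and defaults are assumptions for illustration, and the "full jitter" variant (a uniform delay between zero and the exponential bound) is one common choice, not the only one. The `sleep`, `clock`, and `rng` parameters are injectable so the behavior can be tested without waiting.

```python
import random
import time


def call_with_retries(op, *, attempts=5, base=0.1, cap=5.0, deadline=10.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Retry op() on exception until attempts or the overall deadline run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    started = clock()
    last_error = None
    for n in range(attempts):
        try:
            return op()
        except Exception as exc:  # sketch: real code would retry only transient errors
            last_error = exc
        if n == attempts - 1:
            break  # no point sleeping after the final attempt
        # Full jitter: uniform in [0, min(cap, base * 2**n)).
        delay = rng() * min(cap, base * (2 ** n))
        if clock() - started + delay > deadline:
            break  # sleeping would overshoot the deadline
        sleep(delay)
    raise last_error
```

Retrying like this is safe only when `op` is idempotent, for example when every attempt carries the same idempotency key.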


15) Transport security

mTLS everywhere for service-to-service (S2S) traffic, short-lived tokens (OAuth2 client credentials), body signatures (HMAC/EdDSA) for webhooks.

Secrets in Vault/HSM, rotation, keys per brand/region.

Least-privilege policies, "four eyes" on manual operations.
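
The HMAC body signature for webhooks can be sketched with Python's standard `hmac` module. The message layout (timestamp, a dot, then the raw body) and the 300-second tolerance are assumptions for the example, not a standard. Real integrations must follow whatever format the two sides have agreed on.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    # Signing timestamp + body lets the receiver reject stale replays.
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: str, body: bytes, signature: str,
           now: float, tolerance_seconds: float = 300.0) -> bool:
    try:
        sent_at = float(timestamp)
    except ValueError:
        return False  # malformed timestamp
    if abs(now - sent_at) > tolerance_seconds:
        return False  # too old (or too far in the future)
    expected = sign(secret, timestamp, body)
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Verification has to run over the raw request bytes, before any JSON parsing or re-serialization, since re-encoding can change the byte sequence and invalidate the signature.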


16) Sample contracts (fragments)

Idempotent Debit Command


POST /v1/wallet/debit
Headers: X-Idempotency-Key: debit_pi_001, X-Trace-Id: tr_a1b2
{
  "account_id": "acc_42",
  "amount": {"minor_units": 5000, "currency": "EUR"},
  "reason": "payout",
  "reference_id": "po_001"
}
→ 200 { "status":"committed", "entry_id":"e_77" }
(repeat → same answer)

Event from outbox

{
  "event_id": "uuid",
  "event_type": "wallet.debit.committed",
  "occurred_at": "2025-10-23T16:21:05Z",
  "account_id": "acc_42", "amount_minor": 5000, "currency": "EUR",
  "reference_id": "po_001", "idempotency_key": "debit_pi_001",
  "schema_version": "1.3.0"
}

17) Checklists

Platform/Operator

  • Source of truth for balances is a single Ledger; no bypasses.
  • All write operations carry an 'Idempotency-Key'; the response is stored by key.
  • Outbox/CDC for all domain writes, DLQ and managed replay.
  • Priority queues, back-pressure, degradation modes.
  • Partition keys are chosen from business keys; consumers are idempotent.
  • SLO dashboards, OpenTelemetry, WORM audit.
  • Regular DR/chaos exercises, contract/load tests.
  • Data residency, encryption, Vault/HSM, key rotation.

Providers/Integrations

  • Send Trace-Id/Idempotency-Key; be ready for redelivery.
  • Webhooks are signed and deduplicated.
  • Schema/contract versions are respected (semver, deprecation).

18) Red flags (anti-patterns)

The balance changes via webhook without a command to the Ledger.

Lack of idempotency → double debits/credits.

Publishing events bypassing outbox/CDC.

Monolith without back-pressure: peak traffic brings everything down.

Mixing OLTP and reporting: BI hits the production database.

No DLQ/replay; errors are silently swallowed.

No regional PII/money isolation; shared keys across multiple brands.

Manual edits of balances/statuses in the database.


19) The bottom line

Fail-safe processing of millions of transactions per day is about invariants and discipline: a single source of truth, idempotent commands, sagas and outbox/CDC, ordering and dedup in queues, observability, and managed degradation. Add access policies, DR practices, and regular drills, and you get a system where money moves quickly and only once, events are not lost, and traffic growth and failures become manageable risks rather than surprises.
