How to build fail-safe processing of millions of transactions per day
1) What does fail-safe mean for transactions
Fail-safe means that any failure leads either to a safe stop or to a compensated state, with no loss of money or data. Objectives:
- Double debits/credits = 0.
- Lost transactions/events = 0.
- Predictable latency/delivery SLOs, clear degradation modes and DR.
The foundation: monetary invariants (the true balance lives in one place), idempotency, and consistent event delivery.
2) Architectural principles (short)
1. Single source of truth: balances and accounting live in the Ledger/Wallet. The surrounding services hold process state, not money.
2. Idempotency everywhere: all "write" operations take an 'Idempotency-Key'; a repeat returns the same result.
3. Events with a delivery guarantee: outbox/CDC, queues, DLQ, dedup.
4. Sagas and compensations, not "manual edits."
5. Back-pressure and priorities: the system slows down, but does not collapse.
6. Default observability: structured logs, tracing, metrics.
7. Multi-region and DR: active-active/active-passive, regular drills.
3) Reference topology
Edge/API GW ──Command API ──App Service (Sagas)
│ │
│ (Outbox TX)
RateLimit Outbox Table ──Publisher ──Kafka/Pulsar ──Consumers
│ │
WAF └─DLQ/Replay
│
└─Ledger/Wallet (ACID, idempotent debit/credit)
│
└─CDC/Changefeed ──DWH/BI/Recon
Key points: Outbox (an atomic record of the command together with a "draft" of the event), Publisher (exactly-once delivery in effect), Consumers (idempotent, with a dedup key), DLQ/Replay (controlled retries).
4) Monetary invariants and consistency
The source of truth for balances is the Ledger (ACID, serializable transactions or strict per-account ordering).
Money commands: 'debit', 'credit', 'hold', 'commit', 'rollback' are idempotent.
Composite processes are built as sagas:
- 'authorize → settle → credit', 'request → submit → settled/failed', 'refund/void'.
- No direct balance edits that bypass the Ledger.
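A saga of this kind can be sketched as an ordered list of steps, each paired with a compensation that runs in reverse order when a later step fails. This is only an illustration: `run_saga`, the step names and the in-memory `log` are hypothetical, and a real saga would persist its state and call idempotent Ledger commands instead of appending to a list.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are illustrative placeholders.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
            done.append(step)
        except Exception:
            # Undo only what actually succeeded, newest first.
            for _, _, compensate in reversed(done):
                compensate()
            return False
    return True


log: List[str] = []


def fail_settle() -> None:
    raise RuntimeError("settle failed")


steps: List[Step] = [
    ("authorize", lambda: log.append("authorize"), lambda: log.append("void")),
    ("settle", fail_settle, lambda: log.append("unsettle")),
    ("credit", lambda: log.append("credit"), lambda: log.append("reverse_credit")),
]
ok = run_saga(steps)
```

Here the failed 'settle' step triggers the compensation of 'authorize' only; 'credit' never runs, so it has nothing to undo.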
5) Idempotence: key design
The key must uniquely identify the business transaction:
- `bet_id+amount+currency`, `payment_intent+capture_id`, `payout_id`, `chain_txid`.
- Store the result by key (response cache). Repeat with same key → same body/status.
- Monitor mismatches: the same key with a different amount → 'IDEMPOTENCY_MISMATCH'.
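The rules above (store the result by key, return it on repeat, flag a mismatch) can be sketched with an in-memory store. The class `IdempotentWallet`, its fields and the exception name are hypothetical; in a real service the key lookup and the balance change would have to happen inside one database transaction.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class IdempotencyMismatch(Exception):
    """The same key was reused with a different request payload."""


@dataclass(frozen=True)
class Response:
    status: str
    entry_id: str


class IdempotentWallet:
    def __init__(self) -> None:
        # key -> (request fingerprint, stored response)
        self._seen: Dict[str, Tuple[Tuple[str, int, str], Response]] = {}
        self._next_entry = 1
        self.balance_minor: Dict[str, int] = {}

    def debit(self, key: str, account_id: str, amount_minor: int, currency: str) -> Response:
        fingerprint = (account_id, amount_minor, currency)
        if key in self._seen:
            stored_fp, stored_resp = self._seen[key]
            if stored_fp != fingerprint:
                raise IdempotencyMismatch("IDEMPOTENCY_MISMATCH")
            return stored_resp  # repeat: same response, no second debit
        self.balance_minor[account_id] = self.balance_minor.get(account_id, 0) - amount_minor
        resp = Response("committed", f"e_{self._next_entry}")
        self._next_entry += 1
        self._seen[key] = (fingerprint, resp)
        return resp
```

Fingerprinting the request alongside the stored response is what lets the service distinguish a harmless retry from a client bug that reuses a key for a different amount.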
6) Queues, ordering and dedup
Exactly-once effects are achieved not by the transport, but by idempotent consumers plus dedup storage (LRU/Redis/DB with TTL).
Preserve per-key ordering (partition key = 'account_id/round_id/player_id').
For "heterogeneous" keys, use a state machine per entity.
DLQ is mandatory: after N attempts - into an isolated topic with a human-readable cause.
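A minimal sketch of an idempotent consumer with TTL-based dedup and a DLQ after N attempts follows. `DedupStore` is an in-memory stand-in for Redis or a database table, and `consume`, `max_attempts` and the DLQ record shape are assumptions made for illustration. In production, applying the effect and marking the event as seen should be atomic.

```python
import time
from typing import Callable, Dict, List


class DedupStore:
    """In-memory stand-in for Redis/DB dedup storage with TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[str, float] = {}

    def seen(self, event_id: str) -> bool:
        exp = self._expires.get(event_id)
        return exp is not None and exp > self._clock()

    def mark(self, event_id: str) -> None:
        self._expires[event_id] = self._clock() + self._ttl


def consume(
    event: dict,
    handler: Callable[[dict], None],
    dedup: DedupStore,
    dlq: List[dict],
    max_attempts: int = 3,
) -> str:
    if dedup.seen(event["event_id"]):
        return "duplicate"  # effect already applied; skip silently
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            dedup.mark(event["event_id"])
            return "processed"
        except Exception as exc:
            last_error = f"attempt {attempt}: {exc}"
    # After N failed attempts, park the event with a human-readable cause.
    dlq.append({"event": event, "cause": last_error})
    return "dead-lettered"
```

The dedup TTL should be longer than the longest window in which the transport may redeliver an event; otherwise a late duplicate is treated as new.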
7) Outbox/CDC: Why events "don't get lost"
Within a single transaction, we write both the business change and an outbox entry to the service database.
A separate publisher reads the outbox and publishes to the bus, waiting for acknowledgement.
Alternatively, use CDC (Change Data Capture) at the database level (Debezium/replication log).
No events are "logged" outside the transaction: that is a source of loss.
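The outbox pattern described above can be sketched with the standard-library `sqlite3` module. The table layout, the function names `debit_with_outbox` and `publish_pending`, and the `send` callback standing in for the bus client are all hypothetical. The point is only that the balance update and the outbox row commit together, and that a row is marked published only after `send` returns.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE balances (account_id TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def debit_with_outbox(conn: sqlite3.Connection, account_id: str, amount_minor: int) -> None:
    event = {
        "event_type": "wallet.debit.committed",
        "account_id": account_id,
        "amount_minor": amount_minor,
    }
    with conn:  # one transaction: both writes commit, or neither does
        conn.execute(
            "UPDATE balances SET amount_minor = amount_minor - ? WHERE account_id = ?",
            (amount_minor, account_id),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def publish_pending(conn: sqlite3.Connection, send: Callable[[dict], None]) -> int:
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    count = 0
    for row_id, payload in rows:
        send(json.loads(payload))  # if this raises, the row stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        count += 1
    return count
```

If the publisher crashes between `send` and the update, the event is sent again on the next pass. That is at-least-once delivery, which is why consumers still need a dedup key.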
8) Back-pressure and priorities
Token buckets and input quotas (per tenant/brand/region).
Priority queues: money paths above promo/telemetry.
Under overload: 'no new sessions/requests' modes, freezing secondary features, preserving the core.
Auto-degradation: reduce the frequency of background tasks, dynamically scale up critical workers.
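A per-tenant token bucket, as mentioned at the start of this section, might look like the sketch below. `TokenBucket`, `TenantLimiter` and their parameters are illustrative names, and the clock is injectable only to make the behavior easy to check.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant cannot starve the others."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(self._rate, self._capacity, self._clock)
        return self._buckets[tenant].allow()
```

A rejected request would typically be answered with a retryable status so that the caller backs off instead of hammering the service.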
9) Multi-region resilience
Active-active for the API and queues; a local Ledger (or a global one sharded by region/currency).
Data residency: money/PII/logs do not cross regional borders without explicit rules.
Cross-region event replication is asynchronous, with events tagged by 'region'.
RPO/RTO: aim for RPO ≤ 5 minutes, RTO ≤ 30 minutes; check regularly.
10) SLO/SLI and dashboards
Targets (example):
- p95 'authorize/debit/credit' < 150-300 ms (internal path).
- p95 end-to-end "command → event" over the bus < 1-2 s.
- Delivery of webhooks/external events p99 < 5 min.
- Lost/Duplicated Transactions = 0 (contract checks).
Metrics: latency p50/p95/p99, error rate (4xx/5xx/business), consumer/queue lag, retry storms, settle lag, webhook lag, DLQ size, 'IDEMPOTENCY_MISMATCH' frequency.
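To make a latency target such as "p95 < 150 ms" checkable, the percentile has to be computed the same way everywhere. The sketch below uses the nearest-rank definition; `percentile` and `slo_met` are hypothetical helper names, and a real monitoring system would usually compute this from histograms rather than raw samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def slo_met(samples: Sequence[float], p: float, threshold_ms: float) -> bool:
    """True when the p-th percentile latency is strictly below the threshold."""
    return percentile(samples, p) < threshold_ms
```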
11) Observability and audit
Structured JSON logs with 'trace_id', 'idempotency_key', business ID, error codes.
OpenTelemetry: HTTP/gRPC/DB/bus tracing, spans of sagas.
WORM audit: immutable logs of critical changes (limits, keys, promo/jackpot configs).
PII/secret masking, regional buckets, RBAC/ABAC for log access.
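Structured logging with masking can be sketched on top of Python's standard `logging` module. The `JsonFormatter` class, the `fields` attribute passed through `extra`, and the `SENSITIVE_FIELDS` set are assumptions for illustration, not a prescribed schema.

```python
import json
import logging
from typing import Any, Dict

# Hypothetical list of field names that must never appear in clear text.
SENSITIVE_FIELDS = {"card_number", "email"}


def mask(fields: Dict[str, Any]) -> Dict[str, Any]:
    return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with masked business fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        payload.update(mask(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("wallet")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "debit committed",
    extra={"fields": {"trace_id": "tr_a1b2", "idempotency_key": "debit_pi_001", "email": "a@b.c"}},
)
```

Masking inside the formatter means every handler attached to it emits the same sanitized output, regardless of which code path produced the record.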
12) Reliability testing
Contract tests: repetition/duplicates, out-of-order, idempotency, dedup.
Load: peak profile (x10), stability of queues and DB.
Chaos cases: Ledger/Wallet outage, queue/region loss, CDC delays, retry "storms".
Game Days: regular DR and incident drills, with MTTR measured.
13) Storage and data
OLTP for money: a transactional database (RPO≈0), strict indexes, serializable isolation for critical entities.
Cache (Redis) - only for acceleration, not for "truth." TTL + jitter, cache stampede protection.
OLAP/DWH - for reports/analytics. Flows from CDC/bus, no load on OLTP.
Data schemas are versioned; migration without downtime (expand/contract).
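The "TTL + jitter, cache stampede protection" advice for the cache can be illustrated with a small in-process sketch. `JitteredCache` is a hypothetical class, and the single coarse lock is a deliberately simple stampede guard; a Redis-backed setup would use a per-key lock or a similar mechanism instead.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class JitteredCache:
    def __init__(
        self,
        base_ttl: float,
        jitter: float,
        clock: Callable[[], float] = time.monotonic,
        rng: Optional[random.Random] = None,
    ) -> None:
        self._base_ttl = base_ttl
        self._jitter = jitter
        self._clock = clock
        self._rng = rng or random.Random()
        self._lock = threading.Lock()
        self._data: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        # Coarse lock: only one caller reloads at a time, the rest wait and hit the cache.
        with self._lock:
            now = self._clock()
            entry = self._data.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]
            value = loader()
            # Random extra TTL spreads expirations so keys do not all expire together.
            ttl = self._base_ttl + self._rng.uniform(0.0, self._jitter)
            self._data[key] = (now + ttl, value)
            return value
```

The cached value is still only an accelerator: any decision about money has to be made against the Ledger, not against this cache.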
14) Retry orchestration
Exponential backoff + jitter, deadlines/timeouts on RPC.
Idempotent retries at each layer (client → service → consumer).
Retry quotas and protection against "storms" (circuit breaker, hedged requests where appropriate).
Replay from the DLQ only in "safe" windows, with a rate limit.
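Exponential backoff with jitter and an overall deadline can be sketched as below. The "full jitter" variant (a uniform delay between zero and the exponential cap) is one common choice rather than the only one, and `call_with_retry` with its parameters is a hypothetical helper. The wrapped function must be idempotent, otherwise retries can duplicate effects.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    deadline_s: float = 10.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    rng = rng or random.Random()
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            delay = backoff_delay(attempt, base, cap, rng)
            out_of_time = (clock() - start) + delay > deadline_s
            # Give up on the last attempt or when the next wait would exceed the deadline.
            if attempt == max_attempts - 1 or out_of_time:
                raise
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

A retry quota or circuit breaker would sit around this helper; the sketch covers only the per-call backoff and deadline.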
15) Transport security
mTLS for all S2S traffic, short-lived tokens (OAuth2 CC), body signatures (HMAC/EdDSA) for webhooks.
Secrets in Vault/HSM, rotation, keys per brand/region.
Least-privilege policies, "four eyes" on manual operations.
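Webhook body signing with HMAC can be sketched with the standard `hmac` and `hashlib` modules. HMAC-SHA256 with a hex-encoded signature is an assumption here, as are the helper names; the actual header name, encoding and any timestamp or replay protection depend on the integration contract.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    expected = sign_body(secret, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The signature must be computed over the exact bytes received, before any JSON parsing or re-serialization, because a reformatted body no longer matches.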
16) Sample contracts (fragments)
Idempotent Debit Command
POST /v1/wallet/debit
Headers: X-Idempotency-Key: debit_pi_001, X-Trace-Id: tr_a1b2
{
"account_id":"acc_42", "amount":{"minor_units":5000,"currency":"EUR"}, "reason":"payout", "reference_id":"po_001"
}
→ 200 { "status":"committed", "entry_id":"e_77" }
(repeat → same answer)
Event from outbox
{
"event_id":"uuid", "event_type":"wallet.debit.committed", "occurred_at":"2025-10-23T16:21:05Z", "account_id":"acc_42", "amount_minor":5000, "currency":"EUR", "reference_id":"po_001", "idempotency_key":"debit_pi_001", "schema_version":"1.3.0"
}
17) Checklists
Platform/Operator
- The source of truth for balances is a single Ledger; there are no workarounds.
- All write operations carry an 'Idempotency-Key'; the response is stored by key.
- Outbox/CDC on all domain writes, DLQ and managed replay.
- Priority queues, back-pressure, degradation modes.
- Partition-keys are selected by business keys; consumers are idempotent.
- SLO dashboards, OpenTelemetry, WORM audit.
- Regular DR/chaos exercises, contract/load tests.
- Data residency, encryption, Vault/HSM, key rotation.
Providers/Integrations
- Send 'Trace-Id'/'Idempotency-Key' and be ready for redelivery.
- Webhooks are signed and deduplicated.
- Schema/contract versions are respected (semver, deprecation).
18) Red flags (anti-patterns)
The balance is changed by a webhook without a command in the Ledger.
Lack of idempotency → double debits/credits.
Publishing events bypassing outbox/CDC.
Monolith without back-pressure: peak traffic brings everything down.
Mixing OLTP and reports: BI hits the production database.
Absence of DLQ/replay; "silent" swallowing of errors.
No regional PII/money isolation; shared keys across multiple brands.
Manual edits of balances/statuses in the database.
19) The bottom line
Fail-safe processing of millions of transactions per day is about invariants and discipline: a single source of truth, idempotent commands, sagas and outbox/CDC, ordering and dedup in queues, observability and managed degradation. Add access controls, DR practices and regular drills, and you get a system where money moves quickly and only once, events are not lost, and traffic growth and outages become manageable risks, not surprises.