Observability: metrics, logs, tracing in iGaming
1) Why observability matters in iGaming
Players are sensitive to real-time delays and failures (live games, bets, tournaments). Any degradation of login, deposit, or withdrawal hits revenue and trust. Observability should:
- provide a snapshot across network layers L3-L7, applications, and the business;
- quickly localize bottlenecks between the frontend, APIs, game providers, and payments;
- clearly separate product failures (a player cannot place a bet) from "pretty" technical metrics.
Key point: start with SLOs (service level objectives) for product flows, and only then select metrics, logs, and traces.
2) Product SLOs and error budget
Examples of SLOs (over 30 days):
- Login: success ≥ 99.90%, p95 latency ≤ 250 ms.
- Deposit (`/payments/deposit`) and withdrawal: success ≥ 99.85%, p95 ≤ 400 ms.
- Real-time bet: success ≥ 99.9%, p95 WS message latency ≤ 120 ms.
- Launching a slot or a live-game session: success ≥ 99.8%, p95 ≤ 800 ms.
The error budget translates into release policy: if more than 50% is used up, feature releases stop and only canary deployments are allowed; above 80%, only bug fixes ship.
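The budget arithmetic behind that policy can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Slo` class, the function names, and the policy labels (`"canary-only"`, `"bugfixes-only"`) are hypothetical, and the request counts are made-up inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # required success ratio, e.g. 0.999 for 99.9%


def budget_consumed(slo: Slo, total: int, failed: int) -> float:
    """Fraction of the error budget used in the window (1.0 = fully spent)."""
    if total == 0:
        return 0.0
    allowed_failures = (1.0 - slo.target) * total
    return failed / allowed_failures


def release_policy(consumed: float) -> str:
    """Map budget consumption to the thresholds from the text (50% and 80%)."""
    if consumed > 0.8:
        return "bugfixes-only"
    if consumed > 0.5:
        return "canary-only"
    return "normal"


bet_slo = Slo(name="bet", target=0.999)
# 1,000,000 bets allow about 1,000 failures; 600 failures use about 60% of the budget.
consumed = budget_consumed(bet_slo, total=1_000_000, failed=600)
print(round(consumed, 2), release_policy(consumed))
```

Computing consumption as a ratio of observed to allowed failures keeps the same policy function usable for every flow, whatever its SLO target.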
3) The "three pillars" of telemetry
Metrics (quantifying state)
RED for user-facing APIs: Rate, Errors, Duration for each endpoint/method.
USE for infrastructure: Utilization, Saturation, Errors (CPU, memory, IO, connections, queues).
Business metrics: registration→deposit conversion, success rate, number of active live casino tables, average odds (quote) delay.
Logs (facts and context)
Structured JSON events with required fields: `ts`, `level`, `service`, `env`, `trace_id`, `span_id`, `user_id` (pseudonymized), `session_id`, `route`, `status`, `latency_ms`, `amount`, `currency`, `provider`.
Categories: audit (changes in rights/balance), business events (bet, deposit), errors (stack/code), technical (warn/info).
Tracing (cause and effect)
End-to-end across frontend → API → risk engine → game providers/payments → queues/databases.
Sample errors broadly (100%), sample "slow" requests adaptively (e.g. above p95), and keep 1-5% of successful traffic by default.
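The sampling rule above can be sketched as a decision made after a request finishes, when its status and duration are known. This is an illustrative sketch under assumptions: `keep_trace` and its parameters are hypothetical names, the current p95 is assumed to come from the metrics system, and keeping every request above p95 is a simplification of "adaptive" sampling.

```python
import random


def keep_trace(is_error: bool, duration_ms: float, p95_ms: float,
               base_rate: float = 0.05, rng=random.random) -> bool:
    """Decide whether to keep the trace of a finished request."""
    if is_error:
        return True            # errors: keep 100%
    if duration_ms > p95_ms:
        return True            # slower than the current p95: keep (simplification)
    return rng() < base_rate   # fast, successful traffic: keep a small share


# A failed deposit is always kept; a fast successful login is kept about 5% of the time.
print(keep_trace(is_error=True, duration_ms=90.0, p95_ms=400.0))
```

Passing `rng` as a parameter makes the decision deterministic in tests; in production the default `random.random` is enough.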
4) Metric design: what to collect and how to name it
Examples of Prometheus metrics (pseudocode):
# RED for payments
counter ig_payments_requests_total{route="/payments/deposit",method="POST",provider="card"}
counter ig_payments_errors_total{route="/payments/deposit",code="5xx",provider="card"}
hist ig_payments_latency_seconds_bucket{route="/payments/deposit",le="0.25"}
gauge ig_wallet_balance_anomalies{reason="negative_after_loss"}
# Business
counter ig_bet_placed_total{game="slot",provider="PragmaticPlay",currency="EUR"}
hist ig_bet_rtt_ms_bucket{game="live_blackjack",le="100"}
gauge ig_active_tables{provider="Evolution",market="EU"}
Rules:
- A single label taxonomy: `env`, `region`, `market`, `provider`, `route`, `game`, `payment_method`.
- Do not blow up cardinality: keep `user_id` out of metrics (logs and traces only).
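One way to enforce both rules is a small guard that runs before a metric is recorded. The sketch below is an assumption-laden illustration: `LabelGuard`, the per-label limit of 100 values, and the `"other"` overflow value are all hypothetical choices, and a real setup would wire this into whatever metrics client is in use.

```python
SHARED_LABELS = frozenset(
    {"env", "region", "market", "provider", "route", "game", "payment_method"}
)


class LabelGuard:
    """Reject labels outside the shared scheme and cap distinct values per label."""

    def __init__(self, allowed=SHARED_LABELS, max_values=100):
        self.allowed = frozenset(allowed)
        self.max_values = max_values  # hypothetical cardinality budget per label
        self.seen = {name: set() for name in self.allowed}

    def check(self, labels: dict) -> dict:
        unknown = set(labels) - self.allowed
        if unknown:
            raise ValueError(f"labels not in the shared scheme: {sorted(unknown)}")
        safe = {}
        for name, value in labels.items():
            known = self.seen[name]
            if value in known or len(known) < self.max_values:
                known.add(value)
                safe[name] = value
            else:
                safe[name] = "other"  # collapse overflow instead of creating a new series
        return safe


guard = LabelGuard()
print(guard.check({"route": "/payments/deposit", "provider": "card"}))
```

Passing `user_id` to `check` raises `ValueError`, so a per-user label cannot reach a metric unnoticed.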
5) Logs: structure, privacy, retention
Minimum JSON for critical actions:
{
"ts":"2025-10-23T17:41:26.123Z", "level":"INFO", "service":"payments-api", "env":"prod", "trace_id":"b3f7…", "span_id":"ab12…", "user_pid":"u_9fd…", // alias, not email/phone
"session_id":"s_78a…", "route":"/payments/deposit", "status":200, "latency_ms":182, "amount":100.0, "currency":"EUR", "provider":"card", "bin_country":"DE"
}
Practices:
- Mask/exclude PAN/CVV, tokens, passwords, JWT - even in debug.
- Link logs to traces (`trace_id`) and to the customer (alias `user_pid`).
- TTL: "noisy" technical logs 14-30 days, audit trail 1-3 years (per policy and law), business logs 6-24 months (pseudonymized).
- WORM/immutability for audit logs (immutable buckets), role-based ACLs.
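Masking is easiest to enforce at the point where the log line is serialized. The following is a minimal sketch using only the standard library; the key list in `SENSITIVE_KEYS` and the `"***"` marker are assumptions for illustration, and a real deployment would maintain its own list.

```python
import json

# Hypothetical list of keys that must never reach a log sink, even in debug.
SENSITIVE_KEYS = {"pan", "cvv", "password", "token", "jwt", "authorization"}


def mask(value):
    """Recursively replace the values of sensitive keys with a fixed marker."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else mask(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [mask(item) for item in value]
    return value


def log_line(event: dict) -> str:
    """Serialize an event as one compact JSON line with secrets masked."""
    return json.dumps(mask(event), separators=(",", ":"), sort_keys=True)


print(log_line({
    "route": "/payments/deposit",
    "status": 200,
    "card": {"pan": "4111111111111111", "cvv": "123"},
}))
```

Masking by key name does not catch secrets embedded in free-text messages, so it complements rather than replaces a rule against logging raw request bodies.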
6) Tracing: from frontend to provider
End-to-end flows
Login/registration → anti-bot/WAF → Auth-API → profile/wallet.
Deposit → Payment-API → provider → webhooks → Wallet-service.
Bet → Game-gateway (WebSocket) → game provider → winnings settlement → Wallet.
Tactics
OpenTelemetry everywhere: SDK in the frontend (XHR/Fetch), on mobile, in APIs, and in workers.
Context propagation: W3C `traceparent`/`tracestate`; pass it through gRPC/HTTP/WebSocket (for WS, in the initial metadata or the first messages).
Adaptive sampling: 100% for errors, ≥50% for payment withdrawals, ≥10% for "new" releases/canaries, 1-5% for background traffic.
Visible tags in the trace view: `risk_decision`, `provider_name`, `bonus_id`, `jackpot_round`.
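To make the context propagation tactic concrete, here is a small sketch that parses a W3C `traceparent` value and builds the header for an outgoing call. In practice the OpenTelemetry SDK does this; the function names below are hypothetical, and the sketch handles only the version `00` format (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`).

```python
import re
import secrets

# Version 00 only: version, trace-id, parent-id, trace-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero ids are invalid per the spec
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header: str) -> str:
    """Header for an outgoing call: same trace id, fresh span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        trace_id, sampled = secrets.token_hex(16), False  # start a new trace
    else:
        trace_id, _, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

For WebSocket, the same string can travel as a field of the first message after the handshake, since per-message HTTP headers are not available there.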
7) Real-time channels: WebSocket/WebRTC
Metrics: `ws_connected_sessions`, `ws_messages_in_flight`, `ws_send_latency_ms`, `ws_disconnect_reason`.
Trace events: `ws_subscribe_table`, `ws_bet_place`, `ws_settlement`.
Logs: normalize message size/frequency; track "empty pings" and flood patterns.
For WebRTC (live casino): `jitter_ms`, `packet_loss`, `round_trip_time_ms`, `keyframe_interval_s`.
8) Alerting: from symptoms to causes
Symptomatic alerts (SLO/SLA):
- Login SLI error rate > 0.3% over 5 min.
- p95 of `/payments/deposit` > 400 ms for 10 min in a row.
- Bet success < 99.7% over 15 min.
Causal alerts:
- `db_connections_saturation > 0.85` for 5 min; `queue_lag_seconds > 30`.
- A burst of `429`/`5xx` from a single ASN → signal to the WAF/bot manager.
Alert only on sustained degradation; suppress duplicates automatically; link alerts to runbooks.
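The "sustained degradation" requirement can be sketched as a condition that must hold for N consecutive samples before the alert fires. This is a hedged illustration, assuming one sample per minute; `SustainedAlert` is a hypothetical name, and alerting systems normally provide this as a built-in hold duration.

```python
from collections import deque


class SustainedAlert:
    """Fire only when every sample in the last `for_minutes` exceeds the threshold."""

    def __init__(self, threshold: float, for_minutes: int):
        self.threshold = threshold
        self.window = deque(maxlen=for_minutes)  # keeps only the newest samples

    def observe(self, value: float) -> bool:
        self.window.append(value)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v > self.threshold for v in self.window)


# p95 of /payments/deposit above 400 ms for 10 minutes in a row.
deposit_p95 = SustainedAlert(threshold=400.0, for_minutes=10)
states = [deposit_p95.observe(450.0) for _ in range(10)]
print(states[-1])
```

A single sample back under the threshold resets the condition, which is what keeps short spikes from paging anyone.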
9) Dashboards that really help
"Deposit Flow"
Funnel: request → redirect to the provider → callback → wallet update.
Success/errors by provider, BIN country map, p95/p99 latency, distribution of error codes.
"Live Games/Bets"
Active tables, online players, p95 WS latency, share of timeouts/aborts, top games by errors.
"API Health"
RED on key routes, 4xx/5xx, connection pool saturation, CPU, GC, top N slow endpoints (with links to traces).
10) Cost and storage: how not to go broke
Cardinality budget: limits on labels/attributes; review of PRs that add metrics.
Tiered storage: hot 3-7 days (quick search), warm 30-90 days (S3/object storage), cold archive (rarely accessed).
Downsampling of metrics (1s → 10s → 1m) and rollup aggregation.
Deduplication of logs from retries and idempotent calls.
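The downsampling step (1s → 10s → 1m) amounts to grouping points into fixed time buckets and aggregating each bucket. A minimal sketch, assuming gauge-like values where a mean is meaningful; the `downsample` function is a hypothetical helper, and counters or histogram buckets would need sums rather than means.

```python
def downsample(points, step):
    """Average (timestamp_seconds, value) points into buckets of `step` seconds."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % step  # align to the start of the bucket
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return [(start, total / count) for start, (total, count) in sorted(buckets.items())]


# 120 one-second samples become 12 ten-second points, then 2 one-minute points.
raw = [(t, float(t % 10)) for t in range(120)]
ten_seconds = downsample(raw, 10)
one_minute = downsample(ten_seconds, 60)
print(len(ten_seconds), len(one_minute))
```

Averaging already averaged points is exact only when every bucket holds the same number of samples; with gaps in the data, carrying the count forward gives a correct weighted mean.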
11) Privacy and compliance (short)
Pseudonymize `user_id`; do not store email, phone, or passport data in logs.
Encrypt data in transit (mTLS) and at rest, restrict access (RBAC/MFA), and keep data access logs.
TTL/retention per the data retention matrix; the "right to erasure" is implemented through deactivation flags and pseudonymization in historical datasets.
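One common way to pseudonymize an identifier is a keyed hash, so the alias is stable for correlation but cannot be reversed or recomputed without the key. The sketch below uses HMAC-SHA256 as an assumption; the source does not prescribe a method, and the `pseudonymize` name, the 16-character truncation, and the `u_` prefix (mirroring the alias style of the log example) are illustrative choices.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, key: bytes, prefix: str = "u_") -> str:
    """Derive a stable alias from a user id with a secret key (HMAC-SHA256)."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:16]  # truncated for readability in logs


# The key is a placeholder; a real one would come from a secret manager.
alias = pseudonymize("user-12345", key=b"example-key-not-for-production")
print(alias)
```

Because the alias depends on the key, destroying or rotating the key for a dataset makes its historical aliases unlinkable to real accounts.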
12) Incidents and trace debugging: quick recipe
1. A symptomatic alert (deposit success) fired.
2. The dashboard showed a spike isolated to a single provider.
3. Drill-down in the trace view: a long `provider_callback` step (p99 2.3 s), many retries.
4. Logs: `timeout` + hosting ASN = bot pattern.
5. Action: raised the callback timeouts, enabled a JS challenge in the WAF for that ASN, limited retries.
6. Retro: added an SLI on `callback_success_ratio` and an alert on `queue_lag_seconds`.
13) Implementation by phase
1. SLO design for 4-6 critical flows (login, deposit, withdrawal, game launch, bet).
2. RED/USE + business SLI metrics; single label scheme.
3. Structured logs with `trace_id`; masking of sensitive fields.
4. OpenTelemetry everywhere; adaptive sampling.
5. Dashboards + alerts (symptomatic and causal), runbooks.
6. Cost management: cardinality, downsampling, storage levels.
7. Exercises: GameDay scenarios (payment drop, provider lag, WS surge).
8. Continuous improvement: add SLI when new features appear, close the "blind spots."
14) Checklist (prod-ready)
- SLO/SLI approved, error budget in release policy.
- RED/USE metrics + business metrics with a single label ontology.
- JSON logs, secrets masked, `trace_id` in each message.
- End-to-end tracing (HTTP/gRPC/WebSocket/WebRTC), W3C context.
- Alerts are symptomatic and causal, without noise, with links to runbooks.
- Dashboards for deposits, bets, API health; quick filters by `provider`/`market`.
- Sampling/cardinality under control, tiered storage.
- Privacy: pseudonymization, encryption, RBAC/MFA, access logs.
- Drills and retrospectives, regular SLO review.
Summary
Observability in iGaming is not "CPU graphs" but a real-time picture of the product: SLOs for critical flows, RED/USE metrics, and correlated logs and traces along the entire path of the player and the money. Add alerting discipline tied to the error budget, control the cost of telemetry, and respect privacy, and the team will stop guessing: it will see the causes of problems and fix them before players notice.