Observability: metrics, logs, tracing in iGaming

1) Why observability matters in iGaming

Players are sensitive to delays and failures in real time (live games, bets, tournaments). Any degradation of login, deposit, or withdrawal hits revenue and trust. Observability should:

provide a picture across network layers L3-L7, applications, and the business;
quickly localize bottlenecks between the frontend, APIs, game providers, and payments;
clearly separate product failures (a player cannot place a bet) from "pretty" technical metrics.

Key point: start with SLOs (service level objectives) for product flows, and only then select metrics, logs, and traces.

2) Product SLOs and error budget

Examples of SLOs (over 30 days):

Login: success ≥ 99.90%, p95 latency ≤ 250 ms.
Deposit (`/payments/deposit`) and withdrawal: success ≥ 99.85%, p95 ≤ 400 ms.
Real-time bet: success ≥ 99.9%, p95 WS message latency ≤ 120 ms.
Starting a slot or live game session: success ≥ 99.8%, p95 ≤ 800 ms.

The error budget translates into release policy: once more than 50% is consumed, freeze features and allow canary deploys only; above 80%, ship bug fixes only.
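
The release policy above can be written as a small calculation. The following is a minimal Python sketch, assuming the SLI is a plain success ratio over the SLO window; the names `error_budget_status` and `BudgetStatus` and the policy strings are illustrative and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    consumed: float  # fraction of the error budget used: 0.0 = none, 1.0 = all
    policy: str      # release policy implied by the 50% / 80% thresholds


def error_budget_status(slo_target: float, total: int, failed: int) -> BudgetStatus:
    """Compare observed failures in the SLO window with the allowed failures."""
    if total <= 0:
        return BudgetStatus(consumed=0.0, policy="normal releases")
    allowed_failures = (1.0 - slo_target) * total  # the error budget, in requests
    consumed = failed / allowed_failures if allowed_failures > 0 else float("inf")
    if consumed > 0.8:
        policy = "bug fixes only"
    elif consumed > 0.5:
        policy = "feature freeze, canary only"
    else:
        policy = "normal releases"
    return BudgetStatus(consumed=consumed, policy=policy)


# Hypothetical numbers: deposit SLO of 99.85%, 1,000,000 requests, 900 failures.
status = error_budget_status(slo_target=0.9985, total=1_000_000, failed=900)
print(f"{status.consumed:.0%} of budget consumed -> {status.policy}")
```

Counting the budget in requests rather than in minutes of downtime keeps the calculation tied to what players actually experienced.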

3) The "three pillars" of telemetry

Metrics (quantifying state)

RED for user-facing APIs: Rate, Errors, Duration for each endpoint/method.

USE for infrastructure: Utilization, Saturation, Errors (CPU, memory, IO, connections, queues).

Business metrics: registration → deposit conversion, bet success rate, number of active live casino tables, average odds update delay.

Logs (facts and context)

Structured JSON events with required fields: `ts`, `level`, `service`, `env`, `trace_id`, `span_id`, `user_id` (pseudonymized), `session_id`, `route`, `status`, `latency_ms`, `amount`, `currency`, `provider`.

Categories: audit (changes to permissions/balance), business events (bet, deposit), errors (stack/code), technical (warn/info).

Tracing (Cause & Effect)

End-to-end across frontend → API → risk engine → game providers/payments → queues/databases.

Sample all errors (100%), adaptively sample "slow" requests (e.g. above p95), and by default keep 1-5% of successful traffic.
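
The sampling rule above can be sketched as a decision taken once the outcome of a request is known. This is a minimal Python sketch and not an OpenTelemetry sampler; the function name `should_sample` and its parameters are assumptions for illustration. As a simplification it keeps every request above the slow threshold, whereas a production setup would adapt that rate to load.

```python
import random
from typing import Optional


def should_sample(is_error: bool, latency_ms: float, slow_threshold_ms: float,
                  base_rate: float = 0.05,
                  rng: Optional[random.Random] = None) -> bool:
    """Decide whether to keep the trace of one finished request."""
    if is_error:
        return True   # keep every failed request
    if latency_ms > slow_threshold_ms:
        return True   # keep requests slower than the p95-style threshold
    # Keep only a small share of ordinary successful traffic.
    return (rng or random).random() < base_rate


# Hypothetical usage: a 300 ms deposit against a 250 ms threshold is always kept.
print(should_sample(is_error=False, latency_ms=300, slow_threshold_ms=250))
```

Passing an explicit `rng` makes the decision reproducible in tests.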

4) Metrics design: what to collect and how to name it

Examples of Prometheus metrics (pseudo):


# RED for payments
counter ig_payments_requests_total{route="/payments/deposit",method="POST",provider="card"}
counter ig_payments_errors_total{route="/payments/deposit",code="5xx",provider="card"}
hist    ig_payments_latency_seconds_bucket{route="/payments/deposit",le="0.25"}
gauge   ig_wallet_balance_anomalies{reason="negative_after_loss"}

# Business
counter ig_bet_placed_total{game="slot",provider="PragmaticPlay",currency="EUR"}
hist    ig_bet_rtt_ms_bucket{game="live_blackjack",le="100"}
gauge   ig_active_tables{provider="Evolution",market="EU"}

Rules:

A single label scheme: `env`, `region`, `market`, `provider`, `route`, `game`, `payment_method`.
Do not blow up cardinality: keep `user_id` out of metric labels (logs and traces only).
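
The cardinality rule can be enforced in code before a label set ever reaches the metrics backend. Below is a minimal Python sketch, independent of any metrics library; the class `CardinalityGuard`, the forbidden-label set, and the per-metric limit are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Labels whose values are unbounded; they belong in logs and traces, not metrics.
FORBIDDEN_LABELS = {"user_id", "user_pid", "session_id", "trace_id"}


class CardinalityGuard:
    """Rejects label sets that would create unbounded or excessive time series."""

    def __init__(self, max_series_per_metric: int = 1000) -> None:
        self.max_series = max_series_per_metric
        self._series: Dict[str, Set[Tuple[Tuple[str, str], ...]]] = defaultdict(set)

    def accept(self, metric: str, labels: Dict[str, str]) -> bool:
        if any(name in FORBIDDEN_LABELS for name in labels):
            return False          # per-user identifiers never become labels
        key = tuple(sorted(labels.items()))
        known = self._series[metric]
        if key in known:
            return True           # existing series, no new cardinality
        if len(known) >= self.max_series:
            return False          # series budget for this metric is exhausted
        known.add(key)
        return True


guard = CardinalityGuard(max_series_per_metric=1000)
print(guard.accept("ig_payments_requests_total",
                   {"route": "/payments/deposit", "provider": "card"}))
```

A rejected label set is a signal for review rather than a silent drop: in practice the caller would log it or fall back to a coarser label.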

5) Logs: structure, privacy, retention

Minimum JSON for critical actions (`user_pid` is an alias, not an email or phone number):

{
  "ts": "2025-10-23T17:41:26.123Z", "level": "INFO", "service": "payments-api", "env": "prod",
  "trace_id": "b3f7…", "span_id": "ab12…", "user_pid": "u_9fd…", "session_id": "s_78a…",
  "route": "/payments/deposit", "status": 200, "latency_ms": 182,
  "amount": 100.0, "currency": "EUR", "provider": "card", "bin_country": "DE"
}

Practices:

Mask or exclude PAN/CVV, tokens, passwords, and JWTs, even in debug logs.
Bind logs to traces (`trace_id`) and to the customer (the `user_pid` alias).
TTL: "noisy" technical logs 14-30 days, audit trail 1-3 years (per policy and law), business logs 6-24 months (pseudonymized).
WORM/immutability for audit logs (immutable buckets), role-based ACLs.
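
The masking and trace-binding practices above can be combined in one logging helper. This is a minimal standard-library Python sketch; the key list `SENSITIVE_KEYS`, the card-number pattern, and the function names are assumptions, and a real deployment would need a stricter PAN detector.

```python
import json
import re
from datetime import datetime, timezone

# Assumed list of field names that must never be logged in clear text.
SENSITIVE_KEYS = {"pan", "cvv", "password", "token", "jwt", "authorization"}
# Crude card-number shape (13-19 digits in a row); an assumption, not a validator.
PAN_PATTERN = re.compile(r"\b\d{13,19}\b")


def mask(value):
    """Recursively mask sensitive keys and card-like digit runs."""
    if isinstance(value, dict):
        return {k: "***" if k.lower() in SENSITIVE_KEYS else mask(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [mask(v) for v in value]
    if isinstance(value, str):
        return PAN_PATTERN.sub("***", value)
    return value


def log_event(level, service, trace_id, user_pid, **fields):
    """Build one JSON log line bound to a trace and a pseudonymized user."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level, "service": service,
        "trace_id": trace_id, "user_pid": user_pid,
    }
    record.update(mask(fields))
    return json.dumps(record, separators=(",", ":"))


print(log_event("INFO", "payments-api", "trace-123", "u_alias",
                route="/payments/deposit", status=200, cvv="123"))
```

Masking at the point where the record is built, rather than in the log pipeline, means secrets never leave the process even when debug logging is on.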

6) Tracing: from front to provider

End-to-end flows

Deposit → Payment-API → provider → webhooks → Wallet-service.

Bet → Game-gateway (WebSocket) → game provider → winnings calculation → Wallet.

Tactics

OpenTelemetry everywhere: SDKs in the frontend (XHR/Fetch), mobile apps, APIs, and workers.

Context propagation: W3C `traceparent`/`tracestate`, passed through gRPC/HTTP/WebSocket (for WS, in the initial metadata or first messages).

Adaptive sampling: 100% for errors, ≥50% for withdrawals, ≥10% for "new" releases/canaries, 1-5% for background traffic.

Searchable attributes in the trace view: `risk_decision`, `provider_name`, `bonus_id`, `jackpot_round`.
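
To make the context propagation concrete, here is a minimal Python sketch of the W3C `traceparent` header (version `00`: a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags). It is not the OpenTelemetry SDK; the function names and the WebSocket envelope shape are hypothetical, and handling of future header versions and `tracestate` is omitted.

```python
import re
import secrets
from typing import Optional, Tuple

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header: str) -> str:
    """Continue the caller's trace with a fresh span ID; start a new trace if invalid."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return new_traceparent()
    trace_id, _, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"


def ws_envelope(header: str, payload: dict) -> dict:
    """Hypothetical first WebSocket message carrying the trace context."""
    return {"traceparent": header, "payload": payload}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(ws_envelope(child_traceparent(incoming), {"type": "ws_subscribe_table"}))
```

HTTP and gRPC carry this value as a header or metadata entry; WebSocket has no per-message headers, which is why the sketch places it in the message body.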

7) Real-time channels: WebSocket/WebRTC

Metrics: `ws_connected_sessions`, `ws_messages_in_flight`, `ws_send_latency_ms`, `ws_disconnect_reason`.

Trace events: `ws_subscribe_table`, `ws_bet_place`, `ws_settlement`.

Logs: set limits on message size and frequency; track "empty pings" and flood patterns.

For WebRTC (live casino): `jitter_ms`, `packet_loss`, `round_trip_time_ms`, `keyframe_interval_s`.

8) Alerting: from symptoms to causes

Symptomatic alerts (SLO/SLA):

Login SLI error rate > 0.3% over 5 min.
p95 of `/payments/deposit` > 400 ms for 10 min in a row.
Bet success < 99.7% over 15 min.

Causal/Resource:

`db_connections_saturation > 0.85` for 5 min; `queue_lag_seconds > 30`.
A burst of `429`/`5xx` responses from a single ASN → signal to the WAF/bot manager.

Noise reduction:

Alert only on sustained degradation; automatically suppress duplicates; link every alert to a runbook.
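
The "sustained degradation, no duplicates" rule can be sketched as a small state machine. This is an illustrative Python sketch and not the rule syntax of any alerting system; the class `SustainedAlert`, its parameters, and the runbook path are hypothetical.

```python
from typing import Dict, Optional


class SustainedAlert:
    """Fires only after the threshold is breached for hold_s seconds,
    and not again until repeat_s seconds have passed since the last firing."""

    def __init__(self, name: str, threshold: float, hold_s: float,
                 repeat_s: float, runbook: str) -> None:
        self.name = name
        self.threshold = threshold
        self.hold_s = hold_s
        self.repeat_s = repeat_s
        self.runbook = runbook
        self._breach_since: Optional[float] = None
        self._last_fired: Optional[float] = None

    def observe(self, value: float, now: float) -> Optional[Dict[str, str]]:
        if value <= self.threshold:
            self._breach_since = None   # recovery resets the hold timer
            return None
        if self._breach_since is None:
            self._breach_since = now
        if now - self._breach_since < self.hold_s:
            return None                 # breach is not yet sustained
        if self._last_fired is not None and now - self._last_fired < self.repeat_s:
            return None                 # duplicate inside the suppression window
        self._last_fired = now
        return {"alert": self.name, "runbook": self.runbook}


# p95 of the deposit route above 400 ms for 10 minutes, re-notify at most hourly.
alert = SustainedAlert("deposit_p95_latency", threshold=400, hold_s=600,
                       repeat_s=3600, runbook="runbooks/deposit-latency")
for t in (0, 300, 600, 660):
    print(t, alert.observe(450, now=t))
```

Timestamps are passed in explicitly so the logic can be tested without waiting in real time.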

9) Dashboards that really help

"Deposit Flow"

Funnel: request → redirect to the provider → callback → wallet update.

Success/errors by provider, BIN country map, p95/99 latency, distribution of error codes.

"Live Games/Bets"

Active tables, online players, p95 WS latency, share of timeouts/aborts, top games by errors.

"API Health"

RED on key routes, 4xx/5xx rates, connection pool saturation, CPU, GC, top N slow endpoints (with links to traces).

10) Cost and storage: how not to go broke

Cardinality budget: limits on labels/attributes; review every PR that adds metrics.

Tiered storage: hot 3-7 days (fast search), warm 30-90 days (S3/object storage), cold archive (rarely accessed).

Downsampling metrics (1s → 10s → 1m) and rolling aggregation.

Deduplication of logs from retries and idempotent calls.
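
The 1s → 10s → 1m downsampling chain can be sketched for gauge-like samples as follows. This is a minimal Python sketch; the `Bucket` layout and function names are assumptions. Each bucket keeps a count and a sum rather than a mean, so that a second downsampling pass stays exact.

```python
from typing import Dict, Iterable, List, Tuple

# (bucket_start_s, sample_count, sum_of_values, max_value)
Bucket = Tuple[int, int, float, float]


def from_raw(points: Iterable[Tuple[int, float]]) -> List[Bucket]:
    """Wrap raw (timestamp_s, value) samples as single-sample buckets."""
    return [(ts, 1, value, value) for ts, value in points]


def downsample(buckets: Iterable[Bucket], step_s: int) -> List[Bucket]:
    """Merge buckets into windows of step_s seconds aligned to multiples of step_s."""
    merged: Dict[int, List[float]] = {}
    for start, count, total, peak in buckets:
        key = start - start % step_s
        if key not in merged:
            merged[key] = [count, total, peak]
        else:
            slot = merged[key]
            slot[0] += count
            slot[1] += total
            slot[2] = max(slot[2], peak)
    return [(key, int(c), t, p) for key, (c, t, p) in sorted(merged.items())]


# Two minutes of hypothetical 1-second samples.
raw = [(t, float(t % 10)) for t in range(120)]
ten_s = downsample(from_raw(raw), 10)   # 1s -> 10s
one_m = downsample(ten_s, 60)           # 10s -> 1m
for start, count, total, peak in one_m:
    print(start, "mean:", total / count, "max:", peak)
```

Keeping the maximum alongside the sum preserves short spikes that a plain average would hide. Histograms and counters need their own merge rules and are outside this sketch.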

11) Privacy and compliance (short)

Pseudonymize `user_id`; do not store email, phone, or passport data in logs.

Encrypt data in transit (mTLS) and at rest, restrict access (RBAC/MFA), and keep data access logs.

TTL/retention per the data retention matrix; the "right to erasure" is implemented through deactivation flags and pseudonymization in historical datasets.
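
One common way to pseudonymize an identifier is a keyed hash. The sketch below is a minimal Python example using HMAC-SHA-256 from the standard library; the function name `pseudonymize`, the `u_` prefix, and the 16-character truncation are assumptions, and in practice the key would come from a secret store rather than a literal in code.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes, prefix: str = "u_") -> str:
    """Derive a stable alias for user_id that cannot be reversed without the key."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:16]  # truncated for readability in logs


# Hypothetical key for demonstration only.
key = b"demo-key-not-for-production"
print(pseudonymize("player-12345", key))
```

A keyed hash is preferable to a plain hash here because sequential or low-entropy user IDs can be recovered from an unkeyed hash by simply trying every candidate.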

12) Incidents and trace debugging: quick recipe

1. A symptomatic alert (deposit success) fired.

2. The dashboard showed a spike isolated to a single provider.

3. Drilling into the trace view: a long `provider_callback` step (p99 2.3 s) and many retries.

4. Logs: `timeout` errors plus a hosting-provider ASN, which is a bot pattern.

5. Action: raised callback timeouts, enabled a JS challenge in the WAF for that ASN, limited retries.

6. Retro: added an SLI for `callback_success_ratio` and an alert on `queue_lag_seconds`.

13) Implementation by phase

1. SLO design for 4-6 critical flows (login, deposit, withdrawal, game launch, bet).

2. RED/USE + business SLI metrics; a single label scheme.

3. Structured logs with `trace_id`; masking of sensitive fields.

4. OpenTelemetry everywhere; adaptive sampling.

5. Dashboards + alerts (symptomatic and causal), runbooks.

6. Cost management: cardinality, downsampling, storage levels.

7. Exercises: GameDay scenarios (payment drop, provider lag, WS surge).

8. Continuous improvement: add SLI when new features appear, close the "blind spots."

14) Checklist (prod-ready)

SLO/SLI approved, error budget in release policy.
RED/USE metrics + business metrics with a single label ontology.
JSON logs, secrets masked, `trace_id` in every message.
End-to-end tracing (HTTP/gRPC/WebSocket/WebRTC), W3C context.
Alerts are symptomatic and causal, low-noise, with links to runbooks.
Dashboards for deposits, bets, and API health; quick filters by `provider`/`market`.
Sampling and cardinality under control, tiered storage.
Privacy: pseudonymization, encryption, RBAC/MFA, access logs.
Drills and retro, regular SLO revision.

Summary

Observability in iGaming is not "CPU graphs" but a real-time picture of the product: SLOs for critical flows, RED/USE metrics, and correlated logs and traces along the entire path of the player and their money. Add error-budget alerting discipline, control the cost of telemetry, and respect privacy, and the team will see the causes of problems instead of guessing, and fix them before players notice.