Casino 24/7 and on-call practices
1) Goals of 24/7 operations
Business SLOs: login ≥ 99.9%, deposit ≥ 99.85%, bet/settlement ≥ 99.9%, p95 WS RTT ≤ 120 ms.
Incident targets: MTTD ≤ 1 min (via synthetic checks), MTTR ≤ 15-30 min for money flows.
Quality of support: fewer than 3% of tickets carry over to a second day without a response; support CSAT ≥ 90%.
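The availability targets above translate into concrete error budgets. The sketch below shows the arithmetic only; the 30-day window and the dictionary keys are assumptions made for illustration, not values taken from the text.

```python
# Allowed minutes of full unavailability per window for each availability SLO.
# ASSUMPTION: a 30-day rolling window; the source does not state the window length.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

# Targets taken from the business SLO line above.
SLO_TARGETS = {"login": 0.999, "deposit": 0.9985, "settlement": 0.999}


def error_budget_minutes(target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1.0 - target) * window_minutes


for flow, target in SLO_TARGETS.items():
    print(f"{flow}: {error_budget_minutes(target):.1f} min per 30 days")
```

Under these assumptions the login budget is about 43.2 minutes and the deposit budget about 64.8 minutes per 30 days, so a single full deposit outage resolved within the 15-30 minute MTTR target would consume roughly a quarter to a half of the monthly budget.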
2) On-call organization: models and schedules
Models
Follow-the-sun: 3 geo-teams (Europe/Americas/APAC), minimal night work.
Regional night rotation: each person takes a week of night shifts once every N weeks (with compensation or time off).
Cell-based: on-call duty per product cell (brands/markets) plus a shared L1.
Roles on shift
L1 On-call (Incident Commander by default) - acknowledges the alert, coordinates, stays in touch with support.
L2 Domain engineers - payments, game-gateway/WS, database/wallet, platform SRE.
Comms officer - status page, partners/providers, internal updates.
Duty Manager - business escalation, prioritization, exceptions (VIP/regulator).
Shift template (12×7 or 8×5 plus on-call shifts)
Shift length: 8, 10, or 12 hours. Shift change: a 15-30 min "warm handover."
Rule: at most 2 consecutive nights and no more than 7 on-call days in any 14-day window.
Each shift has a roster: primary on-call, backup, duty manager, L2 contacts.
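The rotation limits above (at most 2 consecutive nights, at most 7 on-call days in any 14-day window) are easy to check automatically before a roster is published. This is a minimal sketch; the function name and the input shape (sets of dates per engineer) are hypothetical.

```python
from datetime import date, timedelta

MAX_CONSECUTIVE_NIGHTS = 2   # from the shift rule above
MAX_ONCALL_DAYS = 7          # from the shift rule above
WINDOW_DAYS = 14             # from the shift rule above


def roster_violations(oncall_days: set[date], night_days: set[date]) -> list[str]:
    """Return human-readable rule violations for one engineer's schedule."""
    problems = []

    # Rule 1: no run of consecutive night shifts longer than the maximum.
    for day in sorted(night_days):
        if day - timedelta(days=1) in night_days:
            continue  # this day is inside a run, not its start
        run = 1
        while day + timedelta(days=run) in night_days:
            run += 1
        if run > MAX_CONSECUTIVE_NIGHTS:
            problems.append(f"{run} consecutive nights starting {day}")

    # Rule 2: no 14-day window with more than 7 on-call days.
    # Checking windows that start on an on-call day is sufficient.
    for start in sorted(oncall_days):
        end = start + timedelta(days=WINDOW_DAYS)
        count = sum(1 for d in oncall_days if start <= d < end)
        if count > MAX_ONCALL_DAYS:
            problems.append(f"{count} on-call days in the 14-day window from {start}")

    return problems
```

A check like this could run whenever the schedule changes, so that violations surface before the shift starts rather than after.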
3) Classification of incidents and SLAs
4) Alerting without noise
Principles: symptom-based SLO alerts first, then causal resource alerts, then context.
Symptoms: `login_success_ratio↓`, `deposit_success_by_psp↓`, `ws_rtt_p95↑`, `game_launch_success↓`.
Causes: `db_conn_saturation↑`, `queue_lag↑`, `psp_timeout↑`, `provider_launch_latency↑`.
Noise protection: require ≥ 3 consecutive violations, auto-suppress during releases, deduplicate and group alerts.
Routing: critical alerts go to PagerDuty/Opsgenie; everything else goes to Slack/email.
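The noise-protection rule above (at least 3 consecutive violations, suppression during releases, deduplication) can be sketched as a small gate in front of the pager. The class and attribute names are hypothetical, and a real setup would normally express this in the alerting system's own rule language rather than in application code.

```python
from collections import defaultdict

REQUIRED_CONSECUTIVE = 3  # from the noise-protection rule above


class AlertGate:
    """Decides whether an evaluation result should page someone."""

    def __init__(self, required: int = REQUIRED_CONSECUTIVE):
        self.required = required
        self.streak = defaultdict(int)    # consecutive violations per alert key
        self.firing = set()               # keys already paged (deduplication)
        self.release_in_progress = False  # set by the deploy pipeline (assumption)

    def observe(self, key: str, violated: bool) -> bool:
        """Record one evaluation; return True only when a new page should be sent."""
        if not violated:
            self.streak[key] = 0
            self.firing.discard(key)
            return False
        self.streak[key] += 1
        if self.release_in_progress:
            return False  # auto-suppress while a release is rolling out
        if self.streak[key] >= self.required and key not in self.firing:
            self.firing.add(key)
            return True
        return False


gate = AlertGate()
results = [gate.observe("deposit_success_by_psp:DE", True) for _ in range(4)]
print(results)  # [False, False, True, False]
```

The page goes out on the third consecutive violation and is not repeated while the condition persists; a single healthy evaluation resets the streak.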
Alert text: "What/Where/How much/Action."
5) Runbooks and escalations
Runbook Mini Template
1. Detection: links to dashboards (SLO, causal), traces, logs.
2. Quick checks: PSP/provider health, DR-region synthetics, DB/cache status.
3. Temporary measures: feature flags/kill switch, rate limits, PSP/provider switching, degradation of heavy features.
4. Escalation: who the L2/L3 is, 24×7 provider contacts.
5. Green-zone criteria: SLO normal for N minutes, queues back to normal.
6. Comms: status template, affected markets/brands, ETA/next update.
Escalation ladder
T0-5 min: L1 acknowledges, assigns an IC, starts the runbook.
T5-10 min: page the relevant L2 and the Comms officer.
T10-15 min: bring in the Duty Manager/product, plus legal/compliance if necessary.
External: PSP/game provider, per the agreed procedure (SLA channel, ticket, call).
6) Communications and status page
Internal updates every 10-15 minutes for SEV-1/2 (#war-room channel, message template).
Status page: current status, affected markets, interim measures, next update in X min.
Post-incident note for support/affiliates/partners: what happened and how to compensate.
Prepare templates in advance: short, no internal details, no blame.
7) Working with external dependencies (PSP/games/CDN)
24×7 contact directory: PSP A/B, game providers, CDN/WAF, cloud.
SLA monitoring: synthetic checks on deposits and game launches, automatic ticket triggers.
Failover policies: route to PSP-B at `success < 99% for 10 min`; switch game provider at `TTFS > 800 ms`.
Inbound webhooks: HMAC signature, idempotency, replay from the queue after provider degradation.
8) GameDay and drills
Weekly tabletop exercises (30-45 minutes): reading graphs, making decisions.
Monthly technical DR drills (60-90 min): PSP failure, provider lag, database/WS cluster outage.
Exercise KPIs: time to identify the cause, quality of communications, correctness of feature-flag decisions.
9) Handover and documentation
10) On-call health and sustainability
8/8/8 rule: work/sleep/personal time. Night shifts are compensated with time off.
Buddy system for newcomers, shadow duty for 2-3 weeks.
Psychological safety: blameless retros, support after serious incidents.
Load audit: the target is ≤ 2 wake-ups per night on average per engineer; above that, rework the alerts/architecture.
11) Operational Performance Metrics
MTTD/MTTR by domain (login/deposit/WS/games).
Alert quality: % noisy or closed without action, average number of alerts per shift.
Change failure rate: % of incidents caused by releases; mean time between failures.
Toil: share of repeatable manual tasks, feeding the automation plan.
Provider impact: share of SEV-1/2 incidents caused by external partners (an argument for SLA renegotiation or migration).
12) Tools and dashboards for the on-call engineer
"Red" SLO dashboard: login/deposit/bets/game launches, 5xx/429, p95, regions.
Causal panels: DB/queues/cache, PSP/providers, CDN/WAF.
On-call dispatcher: active incidents, update timers, one-click links to runbooks and feature flags.
Timeline: who did what and when, tied to the SLO.
13) Typical scenarios and quick fixes
A. Deposits fall in DE at PSP-A
Actions: canary-route 50% to PSP-B; raise the webhook timeout; enable the WAF JS challenge against bots.
Comms: status page "Degraded DE deposits via PSP-A".
Exit: success ≥ 99% for 15 min, retry queue back to normal.
B. p95 WS RTT rises in APAC live games
Actions: scale up WS gateway replicas, enable the warm node pool; rate-limit broadcast messages; open an RTT ticket with the provider.
Exit: p95 WS RTT ≤ 120 ms for 20 min.
C. Game provider lag (TTFS > 1.2 s)
Actions: switch the lobby to alternative tables/studios, enable the metadata cache; update the status page.
Exit: TTFS < 800 ms, fewer complaints.
14) 24/7 Readiness Checklist
15) Post-mortem template (blameless)
1. In brief: what happened and when, which SEV, impact and scope.
2. Timeline: detection → escalation → action → stabilization.
3. Root causes: technology/processes/people/suppliers (5 Whys).
4. What worked and what didn't: alerts, runbooks, communications.
5. Action items: technical, process, partner, each with an owner and a deadline.
6. Prevention: tests/monitoring/drills, SLO/alert changes.
Summary
Successful 24/7 casino operations rest on SLO discipline, well-designed low-noise alerting, clear runbooks and escalations, regular exercises, and respect for the people on call. Link SLO panels to fast levers (feature flags, PSP/provider switching, degradation of heavy features), maintain communications with players and partners, and measure efficiency (MTTD/MTTR/alert quality); your platform will then be stable around the clock and your team productive and resilient.
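The webhook handling described in section 7 (HMAC signature plus idempotency) can be illustrated with the standard library alone. This is a sketch under stated assumptions: the class name, the use of SHA-256 with a hex-encoded signature, and the in-memory set of seen event IDs are all hypothetical, and each PSP or game provider defines its own signing scheme.

```python
import hashlib
import hmac


class WebhookInbox:
    """Verifies provider webhooks and drops duplicates (illustrative only)."""

    def __init__(self, secret: bytes):
        self._secret = secret
        # ASSUMPTION: in-memory store; a real inbox would need a durable one
        # so that replays after provider degradation stay idempotent.
        self._seen_ids: set[str] = set()

    def signature(self, body: bytes) -> str:
        """Hex HMAC-SHA256 of the raw request body."""
        return hmac.new(self._secret, body, hashlib.sha256).hexdigest()

    def accept(self, event_id: str, body: bytes, signature_hex: str) -> str:
        """Return 'rejected', 'duplicate', or 'processed' for one delivery."""
        expected = self.signature(body)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return "rejected"
        if event_id in self._seen_ids:
            return "duplicate"  # safe to acknowledge, must not be applied twice
        self._seen_ids.add(event_id)
        return "processed"
```

Because duplicates are acknowledged without being re-applied, the provider (or an internal queue) can replay everything sent during a degradation window without double-crediting a deposit.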