Casino 24/7 and on-call practices

1) Goals of 24/7 operations

Business SLOs: login ≥ 99.9%, deposit ≥ 99.85%, bet placement/settlement ≥ 99.9%, p95 WS RTT ≤ 120 ms.

Incident targets: MTTD ≤ 1 min (via synthetic checks), MTTR ≤ 15-30 min for payment flows.

Quality of support: fewer than 3% of tickets carry over to the next day without a response, support CSAT ≥ 90%.
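
To make the availability targets above concrete, the sketch below converts an SLO into an error budget, that is, the minutes of full unavailability it tolerates. The 30-day window and the function name are assumptions for illustration, not part of the original targets.

```python
# SLO targets taken from the list above; the 30-day window is an assumption.
SLO_TARGETS = {"login": 0.999, "deposit": 0.9985}

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of complete unavailability the SLO allows within the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

for name, slo in SLO_TARGETS.items():
    print(f"{name}: {error_budget_minutes(slo):.1f} min per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes per 30 days, so a single SEV-1 that takes 30 minutes to stabilize consumes most of the budget.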

2) On-call organization: models and schedules

Models

Follow-the-sun: 3 geo-teams (Europe/Americas/APAC), minimal night work.

Night rotation within a region: one week of night shifts per person every N weeks (with compensation/time off).

Cell-based: on-call per product cell (brands/markets) plus a shared L1.

Roles on a shift

L1 On-call (Incident Commander by default) - acknowledges alerts, coordinates, stays in touch with support.

L2 Domain engineers - payments, game-gateway/WS, database/wallet, platform SRE.

Comms officer - status page, partners/providers, internal updates.

Duty Manager - business escalation, prioritization, exceptions (VIP/regulator).

Shift template (12 × 7 or 8 × 5 + shifts)

Shift length: 8/10/12 hours. Each shift change includes a 15-30 min "warm handover."

Follow the rule of at most 2 consecutive nights and no more than 7 on-call days in any 14-day window.

Each shift has a roster: primary on-call, backup, duty manager, L2 contacts.
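
The two scheduling rules above can be checked automatically before a roster is published. The following is a minimal sketch, assuming each engineer's schedule is available as sets of `datetime.date` values; the function name and input shape are hypothetical.

```python
from datetime import date, timedelta

def roster_violations(on_call_days, night_days,
                      max_consecutive_nights=2, max_days=7, window_days=14):
    """Return human-readable rule violations for one engineer's schedule."""
    violations = []

    # Rule 1: no more than `max_consecutive_nights` night shifts in a row.
    streak, prev = 0, None
    for day in sorted(night_days):
        if prev is not None and day - prev == timedelta(days=1):
            streak += 1
        else:
            streak = 1
        if streak == max_consecutive_nights + 1:  # report each run once
            violations.append(
                f"more than {max_consecutive_nights} consecutive nights (by {day})")
        prev = day

    # Rule 2: at most `max_days` on-call days in any `window_days` window.
    # Checking windows that start on an on-call day is sufficient.
    days = sorted(on_call_days)
    for start in days:
        end = start + timedelta(days=window_days)
        in_window = sum(1 for d in days if start <= d < end)
        if in_window > max_days:
            violations.append(
                f"{in_window} on-call days in the {window_days}-day window starting {start}")
            break
    return violations
```

For example, three night shifts in a row yield one violation, while two nights followed by a day off yield none.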

3) Classification of incidents and SLAs

SEV	Example	Impact	Response SLA	Resolution SLA
SEV-1	Massive deposit failure, login unavailable	Loss of revenue/regulatory risk	≤ 5 min	≤ 30 min to stabilization
SEV-2	High bet latency, game provider lag	Reduced conversion	≤ 10 min	≤ 2 h
SEV-3	Partial failure of promos/reports	Limited impact	≤ 30 min	≤ 8 h
SEV-4	Minor bugs/quality alerts	No immediate impact	As planned	As planned
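
The table above maps naturally onto a lookup that an incident tool can use to flag SLA breaches. This is a sketch only: the `SLA` mapping mirrors the table, SEV-4 is left without timers because it is handled as planned, and the function name is hypothetical.

```python
from datetime import timedelta

# (response limit, resolution/stabilization limit) per severity, from the table.
SLA = {
    "SEV-1": (timedelta(minutes=5), timedelta(minutes=30)),
    "SEV-2": (timedelta(minutes=10), timedelta(hours=2)),
    "SEV-3": (timedelta(minutes=30), timedelta(hours=8)),
}

def sla_check(sev, time_to_ack, time_to_stabilize):
    """Compare measured durations against the SLA limits for a severity."""
    limits = SLA.get(sev)
    if limits is None:  # e.g. SEV-4: no timers apply
        return {"response_ok": True, "resolution_ok": True}
    response_limit, resolution_limit = limits
    return {
        "response_ok": time_to_ack <= response_limit,
        "resolution_ok": time_to_stabilize <= resolution_limit,
    }
```

A SEV-1 acknowledged in 4 minutes but stabilized in 45 would pass the response check and fail the resolution check.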

4) Alerting without noise

Principles: symptom-based SLO alerts first → causal resource alerts → context.

Symptoms: `login_success_ratio↓`, `deposit_success_by_psp↓`, `ws_rtt_p95↑`, `game_launch_success↓`.

Causes: `db_conn_saturation↑`, `queue_lag↑`, `psp_timeout↑`, `provider_launch_latency↑`.

Noise protection: require ≥ 3 consecutive violations, auto-suppress during releases, deduplicate and group alerts.
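
A minimal sketch of such a noise gate follows, assuming one evaluation per check interval and a manually toggled release flag. The class and attribute names are hypothetical, and grouping is reduced to deduplication by alert key.

```python
class AlertGate:
    """Pages only after N consecutive breaches, once per alert key."""

    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self.streaks = {}                 # key -> current consecutive breach count
        self.firing = set()               # keys already paged (deduplication)
        self.release_in_progress = False  # set True to suppress during a release

    def observe(self, key, breached):
        """Record one evaluation; return True only when a new page should go out."""
        if not breached:
            self.streaks[key] = 0
            self.firing.discard(key)      # recovered: allow a future page
            return False
        self.streaks[key] = self.streaks.get(key, 0) + 1
        if self.release_in_progress:
            return False                  # auto-suppress while a release is rolling out
        if self.streaks[key] >= self.required and key not in self.firing:
            self.firing.add(key)
            return True
        return False
```

With the default settings, five consecutive breaches of the same key produce exactly one page, on the third evaluation.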

Routing: critical alerts go to PagerDuty/Opsgenie; everything else goes to Slack/email.

Alert text: "What/Where/How much/Action." Example:

💡 SEV-2: deposit success DE/PSP-A 97.1% < 99% 10m. Impact: EU. Probable cause: PSP timeout↑. Runbook: `PD-42`.
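
Keeping the "What/Where/How much/Action" pattern uniform is easier when alert text is generated from one template. The helper below is a hypothetical sketch that reproduces the example above; the parameter names are assumptions.

```python
def format_alert(sev, what, where, value, threshold, window,
                 impact, cause, runbook):
    """Render an alert in the What/Where/How much/Action pattern."""
    return (
        f"{sev}: {what} {where} {value:.1f}% < {threshold:g}% {window}. "
        f"Impact: {impact}. Probable cause: {cause}. Runbook: {runbook}."
    )

print(format_alert("SEV-2", "deposit success", "DE/PSP-A", 97.1, 99, "10m",
                   "EU", "PSP timeout↑", "PD-42"))
```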

5) Runbooks and escalations

Runbook mini-template

1. Detection: links to dashboards (SLO, causal), trace, logs.

2. Quick checks: health PSP/providers, DR-region synthetics, DB/cache status.

3. Temporary measures: feature-flags/kill-switch, rate-limits, PSP/provider switching, degradation of heavy features.

4. Escalation: who the L2/L3 contacts are, 24 × 7 provider contacts.

5. Green zone criteria: SLO normal N minutes, queues

6. Comms: status template, affected markets/brands, ETA/next update.

Escalation ladder

T0-5 min: L1 acknowledges, assigns an IC, starts the runbook.

T5-10 min: page the relevant domain L2 + Comms officer.

T10-15 min: Duty Manager/product, legal/compliance if necessary.

External: PSP/game provider - per the agreed procedure (SLA channel, ticket, call).
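
The internal part of the ladder can be encoded as data so that a bot or dispatcher page knows whom to involve at a given moment. This is an illustrative sketch: the role labels and the function name are assumptions, and external providers are left out because they follow their own procedure.

```python
# (minutes since the alert, roles to involve from that point on)
LADDER = [
    (0, ["L1 on-call (IC)"]),
    (5, ["domain L2", "Comms officer"]),
    (10, ["Duty Manager"]),
]

def who_to_page(elapsed_min):
    """Return every role that should be involved after `elapsed_min` minutes."""
    paged = []
    for start, roles in LADDER:
        if elapsed_min >= start:
            paged.extend(roles)
    return paged
```

For instance, `who_to_page(7)` returns the L1 on-call plus the domain L2 and the Comms officer.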

6) Communications and status page

Internal updates every 10-15 minutes for SEV-1/2 (#war-room channel, message template).

Status page: current status, affected markets, interim measures, next update in X min.

Post-incident note for support/affiliates/partners: what happened and how affected players are compensated.

Prepare templates in advance: short, no internal details, no blame.

7) Working with external dependencies (PSP/games/CDN)

Contact directory 24 × 7: PSP A/B, game providers, CDN/WAF, cloud.

SLA monitoring: synthetic checks on deposits and game launches, automatic ticket triggers.

Failover policies: route to PSP-B at `success < 99% for 10 min`, switch game provider at `TTFS > 800 ms`.
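
The PSP failover condition can be sketched as a sliding window over per-minute success ratios. The reading used here, that every one of the last 10 minutes must be below 99%, is an assumption about how the rule is meant to be applied; the class name is hypothetical.

```python
from collections import deque

class PspFailover:
    """Signals failover when success stays below the threshold for a full window."""

    def __init__(self, threshold=0.99, window_minutes=10):
        self.threshold = threshold
        self.samples = deque(maxlen=window_minutes)  # one success ratio per minute

    def record_minute(self, success_ratio):
        self.samples.append(success_ratio)

    def should_fail_over(self):
        # Require a full window so that a single bad minute never triggers a switch.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s < self.threshold for s in self.samples)
```

One healthy minute inside the window resets the decision, which avoids flapping between PSPs on brief dips.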

Inbound webhooks: HMAC signature, idempotency, replay from the queue after provider degradation.
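
A minimal sketch of a webhook inbox with these properties is shown below. It assumes the provider signs the raw body with HMAC-SHA256 and sends a hex digest plus a unique event ID; real providers differ in algorithm and header format, and the in-memory set and list stand in for a persistent store and queue.

```python
import hashlib
import hmac

class WebhookInbox:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen_ids = set()  # assumption: a persistent store in production
        self.queue = []        # events kept for processing or later replay

    def accept(self, event_id: str, body: bytes, signature_hex: str) -> str:
        expected = hmac.new(self.secret, body, hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, signature_hex):
            return "rejected"
        if event_id in self.seen_ids:
            return "duplicate"  # idempotency: a redelivered event is not queued twice
        self.seen_ids.add(event_id)
        self.queue.append((event_id, body))
        return "queued"
```

Because accepted events stay in the queue, they can be replayed in order once a degraded provider recovers, and redeliveries from the provider are ignored.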

8) GameDay and drills

Weekly tabletop exercises (30-45 minutes): reading graphs, making decisions.

Monthly technical DR drills (60-90 min): PSP failure, provider lag, loss of the database or WS cluster.

Exercise KPIs: time to identify the cause, quality of communications, correctness of feature-flag decisions.

9) Handover and documentation

Warm handover checklist (15-20 min):

Current risks (growing lag, PSP limits, hot releases).
Open tickets/escalations.
Temporary feature flags/limits and when to remove them.
Summary of shift incidents (SEV/time/actions/residual risks).
Documentation: a living knowledge base of runbooks, contacts, diagrams, and money/game flow maps.

10) On-call health and sustainability

Rule 8/8/8: work/sleep/personal. Night shifts → time off.

Buddy system for newcomers, shadow on-call for 2-3 weeks.

Psychological safety: "blameless" retro, support for serious incidents.

Load audit: target of ≤ 2 wake-ups per night on average per engineer; anything above that → rework the alerts/architecture.
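
The load audit reduces to a simple average per engineer. The sketch below assumes night-time page counts are already exported per engineer and per night; the input shape and function name are hypothetical.

```python
from statistics import mean

def over_wakeup_target(night_pages, target=2.0):
    """night_pages: {engineer: [pages on night 1, pages on night 2, ...]}.

    Returns the average for every engineer who exceeds the target.
    """
    return {
        engineer: mean(counts)
        for engineer, counts in night_pages.items()
        if counts and mean(counts) > target
    }
```

Engineers who appear in the result are the signal to review the alerts that woke them up.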

11) Operational Performance Metrics

MTTD/MTTR by domain (login/deposit/WS/games).

Alert quality: % noisy or closed without action, average number of alerts per shift.

Change failure rate: % of incidents caused by releases; mean time between failures.

Toil: share of repeatable manual tasks → automation plan.

Provider impact: share of SEV-2/1 due to external partners (argument for SLA/migration).
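
As an illustration of the first metric, MTTD and MTTR by domain can be computed from incident timestamps. The record fields below are hypothetical, and measuring MTTR from the start of the incident (rather than from detection) is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def mttd_mttr_by_domain(incidents):
    """incidents: dicts with 'domain', 'started', 'detected', 'resolved' (datetime)."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc["domain"]].append(inc)

    report = {}
    for domain, items in grouped.items():
        mttd = mean((i["detected"] - i["started"]).total_seconds() for i in items) / 60
        mttr = mean((i["resolved"] - i["started"]).total_seconds() for i in items) / 60
        report[domain] = {"mttd_min": mttd, "mttr_min": mttr}
    return report
```

The per-domain result can be compared directly with the MTTD and MTTR targets set in section 1.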

12) On-call tools and dashboards

"Red" SLO dashboard: login/deposit/bets/game launches, 5xx/429, p95, regions.

Causal panels: DB/queues/cache, PSP/providers, CDN/WAF.

On-call dispatcher: active incidents, update timers, one-click links to runbooks and feature flags.

Timeline - who did what, when, with reference to SLO.

13) Typical scenarios and quick fixes

A. Deposits drop in DE on PSP-A

Actions: canary-route 50% of traffic to PSP-B; raise the webhook timeout; enable the JS challenge in the WAF against bots.

Comms: status page entry "Degraded DE deposits via PSP-A".

Exit criteria: success ≥ 99% for 15 min, retry queue
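
The 50% canary route in scenario A can be sketched as a deterministic hash split, so that a given player consistently lands on the same PSP while the canary is active. Splitting by player ID and the function name are assumptions; a real router may split per transaction or per session instead.

```python
import hashlib

def pick_psp(player_id: str, canary_percent: int = 50,
             primary: str = "PSP-A", canary: str = "PSP-B") -> str:
    """Route a stable share of players to the canary PSP."""
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0..99
    return canary if bucket < canary_percent else primary
```

Raising `canary_percent` to 100 completes the switch to PSP-B, and setting it to 0 rolls traffic back without touching per-player state.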

B. p95 WS RTT rising in APAC live games

Actions: increase the number of WS gateway replicas, turn on the warm pool of nodes; rate-limit broadcast messages; open an RTT ticket with the provider.

Exit criteria: p95 WS RTT ≤ 120 ms for 20 min.

C. Game provider lag (TTFS > 1.2 s)

Actions: switch the lobby to alternative tables/studios, enable the metadata cache; update the status page.

Exit criteria: TTFS < 800 ms, fewer complaints.

14) 24/7 Readiness Checklist

Rotations and shifts are approved, with a backup ("secondary") on each shift.
SLO alerts + causal alerts, noise protection, uniform message templates.
Complete runbooks with "fast levers" (feature flags, PSP/provider switching, limits).
24 × 7 contacts for external partners, test call once a quarter.
Status page and external update templates.
GameDay/DR exercises on schedule, blameless retrospectives.
On-call tools: dashboards, timeline, decision log.
Compensation/time-off policy, night wake-up limit, health support.
Post-incident process: RCA within 48 hours, remediation tasks with owners and deadlines.

15) Post-mortem template (blameless)

1. In brief: what happened and when, which SEV, impact and scope.

2. Timeline: detection → escalation → action → stabilization.

3. Root causes: technology/processes/people/suppliers (5 Whys).

4. What worked/what didn't: alerts, runbooks, communications.

5. Action items: technical, process, partner - responsible and deadlines.

6. Prevention: tests/monitoring/drills, SLO/alert changes.

Summary

Successful 24/7 casino operations rest on SLO discipline, well-designed low-noise alerting, clear runbooks and escalations, regular exercises, and respect for the people on call. Link SLO panels to fast levers (feature flags, PSP/provider switching, degradation of heavy features), maintain communications with players and partners, and measure efficiency (MTTD/MTTR/alert quality). The platform then stays stable around the clock, and the team stays productive and resilient.