SLA between operator and providers: metrics and penalties
1) Why SLAs are needed and how to manage them
An SLA records the expected quality of service (SLO targets, support windows), how it is measured, and what happens when it is violated (service credits/penalties, escalations, exit options). For iGaming this is critical: real-time money, regulators, traffic peaks and multilayer dependencies (games → wallet → PSP → KYC → CDN/WAF).
Principles:
- Measurability and unambiguity (who measures what, and where).
- Proximity to business (metrics for login/deposits/game launch, not just CPU).
- Economic incentive (service credits are tied to damage).
- Management (quality committee, monthly QBR, PoP reports).
2) Set of metrics by domain
2.1 Payment Providers (PSP)
Deposit Success Ratio (DSR): number of successful deposits / all attempts, by country/method/BIN. Target ≥ 99.0%.
Authorization/Settlement Latency p95: target ≤ 400-600 ms.
Webhook Delivery Delay p95: target ≤ 60 s (T + 60 s).
Availability (API/Callbacks): ≥ 99.9% per month (excluding agreed windows).
2.2 Gaming providers/aggregators
TTFS (Time-to-First-Spin) p95: ≤ 800 ms (from lobby to first spin).
Game Launch Success: ≥ 99.5%.
Round Result Callback Success: ≥ 99.9%, p95 delay ≤ 5 s.
Content Availability: ≥ 99.95% of the catalog (share of available games).
2.3 KYC/AML providers
Verification API Availability: ≥ 99.9%.
Median Time-to-Decision: ≤ 60 s (auto), ≤ 15 min (manual queue).
False Negative/Positive Boundaries: target corridors by market (based on an agreed sample).
2.4 Edge/CDN/WAF
TTFB p95: ≤ 200 ms (regional).
Cache Hit Ratio: ≥ 85% of static assets.
Bot-challenge pass-through: false positives (FP) ≤ 0.5% on login/deposit.
2.5 Hosting/Cloud/Network
Availability (region/zone): ≥ 99.95% (zone), RTO ≤ 30 min, RPO ≤ 5 min for wallet.
Ingress/Load Balancer Latency p95: ≤ 100 ms within the region.
3) Formulas and measurement
General measurement rules
Calculation time zone: Europe/Kyiv. The reporting month is the calendar month.
Telemetry timestamps are recorded in UTC and converted to Europe/Kyiv time for reports.
Time synchronization: NTP; error ≤ 100 ms.
Source of truth: operator synthetics + server logs + provider data. If the sources diverge, the worse value applies unless proven otherwise.
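The divergence rule can be sketched roughly as follows. This is a minimal illustration, assuming both sides report the same metric for the same period; the function name `reconcile` and the `higher_is_better` flag are hypothetical, not part of any contract wording.

```python
def reconcile(operator_value: float, provider_value: float,
              higher_is_better: bool) -> float:
    """Return the worse of two measurements of the same metric.

    For metrics where higher is better (availability, success ratio)
    the worse value is the minimum; for latency-like metrics it is
    the maximum.
    """
    if higher_is_better:
        return min(operator_value, provider_value)
    return max(operator_value, provider_value)


# Availability in %: operator synthetics saw 99.82, provider reports 99.95.
print(reconcile(99.82, 99.95, higher_is_better=True))   # 99.82
# Latency p95 in ms: operator measured 540, provider reports 410.
print(reconcile(540, 410, higher_is_better=False))      # 540
```

The "unless proven otherwise" part stays a manual process: a documented dispute would replace the reconciled value, which this sketch does not model.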
Examples of formulas
Availability = 1 - (Σ Downtime_min) / (Total_min_in_period)
Downtime_min = minutes with ≥ X% errors/timeouts and/or complete unavailability.
The threshold X is fixed in the contract (for example, error_rate ≥ 5% or p95_latency ≥ 2 × SLO).
Deposit Success Ratio = success_count / (success_count + failure_count)
Latency p95 = histogram_quantile(0.95, rate(latency_bucket[5m]))
TTFS p95 = p95(time(game_open → first_spin_callback))
Webhook Delay p95 = p95(webhook_received - event_time)
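The formulas above can be sketched in plain Python. This is an illustration under stated assumptions: the input is one error-rate sample per minute, the 5% threshold is the example value from the text, and `p95` uses the nearest-rank method, whereas PromQL's `histogram_quantile` interpolates within histogram buckets, so the two will not match exactly. All function names are hypothetical.

```python
import math


def availability(error_rates_per_minute, threshold=0.05):
    """Availability = 1 - downtime_minutes / total_minutes.

    A minute counts as downtime when its error rate is >= threshold.
    """
    total = len(error_rates_per_minute)
    downtime = sum(1 for rate in error_rates_per_minute if rate >= threshold)
    return 1 - downtime / total


def deposit_success_ratio(success_count, failure_count):
    """DSR = success_count / (success_count + failure_count)."""
    return success_count / (success_count + failure_count)


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


minutes = [0.0] * 997 + [0.2] * 3          # 3 bad minutes out of 1000
print(f"{availability(minutes):.3%}")      # 99.700%
print(deposit_success_ratio(9880, 120))    # 0.988
print(p95(list(range(1, 101))))            # 95
```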
Planned Maintenance Windows
Windows are agreed at least 7 days in advance, no more than once per month and for up to 60 minutes, and are excluded from the SLA calculation. Emergency (security) windows require 24 hours' notice.
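A minimal sketch of how the window rules might be enforced in the SLA calculation, assuming downtime is tracked as per-minute UTC timestamps. The function names are hypothetical, and the once-per-month limit is not checked here.

```python
from datetime import datetime, timedelta, timezone


def is_valid_planned_window(announced_at, start, end):
    """Check the 7-day notice and 60-minute duration limits."""
    notice_ok = start - announced_at >= timedelta(days=7)
    duration_ok = timedelta(0) < end - start <= timedelta(minutes=60)
    return notice_ok and duration_ok


def countable_downtime(downtime_minutes, windows):
    """Drop downtime minutes that fall inside an agreed window.

    downtime_minutes: iterable of minute timestamps (datetime, UTC).
    windows: list of (start, end) tuples; start inclusive, end exclusive.
    """
    return [
        minute for minute in downtime_minutes
        if not any(start <= minute < end for start, end in windows)
    ]


utc = timezone.utc
announced = datetime(2025, 9, 1, 9, 0, tzinfo=utc)
start = datetime(2025, 9, 10, 2, 0, tzinfo=utc)
end = start + timedelta(minutes=45)

print(is_valid_planned_window(announced, start, end))   # True

downtime = [start + timedelta(minutes=10), end + timedelta(minutes=30)]
print(len(countable_downtime(downtime, [(start, end)])))  # 1
```

Only the minute outside the window remains countable; a window that fails validation would simply not be passed to `countable_downtime`.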
4) Classification of incidents and reactions
Communications: status page/channel, post-mortem ≤ 5 working days.
5) Service credits and penalties
5.1 Credit schedule (example)
Monthly Availability: 99.9%–99.5% → credit of 5% of the provider's monthly fee/commission.
99.5%–99.0% → 10%.
PSP DSR violation: every full 0.5 pp below 99.0% → credit 2%, cap 20%.
Webhook Delay p95 > SLO × 2 for more than 60 min in total → 5%.
TTFS p95 > 800 ms for more than 120 min → 5%.
Chronic failure: 3 months in a row with credits ≥ 10% → the right to early termination without penalty + migration assistance (fixed price/hour limit).
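The example schedule can be expressed as a few small functions. This is a sketch under assumptions: values exactly on a band boundary (99.9%, 99.5%) are assigned to the better band, and the band below 99.0%, which the example does not define, reuses 10%. All names are hypothetical.

```python
def availability_credit(availability_pct):
    """Credit % for monthly availability, per the example bands."""
    if availability_pct >= 99.9:
        return 0
    if availability_pct >= 99.5:
        return 5
    # Below 99.0% is not defined in the example; 10% is assumed here.
    return 10


def dsr_credit(dsr_pct, target=99.0, step_pp=0.5, per_step=2, cap=20):
    """2% per full 0.5 pp below the DSR target, capped at 20%."""
    shortfall = round(target - dsr_pct, 9)  # rounding absorbs float noise
    if shortfall <= 0:
        return 0
    full_steps = int(shortfall // step_pp)
    return min(full_steps * per_step, cap)


def duration_credit(breach_minutes, allowed_minutes, credit=5):
    """Flat credit once a latency breach lasts longer than allowed."""
    return credit if breach_minutes > allowed_minutes else 0


def is_chronic_failure(monthly_credit_pcts, threshold=10, months=3):
    """True if credits reached the threshold `months` times in a row."""
    streak = 0
    for pct in monthly_credit_pcts:
        streak = streak + 1 if pct >= threshold else 0
        if streak >= months:
            return True
    return False


print(availability_credit(99.62))            # 5
print(dsr_credit(98.8))                      # 0 (no full 0.5 pp step)
print(dsr_credit(97.9))                      # 4 (two full steps)
print(duration_credit(75, 60))               # 5
print(is_chronic_failure([10, 12, 15]))      # True
```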
5.2 Economic logic
Credits are applied as a net offset (they reduce the provider's invoices).
Under RevShare, credits are calculated from the provider's fee (its share), not from GGR/NGR as a whole.
Monthly cap on credits: usually 100% of the monthly fee, except for fraud/data incidents.
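As a rough sketch of the offset logic, assuming the credit percentages for the month are simply summed and then capped: the base is the provider's fee (under RevShare, the provider's share rather than total GGR/NGR). Function names are hypothetical, and the fraud/data exception to the cap is not modeled.

```python
def credit_amount(provider_fee, credit_pcts, cap_pct=100):
    """Sum the month's credit percentages, cap them, apply to the fee."""
    total_pct = min(sum(credit_pcts), cap_pct)
    return provider_fee * total_pct / 100


def net_invoice(provider_fee, credit_pcts, cap_pct=100):
    """Provider invoice after the credits are offset against it."""
    return provider_fee - credit_amount(provider_fee, credit_pcts, cap_pct)


# Provider's monthly fee (its RevShare portion) is 20,000.
print(credit_amount(20_000, [5, 4]))    # 1800.0
print(net_invoice(20_000, [5, 4]))      # 18200.0
print(net_invoice(20_000, [60, 50]))    # 0.0 (capped at 100% of the fee)
```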
5.3 Earn-back (option)
The provider can "earn" part of the credit back if it reaches an enhanced SLO the following month (for example, Availability ≥ 99.99% for the whole month).
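A possible earn-back calculation, sketched under assumptions: the text does not say what share of the credit can be earned back, so `earn_back_share=0.5` is a purely hypothetical parameter to be replaced by the contract value, as is the function name.

```python
def earn_back(credit_amount, next_month_availability_pct,
              enhanced_slo_pct=99.99, earn_back_share=0.5):
    """Amount returned to the provider if the enhanced SLO was met.

    earn_back_share is an assumed value, not taken from the text.
    """
    if next_month_availability_pct >= enhanced_slo_pct:
        return credit_amount * earn_back_share
    return 0.0


print(earn_back(1800, 99.995))   # 900.0
print(earn_back(1800, 99.95))    # 0.0
```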
6) KPI weighting model (for quarterly bonuses/malus)
QuarterScore = Σ (Weight × Points / 5) → bonus/malus of ± X% to the rate.
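The score formula can be computed as below. The KPI names, weights and points in the example are illustrative assumptions only; the sketch also assumes weights sum to 1 and points are on a 0 to 5 scale, so the score lands between 0 and 1. How the score maps to a bonus or malus percentage is left to the contract.

```python
import math


def quarter_score(weights, points):
    """QuarterScore = sum(weight * points / 5) over all KPIs.

    weights: KPI name -> weight (expected to sum to 1).
    points:  KPI name -> points on a 0..5 scale.
    """
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[kpi] * points[kpi] / 5 for kpi in weights)


# Hypothetical scorecard for one provider.
weights = {"availability": 0.4, "dsr": 0.3, "latency": 0.2, "support": 0.1}
points = {"availability": 5, "dsr": 4, "latency": 3, "support": 5}

print(round(quarter_score(weights, points), 2))   # 0.86
```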
7) Example summary report (CSV template)
Provider,Month,Availability,DSR,TTFS_p95_ms,Webhook_p95_s,Credits%
PSP-A,2025-09,99.62%,98.8%,--,45,12
Games-X,2025-09,99.97%,--,780,3,0
KYC-Z,2025-09,99.91%,--,--,--,0
CDN-W,2025-09,99.99%,--,120,--,0
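A report in this shape can be loaded with the standard `csv` module. The sketch below assumes `--` means "not applicable" and that percentages carry a trailing `%`; the helper names are hypothetical.

```python
import csv
import io

REPORT = """\
Provider,Month,Availability,DSR,TTFS_p95_ms,Webhook_p95_s,Credits%
PSP-A,2025-09,99.62%,98.8%,--,45,12
Games-X,2025-09,99.97%,--,780,3,0
KYC-Z,2025-09,99.91%,--,--,--,0
CDN-W,2025-09,99.99%,--,120,--,0
"""

NUMERIC_COLUMNS = ("Availability", "DSR", "TTFS_p95_ms",
                   "Webhook_p95_s", "Credits%")


def parse_value(raw):
    """Convert '--' to None and '99.62%' or '45' to a float."""
    raw = raw.strip()
    if raw == "--":
        return None
    return float(raw.rstrip("%"))


def load_report(text):
    """Parse the CSV text into a list of dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {"Provider": row["Provider"], "Month": row["Month"]}
        for column in NUMERIC_COLUMNS:
            parsed[column] = parse_value(row[column])
        rows.append(parsed)
    return rows


rows = load_report(REPORT)
below_target = [r["Provider"] for r in rows if r["Availability"] < 99.9]
print(below_target)   # ['PSP-A']
```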
8) Exclusion rules and force majeure
Exclusions: third-party incidents outside the provider's perimeter, if provable and documented, and provided that correct failover routes were in place.
Force majeure: only events from the standard list (natural disasters/war/regulatory blocking), with timely communication and attempts to mitigate damage (DR).
Shared fault: credits are split in proportion to each party's confirmed contribution.
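The proportional split can be sketched as follows, assuming each party's contribution has already been agreed as a numeric weight (for example, minutes of downtime attributed to it). The function name and the example figures are hypothetical.

```python
def split_credit(total_credit, contributions):
    """Split a credit in proportion to each party's confirmed contribution.

    contributions: party name -> non-negative weight.
    """
    total_weight = sum(contributions.values())
    if total_weight <= 0:
        raise ValueError("at least one contribution must be positive")
    return {
        party: total_credit * weight / total_weight
        for party, weight in contributions.items()
    }


# 45 of 60 downtime minutes attributed to the provider, 15 to the operator.
print(split_credit(1000, {"provider": 45, "operator": 15}))
# {'provider': 750.0, 'operator': 250.0}
```

In this example only the provider's 750 would be credited; the operator's portion is the part of the damage it bears itself.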
9) Quality check and audit
Operator access to metrics/logs/traces (read-only).
Quarterly security scan and vulnerability remediation report.
DR exercise: once per quarter, with a report on achieved RTO/RPO.
Reconciliation of PSP/game reports with a discrepancy ≤ 0.5%.
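A reconciliation check might look like the sketch below. It assumes the discrepancy is measured relative to the operator's own total, which the text does not specify; the function names are hypothetical.

```python
def discrepancy_pct(operator_total, provider_total):
    """Absolute difference as a percentage of the operator's total."""
    return abs(operator_total - provider_total) * 100 / operator_total


def within_tolerance(operator_total, provider_total, tolerance_pct=0.5):
    """True when the two reports agree within the allowed discrepancy."""
    return discrepancy_pct(operator_total, provider_total) <= tolerance_pct


print(discrepancy_pct(100_000, 99_600))     # 0.4
print(within_tolerance(100_000, 99_600))    # True
print(within_tolerance(100_000, 99_000))    # False (1.0% discrepancy)
```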
10) Escalation and Management
Contact list 24/7 (L1/L2, partner manager).
War room for SEV-1 incidents.
QBR: quarterly review of KPIs, credits/earn-backs, roadmap.
Corrective Action Plan (CAP) with dates and owners.
11) Clause templates (fragments)
SLO and measurement
Service credits
Chronic failure & Termination
Data and webhooks
Scheduled windows
12) Frequent traps and how to avoid them
Blurred definitions of "unavailability" → fix error/latency thresholds.
Ignoring geography → set targets by region, not as a global average.
No SLO on data delivery → add SLAs for webhooks/exports, otherwise the reports arrive late.
Penalties without a cap/earn-back → make them predictable and fair.
No DR requirements → fix RTO/RPO and drill frequency in the contract.
13) SLA Implementation Checklist (prod-ready)
- KPIs are finalized by domain: PSP, games, KYC, CDN/WAF, cloud.
- Measurement sources and formulas are described; time zone and windows confirmed.
- Maintenance windows and notification procedure are agreed.
- Service credit table, cap and chronic-failure clause are defined.
- SEV escalation procedures, war room, post-mortem ≤ 5 working days.
- Telemetry access (metrics/logs/traces) issued, connectivity test passed.
- DR requirements (RTO/RPO) and exercise schedule are fixed.
- QBR rhythm, scorecard and annual goals are aligned.
- Legal exceptions/force majeure are clearly described.
- Test report for the pilot month with calculation of credits.
Summary
Working SLAs consist of clear business metrics, transparent measurement rules, a well-thought-out credit schedule and active quality management (QBR, CAP, exercises). Fix KPIs by domain (PSP, games, KYC, edge/cloud), agree on the sources of truth and exclusions, and introduce a weighting model and earn-back. Your relationship with providers will then become predictable, and the risk to players' money and UX will decrease significantly.