Docker and Kubernetes in iGaming: Deployment Strategies
1) iGaming Context: Platform Requirements
Real-time flows (live games, bets, tournament events) → strict p95 latency targets for API/WS.
Traffic peaks (streams/promos) → fast autoscaling without cold starts.
Money and compliance → isolation of money-handling environments, release traceability, access control and auditing.
Multi-jurisdiction/multi-brand → tenants (namespaces/projects), network and resource isolation policies.
Key SLOs: login ≥ 99.9%, deposit ≥ 99.85%, p95 API ≤ 250-400 ms, p95 WS RTT ≤ 120 ms.
2) Basic architecture on Kubernetes
Layers: Ingress/Edge → API/gateways → services (wallet, profile, promo, anti-fraud) → queues/streams → storage.
Isolation: `namespace` per brand/market or a "cell" per region; dedicated NodePools (public API / batch / ws-realtime).
Network policies: `NetworkPolicy` on a "deny by default" basis (sketch below); separate egress policies to PSP/KYC/game providers.
Storage: `StorageClass` with replication within a zone/region; operators for databases/caches (Postgres/MySQL, Redis, Kafka).
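A minimal sketch of the "deny by default" baseline, assuming a per-brand namespace named `brand-a` (the name is illustrative):

```yaml
# Deny all ingress and egress for every pod in the namespace;
# traffic is then re-allowed selectively by additional policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: brand-a        # illustrative tenant namespace
spec:
  podSelector: {}           # empty selector = all pods in the namespace
  policyTypes: ["Ingress", "Egress"]
```

Note that a full egress deny also blocks DNS, so an allow rule for kube-dns usually ships alongside it.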
3) Container images: quality and safety
Multi-arch images (amd64/arm64), distroless or slim bases, only the necessary binaries.
SBOM and vulnerability scanning, image signing (Cosign), admission policy (`ImagePolicyWebhook`).
Immutable tagging: release by `sha256` digest; `latest` is prohibited.
Runtime profiles: `readOnlyRootFilesystem`, `runAsNonRoot`, seccomp/AppArmor, minimal capabilities.
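A hedged sketch of that runtime profile as a pod spec (pod and image names are illustrative):

```yaml
# Hardened runtime profile: non-root, read-only root filesystem,
# default seccomp profile, all Linux capabilities dropped.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app            # illustrative
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile: {type: RuntimeDefault}
  containers:
    - name: app
      image: registry/app@sha256:...   # immutable digest, never "latest"
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities: {drop: ["ALL"]}
```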
4) Release strategies: when and what to choose
RollingUpdate (default)
Zero downtime; the right default for most APIs.
Controlled via readiness/liveness/startup probes and maxUnavailable/maxSurge.
Blue-Green
Parallel Blue and Green stacks; traffic switches at the Ingress/Service level.
Good for large schema/config changes; fast rollback.
Canary
Gradually shift a percentage of traffic (5→10→25→50→100%).
Trigger SLO gates on p95, error rate, and anomalies in deposits/bets.
Options: Service Mesh (Istio/Linkerd), or an Ingress controller with canary annotations (sketch below).
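For the Ingress-controller option, a minimal sketch using NGINX Ingress canary annotations (host and service names are illustrative):

```yaml
# Canary Ingress: sends 10% of traffic to the v2 Service.
# Raise canary-weight stepwise (5→10→25→50→100) while SLO gates stay green.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com          # illustrative
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: {name: payments-api-v2, port: {number: 80}}
```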
A/B and Shadow
Shadow: mirror part of the traffic to the new release without returning its responses to users (pure telemetry).
A/B: functional experiments with feature flags and segmentation by players/markets.
5) GitOps and Configuration Management
GitOps (Argo CD/Flux): clusters read the desired state from Git; all changes through PR and review.
Templates: Helm/Kustomize, a single chart library.
Secrets: external managers (Vault/Cloud SM), `ExternalSecrets`/`Secrets Store CSI`; KMS keys and rotation.
Pipeline (simplified):
1. CI builds and signs the image → pushes it to the registry.
2. A PR changes the image version/config → GitOps applies it.
3. Canary rollout with SLO gates → automatic promotion or auto-rollback.
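A minimal Argo CD `Application` sketch for step 2; the repository URL and paths are illustrative:

```yaml
# Argo CD watches this Git path and keeps the cluster in sync with it;
# a merged PR that bumps the image digest is applied automatically.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git   # illustrative
    targetRevision: main
    path: apps/payments-api
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated: {prune: true, selfHeal: true}
```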
6) Autoscaling for peaks and WS load
HPA by application metrics (RPS, p95 latency, queue lag), not just CPU/RAM.
KEDA for event-driven scaling (Kafka, RabbitMQ, Redis, HTTP queues); see the sketch at the end of this section.
VPA for continuous right-sizing of requests/limits.
Cluster Autoscaler + warm node pools (pre-provisioned) for the duration of promos/tournaments.
WebSocket specifics: dedicated NodePools (higher limits on open sockets/file descriptors), `PodDisruptionBudget` for graceful updates, sticky routing (session affinity) via Ingress/Mesh.
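A hedged KEDA `ScaledObject` sketch for the event-driven case; broker address, topic, and threshold are illustrative:

```yaml
# Scales a settlement worker on Kafka consumer-group lag
# instead of CPU, so queues drain before players notice.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: settlement-worker
spec:
  scaleTargetRef:
    name: settlement-worker          # target Deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092 # illustrative
        consumerGroup: settlement
        topic: bets
        lagThreshold: "500"          # desired max lag per replica
```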
7) Stateful workloads: wallet, databases, queues
Operators (Postgres/MySQL, Redis Sentinel/Cluster, Kafka Operator): declarative replication, `PITR`, automatic backups.
RPO/RTO policy: synchronous replication within a zone, asynchronous to DR regions.
Idempotency/outbox for deposits/payouts; inbox pattern for webhooks from PSPs and game providers.
`StorageClass` with fast IOPS; for the wallet, a dedicated class and nodes with local SSDs (plus replication); see the sketch below.
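A sketch of a dedicated high-IOPS class for the wallet tier, assuming the AWS EBS CSI driver (the provisioner and parameters differ per cloud):

```yaml
# Provisioned-IOPS volumes reserved for wallet databases.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wallet-io2
provisioner: ebs.csi.aws.com         # cloud-specific
parameters:
  type: io2                          # provisioned-IOPS volume type
  iops: "16000"                      # illustrative IOPS budget
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```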
8) Network layer and gateways
Ingress (Nginx/Envoy/HAProxy/ALB) with mTLS to backends, HTTP/2/3, HSTS, rate limits.
Service Mesh: canary routes, retries/timeouts, circuit breakers, TLS inside the cluster by default (see the sketch below).
Egress gateways: whitelisting to PSP/KYC/providers, DNS and IP control.
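A hedged Istio `VirtualService` sketch for the retries/timeouts discipline; the host and values are illustrative:

```yaml
# Hard timeout plus bounded retries toward a PSP-facing service,
# so a degraded provider cannot accumulate in-flight requests.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: psp-gateway
spec:
  hosts: ["psp-gateway.payments.svc.cluster.local"]  # illustrative
  http:
    - route:
        - destination:
            host: psp-gateway.payments.svc.cluster.local
      timeout: 2s
      retries:
        attempts: 2
        perTryTimeout: 500ms
        retryOn: "5xx,connect-failure,reset"
```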
9) Observability and SLO release gates
OpenTelemetry: traces across frontend → API → payment/game provider; capture 100% of errors and "slow" spans.
RED/USE metrics + business SLIs (deposit/bet/withdrawal success).
JSON logs with `trace_id`; WORM storage for audit.
Release gates: promote only if SLOs stay green on the test traffic share; see the alert-rule sketch below.
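One way to express such a gate, assuming the Prometheus Operator's `PrometheusRule` CRD; the metric name and threshold are illustrative (the 0.15% budget mirrors the deposit SLO above):

```yaml
# Fires when the canary's 5xx share burns the deposit error budget;
# an active alert is treated as a failed gate → auto-rollback.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo-gates
spec:
  groups:
    - name: canary-gates
      rules:
        - alert: DepositErrorBudgetBurn
          expr: |
            sum(rate(http_requests_total{app="payments-api",version="v2",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="payments-api",version="v2"}[5m])) > 0.0015
          for: 5m
          labels: {severity: page}
```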
10) Security: from supply chain to runtime
Policy as Code: OPA/Gatekeeper/Kyverno (forbid privileged containers, require `runAsNonRoot`, enforce resource limits, verify image signatures at admission); see the sketch below.
Secrets and keys: only from a Secret Manager; minimize `envFrom`; prefer sidecar-injected secrets.
Provider webhooks: HMAC signatures, idempotency, a dedicated egress gateway.
Compliance: audit of releases, artifacts and access (RBAC/MFA), geo-isolated storage of CCP artifacts/logs.
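A minimal Kyverno sketch for the `runAsNonRoot` requirement (a real policy would also validate container-level security contexts):

```yaml
# Rejects Pods that do not declare runAsNonRoot: true.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-run-as-non-root
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Pods must set securityContext.runAsNonRoot: true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```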
11) Multi-region, failover and DR
Active-standby across regions (at minimum for wallet/login/payments).
Traffic routing: GSLB/Anycast; health checks driven by SLIs (login/deposit/bet).
Disaster cutover: a DR "button" (freeze writes → promote the DB → warm up caches → shift traffic in phases).
Drills: quarterly GameDays simulating the failure of a PSP, a zone, or a game provider.
12) Configuration and feature management
Feature flags (configs in a ConfigMap/external config service) to switch off heavy features during an incident.
Versioned configs (hashes, checksum annotations on Pods), canary config rollout; see the Helm sketch below.
Runtime overrides at the Mesh/Ingress level (timeouts, retry policies) without rebuilding images.
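A common Helm idiom for the checksum annotation: hashing the rendered ConfigMap into a pod-template annotation so any config change rolls the pods. A sketch, assuming the chart keeps the ConfigMap in `configmap.yaml`:

```yaml
# Deployment template fragment: the annotation value changes whenever
# the ConfigMap changes, which triggers a rolling restart of the pods.
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```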
13) Economics and performance
NodePools by assignment: RPS-API, WS-realtime, batch/ETL.
Spot/Preemptible nodes for batch/ETL with `PodPriority` and `PodDisruptionBudget` (see the sketch after this list).
Pre-compilation and warm-up (JIT/template caches) to reduce cold starts.
Resource budgets: requests/limits, VPA recommendations, connection limits to database/PSP, connection pooling.
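A sketch of pinning batch/ETL pods to a spot NodePool; the taint key, node label, and PriorityClass name are illustrative and assumed to be pre-created:

```yaml
# Low-priority, preemptible batch pod scheduled onto tainted spot nodes.
apiVersion: v1
kind: Pod
metadata:
  name: etl-job                      # illustrative
spec:
  priorityClassName: batch-low       # assumed low-value, preemptible PriorityClass
  nodeSelector:
    pool: spot-batch                 # label on the spot NodePool
  tolerations:
    - key: "spot"                    # taint applied to spot nodes
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: etl
      image: registry/etl@sha256:...
```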
14) Manifest templates (fragments)
Deployment (the canary traffic split itself is applied via Ingress annotations, as in section 4):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate: {maxSurge: 2, maxUnavailable: 1}
  selector:
    matchLabels: {app: payments-api}   # required; must match template labels
  template:
    metadata:
      labels: {app: payments-api, version: v2}
    spec:
      securityContext: {runAsNonRoot: true}
      containers:
        - name: app
          image: registry/payments@sha256:...
          ports: [{containerPort: 8080}]
          resources:
            requests: {cpu: "300m", memory: "512Mi"}
            limits: {cpu: "1", memory: "1Gi"}
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
            periodSeconds: 5
```
HPA on a custom metric (RPS per pod via Prometheus Adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: {name: payments-api}
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: payments-api}
  minReplicas: 6
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: rps_per_pod
        target:
          type: AverageValue
          averageValue: "120"
```
NetworkPolicy (ingress from the gateway namespace only; egress only where needed):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: {name: payments-restrict}
spec:
  podSelector: {matchLabels: {app: payments-api}}
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from: [{namespaceSelector: {matchLabels: {gw: ingress}}}]
  egress:
    - to: [{ipBlock: {cidr: 10.0.0.0/8}}]  # internal services
    - to: [{namespaceSelector: {matchLabels: {svc: psp-egress}}}]
```
15) Release checklist (prod-ready)
- Image signed, SBOM collected, vulnerabilities at acceptable level.
- Manifests pass policy-check (Kyverno/OPA), minimum privileges.
- Readiness/startup probes correct; `PDB` and `PodPriority` configured.
- Canary plan: 5%→10%→25%→50%→100% with SLO gates and auto-rollback.
- HPA/KEDA + Cluster Autoscaler; warm-pool nodes for the event.
- Secrets from Vault/SM; configs versioned; feature flags ready for graceful degradation.
- End-to-end tracing enabled; alerts on SLIs (deposit/bet/withdrawal).
- DR plan and cutover "button" rehearsed in staging; backups/PITR tested.
- Documentation: how to roll back, how to switch PSP/game provider, who to call at night.
16) Anti-patterns and typical traps
Readiness grace period too short → early 5xx during rollout.
A single DB connection pool without limits → connection avalanche.
Secrets in environment variables without rotation → leaks.
Mesh without limits/timeouts → hangs on degraded providers.
HPA on CPU only → WS/API do not scale out in time.
Summary
Deployment strategies in iGaming combine reliable container practices (secure images, admission policies), smart releases (canary/blue-green with SLO gates), the right autoscaling (HPA/KEDA plus warm nodes for peaks), operators for stateful workloads, and multi-region DR. Add GitOps, tracing through payments and game providers, network policies, and cost control through specialized NodePools, and your releases become predictable, fast, and safe for both the money and the players.