API and Infrastructure Monitoring Tools
1) Principles: from goals to tools
SLO-first: choose and tune tools around product journeys (login, deposit, bet), not the other way around.
Open standards: OpenTelemetry (traces/metrics/logs), the Prometheus exposition format, JSON logs in Loki.
Single context: `trace_id`/`span_id` in logs and metrics; "dashboard → trace → log" links.
Cost-aware: metric cardinality, log TTLs and trace sampling are planned in advance.
2) Metrics: collection, storage, visualization
Collection: Prometheus or agent mode (VictoriaMetrics Agent, Grafana Agent, OpenTelemetry Collector).
Storage (TSDB): Prometheus (single node), Thanos/Cortex/Mimir (scale-out), VictoriaMetrics (CPU/RAM savings).
Visualization: Grafana as the single pane of glass.
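A minimal OpenTelemetry Collector pipeline in the spirit of this setup: scrape a Prometheus endpoint and remote-write the metrics into your TSDB of choice. The scrape target and remote-write endpoint are placeholders.
```yaml
# Sketch: OTEL Collector in agent mode, metrics only.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: igaming-api          # placeholder job
          static_configs:
            - targets: ['api:9090']      # placeholder target
processors:
  batch: {}                              # batch before export
exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.internal/api/v1/write  # placeholder TSDB endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```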
What to measure for the API (RED) and infrastructure (USE):
- RED: `rate(requests)`, `error_ratio`, `latency p95/p99` by `route`, `method`, `provider`.
- USE: CPU/Mem, file descriptors, connection pools, queue lag, GC pauses.
- k8s: kube-state-metrics, node-exporter, cAdvisor, ingress/gateway exporters.
- DB/caches: postgres_exporter, mysql_exporter, redis_exporter, kafka_exporter, rabbitmq_exporter.
- Service mesh: Envoy metrics, Istio/Linkerd dashboards.
- PSP/external: custom exporters (webhook success, PSP success ratio, callback latency).
```promql
# Deposit success rate (SLI)
sum(rate(ig_payments_requests_total{route="/payments/deposit",status=~"2.."}[5m]))
/
sum(rate(ig_payments_requests_total{route="/payments/deposit"}[5m]))

# API p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# DB connection pool saturation
db_connections_in_use / db_connections_max
```
3) Logs: search, correlation, immutability
Stack: OpenSearch/Elasticsearch + Beats/Vector/Fluent Bit or Grafana Loki (cheaper to store, log-as-stream).
Format: JSON with standard fields `ts`, `level`, `service`, `env`, `trace_id`, `user_pid`, `route`, `status`, `latency_ms`.
Practices: PII masking, WORM audit buckets, TTL/ILM policies, partitioning by `env/region/brand`.
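A partial Promtail pipeline sketch for the log schema above: parse the JSON line, promote only low-cardinality labels, and mask an e-mail before shipping to Loki. The job name, path and masking regex are illustrative; keep `trace_id` as a log field, not a Loki label.
```yaml
# Sketch: Promtail scrape config with a JSON + masking pipeline.
scrape_configs:
  - job_name: igaming-api                 # placeholder job
    static_configs:
      - targets: [localhost]
        labels:
          job: igaming-api
          __path__: /var/log/igaming/*.json   # placeholder path
    pipeline_stages:
      - json:
          expressions:                    # extract fields from the JSON line
            level: level
            service: service
            trace_id: trace_id
      - labels:                           # promote only low-cardinality fields
          level:
          service:
      - replace:                          # mask e-mails in the log line (illustrative regex)
          expression: '(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)'
          replace: '***'
```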
4) Tracing: where milliseconds are lost
Stack: OpenTelemetry SDK/Collector → Jaeger/Tempo/Honeycomb/New Relic Traces.
Sampling policy: 100% of errors, an elevated rate for "slow" requests, 1-5% of successful ones (see the Collector sketch below).
iGaming tags: `provider`, `psp`, `risk_decision`, `bonus_id`, `market`, `ws_table_id`.
A quick debugging recipe: from the red SLO graph → the trace of the problem route → the "fat" span on a PSP/game provider → the webhook log.
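The same sampling policy expressed as OpenTelemetry Collector `tail_sampling` configuration (contrib distribution). The latency threshold and baseline percentage are illustrative, and the processor still has to be added to your traces pipeline.
```yaml
# Sketch: tail-based sampling — keep errors, keep slow traces, sample the rest.
processors:
  tail_sampling:
    decision_wait: 10s                    # wait for spans of a trace to arrive
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 400}      # illustrative "slow" threshold
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}   # 1-5% of the rest
```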
5) APM platforms: when all-in-one
Commercial solutions (Datadog, New Relic, Dynatrace, Grafana Cloud) cover APM, logs, traces, synthetics and RUM.
Pros: speed of implementation, correlation out of the box. Cons: cost and vendor lock-in.
Hybrid: keep the core on OSS (Prometheus + Grafana + Tempo + Loki) and top up synthetics/alerting with commercial modules on critical flows.
6) Synthetics and RUM: "outside" and "through the player's eyes"
Synthetics: Checkly, Grafana Synthetic Monitoring, k6 Cloud, Uptrends, Pingdom, Catchpoint, ThousandEyes.
Scripts: login → deposit (sandbox) → game launch → webhook check.
Geo: EU/LatAm/MEA/APAC, mobile networks, ASN mix.
RUM: Web-SDK (TTFB/LCP/CLS), mobile SDK; segmentation by country/network/device.
7) Kubernetes monitoring surfaces
Control plane: etcd, API server (`apiserver_request_total`, latency; see the recording-rule sketch after this list), scheduler/controller-manager.
Data plane: kubelet, CNI, ingress/gateway; `PodDisruptionBudget` and evictions.
Autoscale: HPA/VPA/Cluster Autoscaler metrics and events; warm pools.
Network policies: drops/deny events, DNS latency.
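Illustrative Prometheus recording rules for the control-plane signals above. The metric names are standard kube-apiserver instrumentation; the rule names are just a suggested convention.
```yaml
# Sketch: control-plane recording rules.
groups:
  - name: control-plane
    rules:
      - record: apiserver:request_error_ratio:rate5m
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))
      - record: apiserver:request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le, verb))
```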
8) Databases, queues, caches
Postgres/MySQL: replication lag, deadlocks, bloat, WAL, checkpoint duration, timeouts.
Kafka/RabbitMQ: consumer lag, rebalances, queue depth, redeliveries.
Redis: evictions, blocked clients, latency percentiles, replica lag.
PITR/backups: backup operator tasks + time-to-restore dashboard.
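A hedged example of a queue-lag alert for the Kafka signals above, assuming the `kafka_consumergroup_lag` metric from kafka_exporter (lag in messages, not seconds); the threshold is a placeholder to tune per topic.
```yaml
# Sketch: consumer-lag alert, assuming kafka_exporter metric names.
groups:
  - name: queues
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000   # placeholder threshold
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```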
9) Network, CDN, WAF, game providers and PSPs
CDN/Edge: hit-ratio, TTFB by region, shield hit, "miss storm."
WAF/bot manager: share of challenges/blocks, ASN/country breakdown, false-positive rate on login/deposit.
Game providers: table/slot start time, failure/timeouts by studio.
PSP: success ratio/latency by method/country/BIN, 3DS/AVS error codes, webhooks success & delay.
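A recording-rule sketch for the PSP success ratio above. `ig_psp_requests_total` and its labels are assumed custom instrumentation in the spirit of the `ig_payments_requests_total` metric from section 2, not a standard exporter metric.
```yaml
# Sketch: PSP success ratio by method/country, from hypothetical instrumentation.
groups:
  - name: psp
    rules:
      - record: psp:success_ratio:rate5m
        expr: |
          sum(rate(ig_psp_requests_total{status="success"}[5m])) by (psp, method, country)
          /
          sum(rate(ig_psp_requests_total[5m])) by (psp, method, country)
```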
10) Alerting and on-call
Routing: Alertmanager → PagerDuty/Opsgenie/Slack.
Rules: symptomatic (SLO) + causal (resources).
Anti-noise: grouping, inhibition of cascading alerts, silence windows during releases.
SLO gates in CD: auto-pause/rollback on violations (Argo Rollouts/Flagger AnalysisRun).
Examples of alerts (simplified; see the rule sketch after this list):
- `login_success_ratio < 99.9% for 10m`
- `p95 /payments/deposit > 0.4s for 10m`
- `db_connections_saturation > 0.85 for 5m`
- `kafka_consumer_lag > 30s`
- `cdn_hit_ratio drop > 15% in 10m (per region)`
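The first two examples written out as Prometheus alerting rules. `job:login_success_ratio:rate5m` is an assumed recording rule you would define from your own login SLI; the latency expression reuses the histogram from section 2.
```yaml
# Sketch: symptomatic (SLO) alerts.
groups:
  - name: slo-symptoms
    rules:
      - alert: LoginSuccessRatioLow
        expr: job:login_success_ratio:rate5m < 0.999   # assumed recording rule
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "Login success ratio below SLO"
      - alert: DepositP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{route="/payments/deposit"}[5m])) by (le)) > 0.4
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "p95 latency on /payments/deposit above 400ms"
```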
11) Dashboards that really help
Deposit flow: funnel, p95/p99, PSP/BIN/country errors, webhook delay.
Live games/WS: connections, RTT, resend/reconnect, errors by provider.
API health: RED by routes, saturations, top slow endpoints ↔ trace.
DR panel: replication lag, WAL shipping, synthetic login/deposit from DR region.
Security: WAF, bot score, 401/403 anomalies, signed webhooks.
12) Telemetry Cost Management
Metric cardinality: never put `user_id` into labels; cap the value sets of `route` and `provider` (see the scrape-config sketch below).
Downsampling and retention classes (hot 7-14 days, warm 30-90, cold archive).
Logs: on event bursts, enable sampling/dedup; store stack traces separately.
Traces: dynamic sampling on the "expensive" paths (payments/withdrawals).
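A minimal Prometheus scrape-level guard for cardinality, assuming a hypothetical `user_id` label and a `debug_*` metric prefix: drop both before they reach the TSDB.
```yaml
# Sketch: cardinality control at scrape time.
scrape_configs:
  - job_name: igaming-api                 # placeholder job
    static_configs:
      - targets: ['api:9090']             # placeholder target
    metric_relabel_configs:
      - action: labeldrop                 # strip the hypothetical per-user label
        regex: user_id
      - source_labels: [__name__]         # drop whole debug metric families
        regex: 'debug_.*'
        action: drop
```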
13) Security and privacy in monitoring
mTLS from agents to collectors; at-rest encryption.
Pseudonymize `user_pid`; no e-mail, phone or document numbers in the logs.
RBAC/MFA, WORM for audit; DPA with third-party monitoring providers.
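A minimal sketch of the agent-to-collector mTLS mentioned above, using the OpenTelemetry Collector OTLP exporter; the gateway endpoint and file paths are placeholders for your own gateway and PKI.
```yaml
# Sketch: mTLS from an agent-mode Collector to the central gateway.
exporters:
  otlp:
    endpoint: otel-gateway.internal:4317   # placeholder gateway address
    tls:
      ca_file: /etc/otel/ca.crt            # placeholder PKI paths
      cert_file: /etc/otel/client.crt
      key_file: /etc/otel/client.key
```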
14) Integration with CI/CD and auto rollback
Expose SLIs as Prometheus metrics for CD analysis runs.
Release labels (`version`, `rollout_step`) in metrics/logs/traces.
Automatic canary gates: the rollout proceeds only while the SLOs stay green (see the AnalysisTemplate sketch below).
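A sketch of an Argo Rollouts `AnalysisTemplate` that gates a canary on the deposit SLI from section 2; the Prometheus address, interval and success threshold are placeholders.
```yaml
# Sketch: canary analysis gated on the deposit success rate.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: deposit-success-rate
spec:
  metrics:
    - name: deposit-success-rate
      interval: 1m
      failureLimit: 2
      successCondition: "result[0] >= 0.995"   # placeholder SLO threshold
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # placeholder address
          query: |
            sum(rate(ig_payments_requests_total{route="/payments/deposit",status=~"2.."}[5m]))
            /
            sum(rate(ig_payments_requests_total{route="/payments/deposit"}[5m]))
```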
15) Fast start stack (reference)
Collection/transport: OTEL Collector + Prometheus/VM Agent + Fluent Bit.
Storage: VictoriaMetrics/Thanos (metrics), Loki/OpenSearch (logs), Tempo/Jaeger (traces).
Visualization: Grafana + ready-made dashboards k8s/Envoy/Postgres.
Synthetics & RUM: Checkly/k6 + Grafana RUM (or a commercial equivalent).
Alerting: Alertmanager → PagerDuty/Slack; runbooks in references.
16) Implementation checklist (prod-ready)
- SLO/SLI defined for login/deposit/bet/withdrawal.
- RED/USE + business SLI metrics; a single label ontology.
- JSON logs with `trace_id`, PII masking, WORM for audit.
- OpenTelemetry end-to-end; 100% sampling of errors.
- Synthetics from key regions + RUM in production.
- Dashboards "flow deposit," "WS," "API health," "DR."
- Alerting: SLO symptoms + resource causes; anti-noise.
- SLO gates are connected to the CD; auto-rollback.
- Cost plan: retention/sampling/cardinality.
- DPA/security: mTLS, RBAC, log privacy.
Summary
Strong monitoring is not a set of "pretty graphs" but a coherent system: RED/USE metrics, logs with `trace_id`, OpenTelemetry traces, synthetics and RUM, plus dashboards, alerts and SLO gates built into your CI/CD. Build the stack around open standards, control the cost of telemetry and standardize the label ontology - then API and infrastructure problems become visible early and get fixed before players notice them.