API and Infrastructure Monitoring Tools
1) Principles: from goals to tools
SLO-first: choose and tune tools around product journeys (login, deposit, bet), not the other way around.
Open standards: OpenTelemetry (traces/metrics/logs), the Prometheus exposition format, JSON logs in Loki.
Single context: `trace_id`/`span_id` in logs and metrics; "dashboard → trace → log" links.
Cost-aware: metric cardinality, log TTLs and trace sampling are planned in advance.
2) Metrics: collection, storage, visualization
Collection: Prometheus or agent mode (VictoriaMetrics Agent, Grafana Agent, OpenTelemetry Collector).
Storage (TSDB): Prometheus (single node), Thanos/Cortex/Mimir (scale-out), VictoriaMetrics (CPU/RAM savings).
Visualization: Grafana as the single pane of glass.
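A minimal OpenTelemetry Collector pipeline in the spirit of this setup: scrape a Prometheus endpoint and remote-write the metrics into your TSDB of choice. The scrape target and remote-write endpoint are placeholders.
```yaml
# Sketch: OTEL Collector in agent mode, metrics only.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: igaming-api          # placeholder job
          static_configs:
            - targets: ['api:9090']      # placeholder target
processors:
  batch: {}                              # batch before export
exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.internal/api/v1/write  # placeholder TSDB endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```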
What to measure for the API (RED) and infrastructure (USE):
- RED: `rate(requests)`, `error_ratio`, `latency p95/p99` by `route`, `method`, `provider`.
- USE: CPU/Mem, file descriptors, connection pools, queue lag, GC pauses.
- k8s: kube-state-metrics, node-exporter, cAdvisor, ingress/gateway exporters.
- DB/caches: postgres_exporter, mysql_exporter, redis_exporter, kafka_exporter, rabbitmq_exporter.
- Service mesh: Envoy metrics, Istio/Linkerd dashboards.
- PSP/external: custom exporters (webhook success, PSP success ratio, callback latency).
```promql
# Deposit success rate (SLI)
sum(rate(ig_payments_requests_total{route="/payments/deposit",status=~"2.."}[5m]))
/
sum(rate(ig_payments_requests_total{route="/payments/deposit"}[5m]))

# API p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# DB connection pool saturation
db_connections_in_use / db_connections_max
```
3) Logs: search, correlation, immutability
Stack: OpenSearch/Elasticsearch + Beats/Vector/Fluent Bit or Grafana Loki (cheaper to store, log-as-stream).
Format: JSON with standard fields `ts`, `level`, `service`, `env`, `trace_id`, `user_pid`, `route`, `status`, `latency_ms`.
Practices: PII masking, WORM audit buckets, TTL/ILM policies, partitioning by `env/region/brand`.
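A partial Promtail pipeline sketch for the log schema above: parse the JSON line, promote only low-cardinality labels, and mask an e-mail before shipping to Loki. The job name, path and masking regex are illustrative; keep `trace_id` as a log field, not a Loki label.
```yaml
# Sketch: Promtail scrape config with a JSON + masking pipeline.
scrape_configs:
  - job_name: igaming-api                 # placeholder job
    static_configs:
      - targets: [localhost]
        labels:
          job: igaming-api
          __path__: /var/log/igaming/*.json   # placeholder path
    pipeline_stages:
      - json:
          expressions:                    # extract fields from the JSON line
            level: level
            service: service
            trace_id: trace_id
      - labels:                           # promote only low-cardinality fields
          level:
          service:
      - replace:                          # mask e-mails in the log line (illustrative regex)
          expression: '(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)'
          replace: '***'
```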
4) Tracing: where milliseconds are lost
Stack: OpenTelemetry SDK/Collector → Jaeger/Tempo/Honeycomb/New Relic Traces.
Sampling policy: 100% of errors, an elevated rate for "slow" requests, 1-5% of successful ones (see the Collector sketch below).
iGaming tags: `provider`, `psp`, `risk_decision`, `bonus_id`, `market`, `ws_table_id`.
A quick debugging recipe: from the red SLO graph → the trace of the problem route → the "fat" span on a PSP/game provider → the webhook log.
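The same sampling policy expressed as OpenTelemetry Collector `tail_sampling` configuration (contrib distribution). The latency threshold and baseline percentage are illustrative, and the processor still has to be added to your traces pipeline.
```yaml
# Sketch: tail-based sampling — keep errors, keep slow traces, sample the rest.
processors:
  tail_sampling:
    decision_wait: 10s                    # wait for spans of a trace to arrive
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 400}      # illustrative "slow" threshold
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}   # 1-5% of the rest
```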
5) APM platforms: when all-in-one
Commercial solutions (Datadog, New Relic, Dynatrace, Grafana Cloud) cover APM, logs, traces, synthetics and RUM.
Pros: speed of implementation, correlation out of the box. Cons: cost and vendor lock-in.
Hybrid: keep the core on OSS (Prometheus + Grafana + Tempo + Loki) and top up synthetics/alerting with commercial modules on critical flows.
6) Synthetics and RUM: "outside" and "through the player's eyes"
Synthetics: Checkly, Grafana Synthetic Monitoring, k6 Cloud, Uptrends, Pingdom, Catchpoint, ThousandEyes.
Scripts: login → deposit (sandbox) → game launch → webhook check.
Geo: EU/LatAm/MEA/APAC, mobile networks, ASN mix.
RUM: Web-SDK (TTFB/LCP/CLS), mobile SDK; segmentation by country/network/device.
7) Kubernetes monitoring surfaces
Control plane: etcd, API server (`apiserver_request_total`, latency; see the recording-rule sketch after this list), scheduler/controller-manager.
Data plane: kubelet, CNI, ingress/gateway; `PodDisruptionBudget` and evictions.
Autoscale: HPA/VPA/Cluster Autoscaler metrics and events; warm pools.
Network policies: drops/deny events, DNS latency.
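Illustrative Prometheus recording rules for the control-plane signals above. The metric names are standard kube-apiserver instrumentation; the rule names are just a suggested convention.
```yaml
# Sketch: control-plane recording rules.
groups:
  - name: control-plane
    rules:
      - record: apiserver:request_error_ratio:rate5m
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))
      - record: apiserver:request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le, verb))
```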
8) Databases, queues, caches
Postgres/MySQL: replication lag, deadlocks, bloat, WAL, checkpoint duration, timeouts.
Kafka/RabbitMQ: consumer lag, rebalances, queue depth, redeliveries.
Redis: evictions, blocked clients, latency percentiles, replica lag.
PITR/backups: backup operator tasks + time-to-restore dashboard.
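A hedged example of a queue-lag alert for the Kafka signals above, assuming the `kafka_consumergroup_lag` metric from kafka_exporter (lag in messages, not seconds); the threshold is a placeholder to tune per topic.
```yaml
# Sketch: consumer-lag alert, assuming kafka_exporter metric names.
groups:
  - name: queues
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000   # placeholder threshold
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```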
9) Network, CDN, WAF, game providers and PSPs
CDN/Edge: hit-ratio, TTFB by region, shield hit, "miss storm."
WAF/bot manager: share of challenges/blocks, ASN/country breakdown, false-positive rate on login/deposit.
Game providers: table/slot start time, failure/timeouts by studio.
PSP: success ratio/latency by method/country/BIN, 3DS/AVS error codes, webhooks success & delay.
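A recording-rule sketch for the PSP success ratio above. `ig_psp_requests_total` and its labels are assumed custom instrumentation in the spirit of the `ig_payments_requests_total` metric from section 2, not a standard exporter metric.
```yaml
# Sketch: PSP success ratio by method/country, from hypothetical instrumentation.
groups:
  - name: psp
    rules:
      - record: psp:success_ratio:rate5m
        expr: |
          sum(rate(ig_psp_requests_total{status="success"}[5m])) by (psp, method, country)
          /
          sum(rate(ig_psp_requests_total[5m])) by (psp, method, country)
```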
10) Alerting and on-call
Routing: Alertmanager → PagerDuty/Opsgenie/Slack.
Rules: symptomatic (SLO) + causal (resources).
Anti-noise: grouping, inhibition of cascading alerts, silence windows during releases.
SLO gates in CD: auto-pause/rollback on violations (Argo Rollouts/Flagger AnalysisRun).
Examples of alerts (simplified; see the rule sketch after this list):
- `login_success_ratio < 99.9% for 10m`
- `p95 /payments/deposit > 0.4s for 10m`
- `db_connections_saturation > 0.85 for 5m`
- `kafka_consumer_lag > 30s`
- `cdn_hit_ratio drop > 15% in 10m (per region)`
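The first two examples written out as Prometheus alerting rules. `job:login_success_ratio:rate5m` is an assumed recording rule you would define from your own login SLI; the latency expression reuses the histogram from section 2.
```yaml
# Sketch: symptomatic (SLO) alerts.
groups:
  - name: slo-symptoms
    rules:
      - alert: LoginSuccessRatioLow
        expr: job:login_success_ratio:rate5m < 0.999   # assumed recording rule
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "Login success ratio below SLO"
      - alert: DepositP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{route="/payments/deposit"}[5m])) by (le)) > 0.4
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "p95 latency on /payments/deposit above 400ms"
```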
11) Dashboards that really help
Deposit flow: funnel, p95/p99, PSP/BIN/country errors, webhook delay.
Live games/WS: connections, RTT, resend/reconnect, errors by provider.
API health: RED by routes, saturations, top slow endpoints ↔ trace.
DR panel: replication lag, WAL shipping, synthetic login/deposit from DR region.
Security: WAF, bot score, 401/403 anomalies, signed webhooks.
12) Telemetry Cost Management
Metric cardinality: never put `user_id` into labels; cap the value sets of `route` and `provider` (see the scrape-config sketch below).
Downsampling and retention classes (hot 7-14 days, warm 30-90, cold archive).
Logs: on event bursts, enable sampling/dedup; store stack traces separately.
Traces: dynamic sampling on the "expensive" paths (payments/withdrawals).
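A minimal Prometheus scrape-level guard for cardinality, assuming a hypothetical `user_id` label and a `debug_*` metric prefix: drop both before they reach the TSDB.
```yaml
# Sketch: cardinality control at scrape time.
scrape_configs:
  - job_name: igaming-api                 # placeholder job
    static_configs:
      - targets: ['api:9090']             # placeholder target
    metric_relabel_configs:
      - action: labeldrop                 # strip the hypothetical per-user label
        regex: user_id
      - source_labels: [__name__]         # drop whole debug metric families
        regex: 'debug_.*'
        action: drop
```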
13) Security and privacy in monitoring
mTLS from agents to collectors; at-rest encryption.
Pseudonymize `user_pid`; no e-mail, phone or document numbers in the logs.
RBAC/MFA, WORM for audit; DPA with third-party monitoring providers.
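A minimal sketch of the agent-to-collector mTLS mentioned above, using the OpenTelemetry Collector OTLP exporter; the gateway endpoint and file paths are placeholders for your own gateway and PKI.
```yaml
# Sketch: mTLS from an agent-mode Collector to the central gateway.
exporters:
  otlp:
    endpoint: otel-gateway.internal:4317   # placeholder gateway address
    tls:
      ca_file: /etc/otel/ca.crt            # placeholder PKI paths
      cert_file: /etc/otel/client.crt
      key_file: /etc/otel/client.key
```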
14) Integration with CI/CD and auto rollback
Expose SLIs as Prometheus metrics for CD analysis runs.
Release labels (`version`, `rollout_step`) in metrics/logs/traces.
Automatic canary gates: the rollout proceeds only while the SLOs stay green (see the AnalysisTemplate sketch below).
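A sketch of an Argo Rollouts `AnalysisTemplate` that gates a canary on the deposit SLI from section 2; the Prometheus address, interval and success threshold are placeholders.
```yaml
# Sketch: canary analysis gated on the deposit success rate.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: deposit-success-rate
spec:
  metrics:
    - name: deposit-success-rate
      interval: 1m
      failureLimit: 2
      successCondition: "result[0] >= 0.995"   # placeholder SLO threshold
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # placeholder address
          query: |
            sum(rate(ig_payments_requests_total{route="/payments/deposit",status=~"2.."}[5m]))
            /
            sum(rate(ig_payments_requests_total{route="/payments/deposit"}[5m]))
```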
15) Fast start stack (reference)
Collection/transport: OTEL Collector + Prometheus/VM Agent + Fluent Bit.
Storage: VictoriaMetrics/Thanos (metrics), Loki/OpenSearch (logs), Tempo/Jaeger (traces).
Visualization: Grafana + ready-made dashboards k8s/Envoy/Postgres.
Synthetics & RUM: Checkly/k6 + Grafana RUM (or a commercial equivalent).
Alerting: Alertmanager → PagerDuty/Slack; runbooks in references.
16) Implementation checklist (prod-ready)
- SLO/SLI defined for login/deposit/bet/withdrawal.
- RED/USE + business SLI metrics; a single label ontology.
- JSON logs with `trace_id`, PII masking, WORM for audit.
- OpenTelemetry end-to-end; 100% sampling of errors.
- Synthetics from key regions + RUM in production.
- Dashboards "flow deposit," "WS," "API health," "DR."
- Alerting: SLO symptoms + resource causes; anti-noise.
- SLO gates are connected to the CD; auto-rollback.
- Cost plan: retention/sampling/cardinality.
- DPA/security: mTLS, RBAC, log privacy.
Summary
Strong monitoring is not a set of "pretty graphs" but a coherent system: RED/USE metrics, logs with `trace_id`, OpenTelemetry traces, synthetics and RUM, plus dashboards, alerts and SLO gates built into your CI/CD. Build the stack around open standards, control the cost of telemetry and standardize the label ontology - then API and infrastructure problems become visible early and get fixed before players notice them.