Realtime Feed Events: Architecture and Security
A real-time event feed is the "circulatory system" of products with gamification, anti-fraud, CRM triggers, and payments. For it to behave predictably at peak load, avoid duplicate awards, and not leak data, a strict architecture is needed: from protocols and the message bus to signing, idempotency, isolation bulkheads and budgets, and observability.
1) Objectives and requirements
Reliability: event delivery with minimal lag (p95 ≤ 250 ms at ingest, p95 ≤ 1-2 s to the consumer).
Delivery: at-least-once transport + logical exactly-once through idempotency.
Ordering: per-key ordering (usually 'user_id') with reordering windows.
Security: MTLS, HMAC/JWT, key rotation, replay/duplicate protection, PII minimization.
Scale: horizontal sharding, backpressure, rate limiting, DLQ/replay.
Manageability: schema registry, versioning, migrations without downtime.
Compliance: audit (WORM), RG/KYC gates, geo-policies, GDPR/anonymization.
2) Reference architecture (layer by layer)
1. Producers (sources): game server, wallet/payment, KYC/AML, client SDKs (web/iOS/Android).
2. API Gateway (Ingress): HTTP/gRPC reception, schema validation, authentication, HMAC/MTLS, time normalization.
3. Queue/Stream: Kafka/Rabbit/Cloud Pub/Sub - buffering, partitioning by 'user_id'.
4. Stream Processor: normalizer, enrichment, topic routing, anomaly/anti-bots flags, idempotent recording.
5. Consumers: Rules/Scoring, Reward Orchestrator, CRM/CDP connectors, anti-fraud, analytical sinks.
6. Storage: immutable events, data marts, idempotency index, audit log.
7. Observability: metrics, logs, traces, alerts; failures → DLQ.
8. Admin Plane: schema registry, keys/secrets, RBAC/ABAC, feature flags, replay service.
3) Delivery protocols: when to use what
HTTP/JSON (server-to-server webhooks): simple, compatible, good for external partners. Add HMAC + idempotency.
gRPC/Protobuf: low latency, strict schemas, bidirectional streams for internal services.
WebSocket/SSE: push to client, UI subscriptions to leaderboards and progress.
CDC/Kafka Connect: when sources are databases/wallets, not business services.
Recommendation: outer perimeter - HTTP + HMAC + MTLS; inside - gRPC/Protobuf.
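A minimal sketch of the outer-perimeter HMAC signing recommended above. The header names ('X-Timestamp', 'X-Key-Id', 'X-Signature') follow the examples used later in this article; the function names are illustrative, and the MAC is computed over the raw body, matching the verification snippet in section 8:

```python
import hashlib
import hmac
import time

def sign_webhook(body: bytes, secret: bytes, kid: str) -> dict:
    """Headers for an outgoing webhook: MAC over the raw body, plus timestamp and key id."""
    return {
        "X-Timestamp": str(int(time.time())),
        "X-Key-Id": kid,
        "X-Signature": hmac.new(secret, body, hashlib.sha256).hexdigest(),
    }

def verify_webhook(body: bytes, headers: dict, secret: bytes, ttl: int = 300) -> bool:
    """Reject stale (anti-replay) or tampered payloads; constant-time MAC comparison."""
    if abs(time.time() - int(headers["X-Timestamp"])) > ttl:
        return False  # timestamp outside the allowed ±TTL window
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["X-Signature"])
```

Signing the raw bytes (not a re-serialized JSON) avoids canonicalization mismatches between producer and verifier.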
4) Event and Convention Model
```json
{
  "event_id": "e_01HF3Z8Z7Q8Q2K",
  "event_type": "bet",
  "schema_version": "1.3.0",
  "occurred_at": "2025-10-24T11:37:21Z",
  "ingested_at": "2025-10-24T11:37:21.183Z",
  "key": { "user_id": "u_12345" },
  "ctx": {
    "session_id": "s_778",
    "platform": "ios",
    "geo": "TR",
    "device_fp": "fp_4a1..."
  },
  "payload": {
    "game_id": "slot_wolf",
    "bet": 0.5,
    "win": 1.25,
    "currency": "EUR",
    "provider": "GameCo"
  },
  "sig": {
    "algo": "HMAC-SHA256",
    "kid": "k_2025_10",
    "ts": 1730061441,
    "mac": "c7b7b3...f1"
  }
}
```
Rules:
- There are two timestamps: 'occurred_at' (source) and 'ingested_at' (gateway). Allow clock drift of ±300 s.
- The routing key determines the order (usually 'user_id').
- PII goes only into 'ctx'/'payload' under the minimization principle; tokenize sensitive attributes.
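The envelope rules above can be enforced at the gateway with a simple validator. This is a sketch under stated assumptions: the required field list mirrors the example event, and the numeric 'sig.ts' is used for the drift check:

```python
# Required top-level fields, per the example event envelope above
REQUIRED = ("event_id", "event_type", "schema_version",
            "occurred_at", "ingested_at", "key")
MAX_SKEW_S = 300  # allowed clock drift per the rules above

def validate_envelope(event: dict, now_ts: float) -> list:
    """Return a list of validation errors; an empty list means the envelope is valid."""
    errors = ["missing field: " + f for f in REQUIRED if f not in event]
    ts = event.get("sig", {}).get("ts")
    if ts is not None and abs(now_ts - ts) > MAX_SKEW_S:
        errors.append("timestamp outside the ±300 s window (replay or clock drift)")
    return errors
```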
5) Delivery, order and idempotence
At-least-once transport → duplicates and reordering are possible.
Exactly-once logic: keep an idempotency table with a unique index on '(event_id)' and/or '(user_id, source_seq)'; a repeat delivery is a no-op.
SQL sketch:

```sql
CREATE TABLE event_log (
  event_id TEXT PRIMARY KEY,
  user_id TEXT,
  event_type TEXT,
  occurred_at TIMESTAMPTZ,
  payload JSONB
);

-- duplicate-protected insert
INSERT INTO event_log(event_id, user_id, event_type, occurred_at, payload)
VALUES (:event_id, :user_id, :event_type, :occurred_at, :payload)
ON CONFLICT (event_id) DO NOTHING;
```
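The same duplicate-protected insert can be exercised end to end with SQLite (which supports the same `ON CONFLICT ... DO NOTHING` syntax); the table is a simplified version of the sketch above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log ("
    " event_id TEXT PRIMARY KEY, user_id TEXT, event_type TEXT, payload TEXT)"
)

def ingest(conn: sqlite3.Connection, event: dict) -> bool:
    """Insert an event; a redelivery with the same event_id is a silent no-op."""
    cur = conn.execute(
        "INSERT INTO event_log(event_id, user_id, event_type, payload) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(event_id) DO NOTHING",
        (event["event_id"], event["user_id"], event["event_type"], event["payload"]),
    )
    return cur.rowcount == 1  # True only for the first delivery
```

The boolean return lets downstream logic (reward orchestration, CRM triggers) fire exactly once per logical event.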
Order: partition the stream by 'user_id' and keep a 60-120 s reordering window in the processor. Events arriving after the window are handled by corrective replay functions.
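A per-partition reordering window can be sketched as a min-heap keyed by source timestamp: events are held for the window duration and released oldest first, so late arrivals inside the window come out in order. The class and method names are illustrative:

```python
import heapq
import itertools

class ReorderBuffer:
    """Per-partition buffer: hold events for `window_s` seconds, then release
    them in occurred_at order; late arrivals inside the window get re-sorted."""
    def __init__(self, window_s: float = 90.0):
        self.window_s = window_s
        self._heap = []
        self._seq = itertools.count()  # tiebreaker for equal timestamps

    def push(self, occurred_at: float, event: dict) -> None:
        heapq.heappush(self._heap, (occurred_at, next(self._seq), event))

    def drain(self, now: float) -> list:
        """Emit every event older than the reordering window, oldest first."""
        out = []
        while self._heap and self._heap[0][0] <= now - self.window_s:
            out.append(heapq.heappop(self._heap)[2])
        return out
```

Anything still in the heap after a partition rebalance should be flushed; events older than the window on arrival go to the corrective replay path instead.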
6) Backpressure and peak management
Token-bucket rate limiting at ingress (per-IP, per-partner, per-key).
Circuit breaker: on 5xx from internal consumers, degrade gracefully (drop optional events, increase retry intervals in queues).
DLQ: permanently failing messages (malformed payload, invalid signature, signature TTL exceeded).
Replay service: selective DLQ replay by 'event_id'/time range.
7) Schemes and evolution: how not to "break" the prod
Schema Registry: JSON Schema/Protobuf; compatibility policies: backward for producers, forward for consumers.
Versioning: 'schema_version'; major versions only via feature flag and dual-write.
Contracts: promotion of the scheme after the canary period and green metrics.
Example validation rule (YAML):

```yaml
compatibility:
  enforce: true
  mode: backward
blocked_fields:
  - payload.ssn
  - payload.card_number
required_fields:
  - event_id
  - event_type
  - occurred_at
```
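A registry gateway can enforce such a policy before admitting a message. A minimal sketch, assuming the blocked fields are one level deep under 'payload' as in the rule above:

```python
# Mirrors the YAML policy: blocked PII paths and required top-level fields
BLOCKED_FIELDS = (("payload", "ssn"), ("payload", "card_number"))
REQUIRED_FIELDS = ("event_id", "event_type", "occurred_at")

def check_policy(event: dict) -> list:
    """Return policy violations; an empty list means the event may be admitted."""
    errors = ["missing: " + f for f in REQUIRED_FIELDS if f not in event]
    for parent, child in BLOCKED_FIELDS:
        if child in event.get(parent, {}):
            errors.append("blocked field present: %s.%s" % (parent, child))
    return errors
```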
8) Threat model and protection
Threats: body spoofing, replay, PII leak, key compromise, schema-poisoning, DoS, MITM, signature-stripping.
Protection:
- MTLS on the perimeter: short-lived client certificates, CRL/OCSP.
- HMAC body signature + 'X-Timestamp' and TTL (± 300 s).
- JWT (client credentials/OAuth2) - for authorization and scope restrictions.
- Key rotation (KMS): 'kid' in the header; rotation every 30-90 days; accept both keys during the migration window.
- Nonce/idempotency: 'X-Request-Id' for side-effect requests (payments, bonuses); store with a TTL.
- Content-Type pinning, max body size, allow-list IP/ASN for critical integrations.
- WORM audit of all incoming raw-payload + headers (unchangeable storage).
```python
body = request.raw_body
ts = int(request.headers["X-Timestamp"])
assert abs(now() - ts) <= 300  # anti-replay
kid = request.headers["X-Key-Id"]
secret = kms.fetch(kid)
mac = hmac_sha256(secret, body)
assert hmac_eq(mac, request.headers["X-Signature"])
```
9) Privacy, PII and RG/KYC
Minimization: pass PII as a tokenized reference (valid 5-15 minutes) instead of inline; since raw logs are immutable, they cannot be redacted or anonymized after the fact - keep PII in separate stores.
Access: ABAC by jurisdiction and role attributes; all reads - to the audit log.
GDPR: exercise the right to delete via key-mapping to erase the PII without breaking the facts of events.
RG/KYC: events that trigger valuable rewards pass only with a valid KYC level and RG flags "OK."
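The key-mapping approach to GDPR erasure can be sketched as a separate PII vault: events carry only an opaque token, and deletion removes the mapping while the immutable event facts survive. The class and token format are illustrative:

```python
import uuid

class PiiVault:
    """Separate PII store. Events reference PII by opaque token; GDPR erasure
    deletes the mapping, leaving the immutable event log untouched."""
    def __init__(self):
        self._store = {}

    def tokenize(self, pii: dict) -> str:
        """Store PII and return the token to embed in the event."""
        token = "pii_" + uuid.uuid4().hex
        self._store[token] = pii
        return token

    def resolve(self, token: str):
        """Authorized lookup (should itself be written to the audit log)."""
        return self._store.get(token)

    def erase(self, token: str) -> None:
        """Right-to-erasure: after this, the token resolves to nothing."""
        self._store.pop(token, None)
```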
10) Observability and SLO
SLO (example):
- Ingest p95 ≤ 250 ms; end-to-end p95 ≤ 2 s; delivery failure ≤ 0.1%/day.
- Signature errors (HMAC/JWT) ≤ 0.02% of total flow.
- DLQ fill rate ≤ 0.1%; duplicates after idempotency ≤ 0.005%.
Metrics:
- RPS per source, p50/p95 latency, 4xx/5xx, signature errors, time skew.
- Lag per partition, processed/sec, DLQ fill, retries, replay volumes.
- Schemas: share of messages per version, compatibility violations.
- Security: rate-throttle hits, circuit-breaker trips, IP/ASN anomalies.
- SRM of the flow (sharp traffic skew from a single source).
Alerts:
- Latency p95 > target for 5+ min; DLQ growth > X/min.
- Signature errors > Y ppm; 'X-Request-Id' replay spikes.
- Clock drift > 120 s at a source.
11) Multi-region and fault tolerance
Active-Active regions, global routing (GeoDNS/Anycast), sticky key by 'user_id' → region.
Cross-region topic replication for critical events (monetary, KYC).
Blast radius: isolation by tenant/brand, separate budgets and keys.
DR-plan: RPO ≤ 5 min, RTO ≤ 30 min; regular rehearsals.
12) Retention and replay policies
Raw events: 7-30 days (cost-driven); aggregates/data marts - longer.
Replay is allowed only by a signed runbook (who, what, why, time range).
Replays always go to a new stream version with a 'replayed = true' flag for analytical transparency.
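Tagging a re-emitted event can be as simple as copying it with traceability flags; 'replay_of' and 'runbook_id' are illustrative field names, with the runbook id tying the replay to the signed runbook that approved it:

```python
def replay_event(event: dict, runbook_id: str) -> dict:
    """Copy an event into the replay stream with traceability flags set.
    The original event is left unmodified (events are immutable)."""
    out = dict(event)
    out["replayed"] = True          # analytical transparency flag
    out["replay_of"] = event["event_id"]  # illustrative: link back to the original
    out["runbook_id"] = runbook_id        # illustrative: who approved this replay
    return out
```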
13) Configuration examples
Ingress limits (NGINX style):

```nginx
limit_req_zone $binary_remote_addr zone=req_limit:10m rate=300r/s;
limit_req zone=req_limit burst=600 nodelay;
client_max_body_size 512k;
proxy_read_timeout 5s;
```
Kafka (example):

```properties
num.partitions=64
min.insync.replicas=2
acks=all
# 7 days
retention.ms=604800000
compression.type=zstd
```
Key policy (KMS):

```yaml
rotation_days: 45
grace_period_days: 7
allow_algos: ["HMAC-SHA256"]
key_scopes:
  - topic: "wallet_events"
    producers: ["wallet-svc"]
    consumers: ["ledger-svc", "risk-svc"]
```
14) Real-time feed launch checklist
- MTLS on perimeter, HMAC/JWT, key rotation ('kid').
- Idempotency in logic (unique keys, upsert/ON CONFLICT).
- Partitioning by 'user_id', reordering window, replay service.
- Schema registry + compatibility policies; dual-write for major updates.
- Rate limiting, circuit breakers, DLQ + manual review.
- Observability: SLO, alerts by signature/latency/DLQ/lag.
- PII/anonymization policy, ABAC, WORM audit.
- DR/multi-region, failover rehearsals.
- Runbooks: incidents, replay, schema/key rollback.
15) Mini Case (Synthetic)
Context: tournament peak, 120k RPS at ingress, 64 games, 2 Active-Active regions.
Results over 4 weeks: ingest p95 210 ms, e2e p95 1.6 s; DLQ 0.05%; signature errors 0.009%; duplicates after idempotency 0.003%.
Incident: a partner's clock drifted (−9 min) → anti-replay spike. The circuit breaker diverted the stream to a buffer endpoint, and partner-health alerts notified the CSM; after the NTP fix, a 12-minute window was replayed to all consumers. No losses, no double payouts.
16) Summary
A reliable real-time feed is not "just webhooks." It is a layered system with clear contracts: at-least-once transport plus logical exactly-once, a schema registry with versioning, MTLS/HMAC/JWT with key rotation, backpressure/DLQ/replay, PII minimization, and strict auditing. Following these practices yields a fast, secure, and predictable event stream on which you can confidently build gamification, anti-fraud, CRM, and payments.