Data Lake and DWH for casinos: schemas and ingestion SLAs
1) Why a casino needs a Data Lake and a DWH
Reporting and compliance: regulatory uploads (GGR/NGR, KYC/AML, RG), audits of money flows.
Product/marketing: LTV/retention, segmentation, A/B, recommendations.
Operations: monitoring of providers and PSPs, SLAs for live games and the cashier.
Data platform: fast marts (DWH) on top of cheap long-term storage (Lake).
Bottom line: the Lake stores the raw and cleaned layers; the DWH provides fast queries and governed models.
2) Reference architecture (lakehouse)
Sources (OLTP, Kafka, Webhooks, CDC)
│
├─Bronze (raw, append-only; Parquet/Delta/Iceberg)
│ ingestion_time, source_metadata, no schema changes in place
├─Silver (cleaned, conformed; dedup, PII masking, SCD2)
│ business keys, constraints, quality checks
└─Gold (marts; star/snowflake; cube tables, aggregates)
└─DWH/Query Engines (Snowflake/BigQuery/Trino/Spark SQL)
Formats: Delta Lake / Apache Iceberg / Hudi (ACID in the lake, time travel, MERGE).
Files: Parquet + ZSTD/Snappy, target ~ 128-512 MB; "small file" compaction.
Catalog: Hive/Unity/Iceberg Catalog; `bronze/silver/gold` zones on buckets per region/tenant.
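To make the Bronze contract concrete, here is a minimal PySpark sketch of an append-only Bronze write; the bucket paths, source tag, and the choice of Delta as the table format are illustrative assumptions, not part of the reference architecture.

```python
# Minimal sketch of an append-only Bronze write (PySpark + Delta Lake).
# Paths, columns, and the source tag are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

raw = spark.read.json("s3://casino-landing-eu/bets/")  # hypothetical landing zone

(raw
 .withColumn("ingestion_time", F.current_timestamp())    # required Bronze metadata
 .withColumn("source_metadata", F.lit("kafka:bets.v1"))  # provenance tag
 .write
 .format("delta")            # or "iceberg"/"hudi"
 .mode("append")             # Bronze is append-only; never rewrite in place
 .partitionBy("region")      # buckets/partitions per region/tenant
 .save("s3://casino-lake-eu/bronze/bets/"))
```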
3) Domain schemas (conceptual)
3.1 Wallet/accounting
3.2 Bets/settlement (RGS/live)
`bet`: `bet_id`, `round_id`, `player_id`, `game_id`, `stake_minor`, `currency`, `placed_at`, `brand/region`, `provider_id`, `in_bonus`.
`settlement`: `settlement_id`, `bet_id`, `round_id`, `win_minor`, `settled_at`, `jackpot_hit`, `bonus_state`.
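As one way to pin these fields down in code, a PySpark schema for the `bet` event might look like the sketch below; the concrete types are assumptions, with amounts kept in integer minor units as the `_minor` suffix suggests.

```python
# Hypothetical explicit schema for `bet` events (types are assumptions).
from pyspark.sql import types as T

bet_schema = T.StructType([
    T.StructField("bet_id", T.StringType(), nullable=False),
    T.StructField("round_id", T.StringType(), nullable=False),
    T.StructField("player_id", T.StringType(), nullable=False),
    T.StructField("game_id", T.StringType(), nullable=False),
    T.StructField("stake_minor", T.LongType(), nullable=False),  # minor units, no floats
    T.StructField("currency", T.StringType(), nullable=False),   # ISO 4217
    T.StructField("placed_at", T.TimestampType(), nullable=False),
    T.StructField("brand", T.StringType()),
    T.StructField("region", T.StringType()),
    T.StructField("provider_id", T.StringType()),
    T.StructField("in_bonus", T.BooleanType()),
])
```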
3.3 Payments (cashier/PSP/crypto)
`payment_intent`: `intent_id`, `player_id`, `method`, `status`, `amount`, `currency`, `psp`, `created_at`.
`capture`/`refund`/`chargeback`: separate tables with references to `intent_id`, `psp_ref`, and reason codes.
Crypto: `txid`, `network`, `confirmations`, `finalized_at`.
3.4 Bonuses/wagering/jackpots
`bonus_grant`, `bonus_progress (wager)`, `jackpot_contribution`, `jackpot_payout`.
3.5 Reference data and dimensions
`dim_player` (pseudo-ID, geo, channels, RG statuses; no PII in analytics), `dim_game`, `dim_provider`, `dim_psp`, `dim_brand`, `dim_region`, calendar dimensions.
Keys and compatibility: Silver/Gold models use stable business keys (`bet_id`, `round_id`, `payout_id`, `intent_id`) and idempotent event semantics.
4) Ingestion flows: streaming + micro-batch
Streaming (Kafka/Pulsar → Bronze): OLTP and webhook events, outbox/CDC, at-least-once guarantees with deduplication in Silver.
CDC (Debezium/replication log): changes to OLTP tables (wallet/payments) → Bronze.
Micro-batches: PSP/bank/custom reports (SFTP/API) → Bronze raw files → normalization.
MERGE in Silver: dedup by `idempotency_key`/`event_id`, handling of late arrivals (`watermark`), SCD2 on dimensions.
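A minimal sketch of that Silver MERGE, assuming Delta Lake, an `event_id` dedup key, and a tie-break by `ingestion_time`; paths and column names are illustrative.

```python
# Idempotent load into Silver: dedup within the batch, then insert-only MERGE
# so replayed events become no-ops. Paths/columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("silver-merge").getOrCreate()

batch = spark.read.format("delta").load("s3://casino-lake-eu/bronze/bets/")

# Keep one row per event_id (latest by ingestion_time) inside the batch.
w = Window.partitionBy("event_id").orderBy(F.col("ingestion_time").desc())
deduped = (batch.withColumn("rn", F.row_number().over(w))
                .where("rn = 1").drop("rn"))

silver = DeltaTable.forPath(spark, "s3://casino-lake-eu/silver/bets/")
(silver.alias("t")
 .merge(deduped.alias("s"), "t.event_id = s.event_id")
 .whenNotMatchedInsertAll()   # at-least-once upstream, exactly-once effect here
 .execute())
```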
5) Ingestion SLAs and lateness windows (watermarks)
5.1 Typical SLA targets (guidelines)
Wallet/ledger events: Bronze ≤ 1-2 min, Silver ≤ 5-10 min, Gold marts ≤ 15 min.
Bets/settlements: Bronze ≤ 1-2 min, Silver ≤ 10 min, Gold ≤ 30 min.
Payments (PSP webhooks): Bronze ≤ 5 min, Silver ≤ 15 min, Gold ≤ 30-60 min.
Crypto finality: network-dependent; marts shown with a lag of N confirmations.
Daily PSP/bank reports: T+1 by 09:00 local time of the region.
5.2 Lateness windows
Watermark by event time (`occurred_at`) plus tolerances: wallet/bets 24-48 hours; payments/PSP 72 hours (retro webhooks do arrive); crypto up to 24 hours for rare reorgs.
Events arriving after the watermark are reprocessed: Gold windows are recomputed incrementally (MERGE), with a correction log.
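In Spark Structured Streaming, the tolerance maps directly to a watermark; a sketch assuming a hypothetical broker, topic, and event schema, with the 48-hour wallet/bets tolerance.

```python
# Sketch: event-time watermark in a streaming Silver job (Structured Streaming).
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("silver-bets-stream").getOrCreate()

schema = T.StructType([
    T.StructField("event_id", T.StringType()),
    T.StructField("game_id", T.StringType()),
    T.StructField("stake_minor", T.LongType()),
    T.StructField("occurred_at", T.TimestampType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "bets.v1")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Events older than the watermark fall out of streaming state; they are
# picked up by the batch reprocess path (incremental MERGE into Gold).
hourly = (events
          .withWatermark("occurred_at", "48 hours")
          .groupBy(F.window("occurred_at", "1 hour"), "game_id")
          .agg(F.sum("stake_minor").alias("stake_minor")))
```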
5.3 Communicating SLAs
The data catalog carries SLA attributes: `freshness_target`, `freshness_status`, `expected_lag_p95`, `watermark`.
Freshness dashboards with alerts on violations.
6) Data quality (DQ) and contracts
Data contracts per topic: Avro/JSON schemas, semver, required fields, business invariants (e.g., `win_minor ≥ 0`, `currency ∈ ISO 4217`).
DQ checks in Silver: key uniqueness, referential integrity, balance checks (wallet reconciliation), validity of PSP codes/reasons, date ranges.
Severity: `ERROR` (blocking), `WARN` (flagged), `INFO`.
Monitoring: % of violations, top reasons, automatic tickets.
Sampling & replay: keep raw Bronze for reprocessing.
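A bare-bones sketch of such a DQ gate in plain PySpark, using the severity semantics above; the table path, keys, and checks are illustrative (in practice this is what Great Expectations/Deequ/dbt tests encode).

```python
# Minimal Silver DQ gate: ERROR blocks the load, WARN only flags.
# Path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-dq").getOrCreate()
bets = spark.read.format("delta").load("s3://casino-lake-eu/silver/bets/")

checks = [
    ("ERROR", "bet_id is unique",
     bets.groupBy("bet_id").count().where("count > 1").count() == 0),
    ("ERROR", "stake_minor >= 0",
     bets.where("stake_minor < 0").count() == 0),
    ("WARN", "currency looks like ISO 4217",
     bets.where(~F.col("currency").rlike("^[A-Z]{3}$")).count() == 0),
]

failed = [(sev, name) for sev, name, ok in checks if not ok]
if any(sev == "ERROR" for sev, _ in failed):
    raise RuntimeError(f"Blocking DQ failures: {failed}")
```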
7) PII, residency, and security
The PII mart is kept separate from analytics: Silver/Gold carry pseudonyms, masking/hashes, tokenization.
Data residency: EU/UK/BR, etc. get physically separate buckets/catalogs; no cross-region reads without consent and proxying.
Access: RBAC/ABAC (Lake/DWH), row-level security by `tenant/brand/region`.
Encryption: at rest (KMS) and in transit, per-region/brand keys, WORM audit of access and policy changes.
Right to be forgotten: a mechanism to de-identify gameplay data without deleting financial records.
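One common way to implement the pseudonym column is a keyed hash; a sketch assuming a per-region key fetched from the KMS (the key literal here is a placeholder).

```python
# Deterministic pseudonymization: same player_id -> same pseudo-ID,
# so Silver/Gold joins still work without exposing PII.
import hashlib
import hmac

def pseudonymize(player_id: str, region_key: bytes) -> str:
    """Keyed hash (HMAC-SHA256); scope and rotate keys per region/brand."""
    return hmac.new(region_key, player_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Placeholder key; in practice this comes from the per-region KMS.
print(pseudonymize("player-123", b"eu-region-key"))
```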
8) Gold mart modeling (star schema)
8.1 Fact tables
`fact_bets`, `fact_wallet_entries`, `fact_payments`, `fact_bonus_wager`, `fact_jackpot`.
8.2 Dimensions
`dim_date/time`, `dim_player` (pseudonymous), `dim_game`, `dim_provider`, `dim_psp`, `dim_brand`, `dim_region`, `dim_currency`.
8.3 Metrics and calculations
GGR/NGR, hold/frequency, RTP (by game/provider/region), deposit conversion, settlement lag, PSP success rate, cost per success, FX PnL, jackpot contributions/payouts.
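For instance, GGR over a bets fact reduces to stakes minus wins; a sketch with a hypothetical path and columns (exact NGR terms, such as bonus cost and fees, vary by jurisdiction).

```python
# GGR per game/region from a Gold fact table (illustrative path/columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold-ggr").getOrCreate()
fact_bets = spark.read.format("delta").load("s3://casino-dwh/gold/fact_bets/")

ggr = (fact_bets
       .groupBy("game_id", "region")
       .agg((F.sum("stake_minor") - F.sum("win_minor")).alias("ggr_minor")))
ggr.show()
```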
9) Performance and cost
Partitioning: by `occurred_date` + `region/tenant`, sometimes `game_id` for Gold aggregates.
Clustering/Z-ORDER: by `player_id`, `game_id`, `psp`, `currency`.
Compaction and vacuum: scheduled `OPTIMIZE/COMPACT`, removal of stale versions (respecting legal retention).
Caches: result cache/warehouse cache, materialized views for hot dashboards.
Indexing in the DWH: clustering/segments (Snowflake clustering keys, BigQuery partition + cluster).
Cost: cold Bronze in object storage, hot Gold marts in the DWH; auto-suspend/auto-scale.
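With Delta Lake, for example, the compaction and cleanup pass could be scheduled as below; the table path and retention window are illustrative, and retention must honor legal holds.

```python
# Scheduled maintenance for a Delta table: compact + Z-order, then drop
# old versions. 720 hours = 30 days; adjust to legal retention requirements.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-maintenance").getOrCreate()

spark.sql("""
  OPTIMIZE delta.`s3://casino-lake-eu/gold/fact_bets/`
  ZORDER BY (player_id, game_id)
""")
spark.sql("""
  VACUUM delta.`s3://casino-lake-eu/gold/fact_bets/` RETAIN 720 HOURS
""")
```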
10) Lineage, catalogs, and documentation
Data catalog (OpenMetadata/Amundsen/Collibra): table descriptions, owners, SLAs, PII fields, access policies.
Lineage: from source (event/CDC) to mart and report; visibility of dependencies for safe changes.
Schema changelog: semver and a deprecation log; compatibility tests in CI pipelines.
11) Reconciliation
Daily:
- `wallet_entry` ↔ total balances (accumulated ≡ snapshot); payments: PSP/bank reports ↔ `fact_payments`; crypto: `txid/network` ↔ `fact_payments`.
- Categories: `match`, `timing`, `missing_source`, `missing_platform`, `amount_mismatch`.
- Alerts: share of `mismatch` above a threshold; aging of outstanding items beyond N days.
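The categorization itself is a small decision function; a plain-Python sketch with illustrative field names and a one-hour timing tolerance.

```python
# Classify a (PSP report row, platform row) pair into the categories above.
from typing import Optional

def categorize(psp_row: Optional[dict], platform_row: Optional[dict],
               timing_tolerance_s: int = 3600) -> str:
    if psp_row is None:
        return "missing_source"
    if platform_row is None:
        return "missing_platform"
    if psp_row["amount_minor"] != platform_row["amount_minor"]:
        return "amount_mismatch"
    if abs(psp_row["ts"] - platform_row["ts"]) > timing_tolerance_s:
        return "timing"
    return "match"

assert categorize({"amount_minor": 100, "ts": 0},
                  {"amount_minor": 100, "ts": 120}) == "match"
```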
12) Example SLA table
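For illustration, the targets from sections 5.1-5.2 collected in one table:

| Flow | Bronze | Silver | Gold | Late window |
|---|---|---|---|---|
| Wallet/ledger events | ≤ 1-2 min | ≤ 5-10 min | ≤ 15 min | 24-48 h |
| Bets/settlements | ≤ 1-2 min | ≤ 10 min | ≤ 30 min | 24-48 h |
| Payments (PSP webhooks) | ≤ 5 min | ≤ 15 min | ≤ 30-60 min | 72 h |
| Crypto | network-dependent | lag of N confirmations | lag of N confirmations | up to 24 h (reorgs) |
| Daily PSP/bank reports | T+1 by 09:00 local | - | - | - |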
13) Pipelines: what the stack is built from
Ingestion: Kafka Connect/Debezium, cloud ingestion services, SFTP pullers.
ETL/ELT: Spark/dbt/Trino/Beam/Flink (streaming Silver), Airflow/Argo for orchestration.
Quality: Great Expectations/Deequ/dbt tests.
Monitoring: OpenTelemetry + Lake/DWH metrics (freshness lag, job latency, cost).
Incidents and replay: reprocessing from Bronze, dedup by keys, versioned pipelines.
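As an orchestration illustration, a minimal Airflow 2.x DAG chaining the three layers; the DAG id, schedule, and job scripts are hypothetical.

```python
# Minimal Airflow DAG: Bronze -> Silver -> Gold every 15 minutes.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bets_lakehouse",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    bronze = BashOperator(task_id="ingest_bronze",
                          bash_command="spark-submit bronze_ingest.py")
    silver = BashOperator(task_id="merge_silver",
                          bash_command="spark-submit silver_merge.py")
    gold = BashOperator(task_id="build_gold",
                        bash_command="spark-submit gold_marts.py")
    bronze >> silver >> gold
```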
14) Checklists
Architecture and security
- Lakehouse format (Delta/Iceberg/Hudi) with ACID and time travel.
- `bronze/silver/gold` split, outbox/CDC as the main sources.
- PII isolation, tokenization, RLS by `tenant/brand/region`.
- Bucket/directory-level residency, keys/secrets per region.
- WORM audit of schema/policy/access rule changes.
Quality and SLA
- Data contracts and semver for schemas; compatibility tests.
- Watermarks and reprocessing, incremental MERGE for marts.
- Freshness dashboards and SLA alerts; owner for each table.
- Reconciliation by wallet/payments/crypto.
Performance and cost
- Partitioning and clustering; "small file" compaction.
- Materialized marts for key reports.
- Auto-scale/auto-suspend, retention policies and archives.
15) Red flags (anti-patterns)
BI and regulatory reports hit OLTP directly.
Bronze "rewrites" and loses raw data.
No watermarks; late events are silently dropped.
No deduplication by `idempotency_key`/`event_id` → duplicates in Gold.
PII and money from different regions kept together without RLS and residency.
Schemas change "quietly" (without semver/contracts), breaking downstream marts.
Millions of small uncompressed Parquet files → expensive requests.
No SLA/freshness dashboards; "surprises" in quarterly reporting.
16) Conclusion
Data Lake + DWH in iGaming is not just storage but a governed ecosystem: standardized schemas and contracts, an ACID lakehouse, clear freshness SLAs and lateness windows, data quality and lineage, PII security and residency. Add reconciliation plus partitioning/compaction savings, and you have a foundation for reporting, product decisions, and business scaling without nightly migrations and manual Excel.