A/B tests of scoring rules
Scoring is the heart of any gamification. Exactly how points are counted determines player behavior, the structure of participation, and the economics (ARPPU, bonus costs). Below is a practical recipe for validly testing a new points rule and making sure that metric growth is real, not an artifact.
1) What exactly we are testing
Examples of rules:
- By bet amount: 1 point for every €1 bet.
- By win/bet multiplier: points = ⌊multiplier × k⌋, with a cap per bet (see the SQL sketch at the end of this section).
- Hybrid: points for turnover + boost for "streaks" (N spins in a row), caps per minute/hour.
- Missions: fixed points for completing tasks (T1...Tn) of increasing complexity.
Hypothesis (example): "The multiplier + cap model will increase participation_net and completion rate without worsening Net ARPPU (after prizes/bonuses)."
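A minimal SQL sketch of the multiplier + cap rule, computed straight from the `bet` event defined in §9 (the parameters `:k` and `:cap` are illustrative, not prescribed values):

```sql
-- Multiplier + cap rule: points = floor(win/bet * k), capped per bet.
-- :k and :cap are illustrative parameters of the tested rule.
SELECT user_id,
       COALESCE(LEAST(FLOOR(win / NULLIF(bet, 0) * :k), :cap), 0) AS points
FROM bet
WHERE ts BETWEEN :start AND :end;
```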
2) Experimental unit and randomization
Unit: user (not session, not device).
Distribution: static hash (user_id → bucket) with fixed salts; splits of 50/50 or 33/33/33 for A/B/C (see the sketch below).
Stratification (recommended): payer-status (new paying/re-paying/non-paying), platform, geo.
Sticky assignment: the user sees the same rule for the whole test.
SRM check (Sample Ratio Mismatch): compare the actual group shares against the expected ones daily (chi-square). SRM signals traffic leaks, faulty filtering, or bugs.
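A minimal sketch of sticky bucketing, using a common Postgres md5-to-int idiom (the salt `scoring_v2` and the 50/50 split are illustrative assumptions):

```sql
-- Deterministic bucket: hash(user_id + salt) -> 0..99; stable across sessions.
-- The salt ':scoring_v2' and the 50/50 split are illustrative.
SELECT user_id,
       CASE WHEN bucket < 50 THEN 'A' ELSE 'B' END AS "group"
FROM (
  SELECT user_id,
         ABS(('x' || SUBSTR(MD5(user_id::text || ':scoring_v2'), 1, 8))::bit(32)::int) % 100 AS bucket
  FROM users
) t;
```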
3) Metrics and the "funnel of points"
Activity and participation
Reach: share who saw the event.
Participation_gross: joined/eligible.
Participation_net: started progress/eligible.
Completion: completed/started.
Quality and money
ΔDAU/WAU and stickiness (DAU/WAU).
Avg Bets per Session, Avg Bet Size.
ARPPU (net) = ARPPU − (Prize + Bonus Cost per payer).
Avg Deposit, Paying Share.
Net Uplift: (incremental revenue) − (prizes + bonuses + operating costs + fraud leakage).
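Worked example for Net ARPPU (numbers borrowed from the mini case in §15): gross ARPPU of €47.6 with Prize + Bonus ≈ €6.4 per payer gives Net ARPPU = 47.6 − 6.4 = €41.2.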
Guardrails
Complaints/support tickets per 1,000 users, KYC rejections, abnormal betting patterns, RG flags (limits, self-exclusion).
4) Duration, seasonality and novelty
Minimum 2 full business cycles (e.g., 2 weeks to capture weekends).
Account for the novelty effect: a spike in the first 48-72 hours. Log and analyze in phases (D0-D2, D3-D7, D8+), as in the sketch below.
Do not overlap with large promos, or plan equal promo "noise" across groups.
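A sketch of the phase breakdown over `points_awarded` (phase boundaries follow the D0-D2 / D3-D7 / D8+ split above):

```sql
-- Points per user by test phase, separating novelty (D0-D2) from steady state.
SELECT a."group",
       CASE WHEN p.ts < :start + INTERVAL '3 day' THEN 'D0-D2'
            WHEN p.ts < :start + INTERVAL '8 day' THEN 'D3-D7'
            ELSE 'D8+' END AS phase,
       SUM(p.amount)::float / COUNT(DISTINCT p.user_id) AS points_per_user
FROM points_awarded p
JOIN assignments a USING (user_id)
WHERE a.test_id = :test AND p.ts BETWEEN :start AND :end
GROUP BY a."group", phase;
```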
5) Power and sample size (worked example)
Goal: detect a difference Δ in the average "points per user" (or Net ARPPU).
The two-sample t-test formula (equal group sizes):
\[
n_{\text{per group}} = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\Delta^2}
\]
Example: we want to detect Δ = 5 points, σ = 120, α = 0.05 (two-sided), power 80% (β = 0.2).
\(z_{1-\alpha/2} = 1.96\), \(z_{1-\beta} = 0.84\) → sum 2.8 → squared 7.84.
\(\sigma^2 = 14{,}400\).
\(n = \frac{2 \times 7.84 \times 14{,}400}{25} \approx \frac{225{,}792}{25} \approx 9{,}032\) per group.
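The same arithmetic, checked directly in SQL:

```sql
-- n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
SELECT CEIL(2 * POWER(1.96 + 0.84, 2) * POWER(120, 2) / POWER(5, 2)) AS n_per_group;
-- returns 9032
```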
6) Reducing variance: making the test "cheaper"
CUPED: regression adjustment on pre-test covariates (e.g., points/bets over the previous week); see the SQL sketch in §10.
Covariates: payer flag, log-transformed turnover, activity, platform, geo.
Cluster standard errors at the user level (repeated sessions within a user).
7) Interference and spillovers
The points rule can affect more than just test participants:
- Social comparison (shared leaderboard) → spillover.
- Shared jackpots/joint missions → cross-group effects.
Mitigations:
- Separate leaderboards per group, or hidden normalization of points.
- Cluster randomization by traffic/geo clusters (more expensive but cleaner).
- Intention-to-treat (ITT) analysis plus sensitivity analyses.
8) Antifraud and rule caps
Any change to points rules invites optimization: micro-bets, bot farming, "points farms."
Minimum protections:
- Cap points per minute/hour/day and per single bet.
- Minimum bet variability (ban on "perfect" sequences).
- Detection of headless browsers, repeated fingerprints, proxies.
- Delayed verification of large prizes + KYC.
- Analytics: compare the distributions of points/bet and points/min across groups; look for heavy tails (see the sketch below).
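A sketch of the tail check over `points_awarded` (the 0.999 quantile threshold is an illustrative assumption):

```sql
-- Flag users whose points-per-minute rate sits in the extreme tail.
WITH rate AS (
  SELECT user_id,
         SUM(amount) / GREATEST(EXTRACT(EPOCH FROM (MAX(ts) - MIN(ts))) / 60, 1) AS points_per_min
  FROM points_awarded
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
)
SELECT user_id, points_per_min
FROM rate
WHERE points_per_min > (
  SELECT PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY points_per_min) FROM rate
);
```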
9) Events and data schema (minimum)
Events:
- `session_start {user_id, ts, platform}`
- `event_view {user_id, event_id, ts}`
- `event_join {user_id, event_id, ts}`
- `points_awarded {user_id, event_id, rule_id, amount, source, ts}`
- `mission_progress {user_id, mission_id, step, value, ts}`
- `mission_complete {user_id, mission_id, ts}`
- `bet {user_id, game_id, bet, win, ts}`
- `deposit {user_id, amount, ts}`
- `rules {rule_id, name, params, caps_minute, caps_hour, caps_day, version}`
- `assignments {user_id, test_id, group, assigned_at}`
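A minimal DDL sketch for the two tables the queries below rely on (column types are assumptions):

```sql
CREATE TABLE assignments (
  user_id     bigint      NOT NULL,
  test_id     text        NOT NULL,
  "group"     text        NOT NULL,       -- 'A' / 'B' / 'C'
  assigned_at timestamptz NOT NULL,
  PRIMARY KEY (user_id, test_id)          -- sticky: one group per user per test
);

CREATE TABLE points_awarded (
  user_id  bigint      NOT NULL,
  event_id bigint      NOT NULL,
  rule_id  bigint      NOT NULL,
  amount   numeric     NOT NULL,
  source   text        NOT NULL,          -- e.g., bet / mission / bonus
  ts       timestamptz NOT NULL
);
```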
10) SQL sketches for analysis
SRM check (group allocation):

```sql
SELECT "group", COUNT(*) AS users
FROM assignments
WHERE test_id = :test
GROUP BY "group";
-- then chi-square against the expected fractions
```
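The chi-square statistic itself can also be computed in SQL; this sketch assumes an expected 50/50 split:

```sql
WITH obs AS (
  SELECT "group", COUNT(*)::float AS n
  FROM assignments
  WHERE test_id = :test
  GROUP BY "group"
), tot AS (
  SELECT SUM(n) AS total FROM obs
)
SELECT SUM(POWER(o.n - t.total * 0.5, 2) / (t.total * 0.5)) AS chi_square
FROM obs o CROSS JOIN tot t;
-- compare with 3.84 (chi-square, df = 1, alpha = 0.05): higher means SRM
```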
Participation/Completion by group:

```sql
WITH eligible AS (
  SELECT user_id FROM users
  WHERE last_active_at >= :start - INTERVAL '14 day'
), joined AS (
  SELECT DISTINCT user_id FROM event_join
  WHERE event_id = :event AND ts BETWEEN :start AND :end
), started AS (
  SELECT DISTINCT user_id FROM mission_progress
  WHERE ts BETWEEN :start AND :end AND mission_id IN (:missions)
), completed AS (
  SELECT DISTINCT user_id FROM mission_complete
  WHERE ts BETWEEN :start AND :end AND mission_id IN (:missions)
)
SELECT a."group",
       COUNT(DISTINCT j.user_id)::float / COUNT(DISTINCT e.user_id) AS participation_gross,
       COUNT(DISTINCT s.user_id)::float / COUNT(DISTINCT e.user_id) AS participation_net,
       COUNT(DISTINCT c.user_id)::float / NULLIF(COUNT(DISTINCT s.user_id), 0) AS completion
FROM eligible e
JOIN assignments a USING (user_id)
LEFT JOIN joined j USING (user_id)
LEFT JOIN started s USING (user_id)
LEFT JOIN completed c USING (user_id)
WHERE a.test_id = :test
GROUP BY a."group";
```
Net ARPPU and the cost of prizes/bonuses:

```sql
WITH payors AS (
  SELECT DISTINCT user_id FROM payments
  WHERE ts BETWEEN :start AND :end
), rev AS (
  SELECT user_id, SUM(ggr) AS ggr
  FROM revenue
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
), costs AS (
  SELECT user_id, SUM(prize + bonus) AS cost
  FROM promo_costs
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
)
SELECT a."group",
       AVG(COALESCE(r.ggr, 0) - COALESCE(c.cost, 0))
         FILTER (WHERE p.user_id IS NOT NULL) AS net_arppu
FROM assignments a
LEFT JOIN payors p USING (user_id)
LEFT JOIN rev r USING (user_id)
LEFT JOIN costs c USING (user_id)
WHERE a.test_id = :test
GROUP BY a."group";
```
CUPED (example):

```sql
-- x: prepared table (user_id, value, pre_value);
-- pre_value: points/revenue before the test; value: during the test
SELECT "group", AVG(value - theta * pre_value) AS cuped_mean
FROM (
  SELECT a."group", x.user_id, x.value, x.pre_value,
         (SELECT COVAR_SAMP(value, pre_value) / VAR_SAMP(pre_value) FROM x) AS theta
  FROM assignments a
  JOIN x ON x.user_id = a.user_id
  WHERE a.test_id = :test
) t
GROUP BY "group";
```
11) Subgroup effects and heterogeneity
Check for heterogeneous effects:
- Newcomers vs core, low-value vs high-value, different platforms/geo.
- Sometimes a new points formula "lights up" the mid-core without changing whale behavior: often the desired outcome.
- Pre-register segments to avoid p-hacking (see the sketch below).
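A sketch of the group × segment breakdown (the `user_segments {user_id, payer_status}` table is hypothetical):

```sql
-- Points per user by group x pre-registered segment.
SELECT a."group", s.payer_status,
       SUM(p.amount)::float / COUNT(DISTINCT p.user_id) AS points_per_user
FROM assignments a
JOIN user_segments s USING (user_id)
JOIN points_awarded p USING (user_id)
WHERE a.test_id = :test AND p.ts BETWEEN :start AND :end
GROUP BY a."group", s.payer_status;
```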
12) Common traps
1. A shared leaderboard for all groups → interference.
2. Changing the prize structure mid-test → incomparability.
3. Micro-bet farming of points → invalid uplift.
4. SRM and drifting filters in ETL → biased estimates.
5. Relying on "dirty" ARPPU without deducting prizes/bonuses.
6. Early stopping on fluctuations without proper sequential statistics.
13) Bayesian vs frequentist and sequential decisions
Framework: a Bayesian approach works well (posterior difference in metrics, probability that B beats A), especially when monitoring over time.
Caution: bandits over points rules are appropriate after a confirmed uplift, at the operations stage, not during initial validation.
14) Responsible play and compliance
Transparent rules and caps: players must understand how they earn points.
Activity and deposit limits, "pauses" and RG prompts.
No hidden "penalties" for the style of play.
15) Mini Case (Synthetic)
Context: weekly event, A = "points for €1 bet," B = "points by win/bet multiplier, cap = 50/bet."
Size: 2 × 10,000 users, stratified by payer status. SRM check passed.
Results:
- Participation_net: A 17.3% → B 22.1% (+4.8 pp).
- Completion: A 38.9% → B 44.0% (+5.1 pp).
- Net ARPPU: A €41.2 → B €43.5 (+€2.3), with Prize + Bonus per payer ≈ €6.4 (unchanged).
- Complaints/1k: unchanged; fraud flags ↓0.3 pp thanks to caps.
- Conclusion: rule B wins; we scale it with a "long tail" of prizes and keep the caps.
16) Points A/B Launch Checklist
- Unit = user, sticky-assignment, stratification.
- Separate leaderboards/normalization to remove interference.
- Clear caps on points, anti-bot signals, KYC for major winners.
- Pre-registration of hypotheses and metrics (primary/secondary/guardrails).
- Power and duration plan, with seasonality accounted for.
- CUPED/covariates wired in, SRM alerts in the pipeline.
- Dashboard "Reach → Participation → Progress → Completion → Value".
- Report: monetary uplift net of prizes/bonuses, post-test tail effect.
A scoring rule is a behavioral lever. A correctly designed A/B test (no SRM, with antifraud protections and covariates) lets you safely grow participation, completion, and Net ARPPU while preserving player trust and campaign economics.