A/B tests of scoring rules
Scoring is the heart of any gamification. Exactly how points are counted determines player behavior, the structure of participation, and the economics (ARPPU, bonus costs). Below is a practical recipe for validly testing a new points rule and making sure that metric growth is real, not an artifact.
1) What exactly we are testing
Examples of rules:
- By bet amount: 1 point for every €1 bet.
- By win/bet multiplier: points = ⌊multiplier × k⌋, with a cap per bet (see the SQL sketch at the end of this section).
- Hybrid: points for turnover + boost for "streaks" (N spins in a row), caps per minute/hour.
- Missions: fixed points for completing tasks (T1...Tn) of increasing complexity.
Hypothesis (example): "The multiplier + cap model will increase participation_net and completion rate without worsening Net ARPPU (after prizes/bonuses)."
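A minimal SQL sketch of the multiplier + cap rule, computed straight from the `bet` event defined in §9 (the parameters `:k` and `:cap` are illustrative, not prescribed values):

```sql
-- Multiplier + cap rule: points = floor(win/bet * k), capped per bet.
-- :k and :cap are illustrative parameters of the tested rule.
SELECT user_id,
       COALESCE(LEAST(FLOOR(win / NULLIF(bet, 0) * :k), :cap), 0) AS points
FROM bet
WHERE ts BETWEEN :start AND :end;
```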
2) Experimental unit and randomization
Unit: user (not session, not device).
Distribution: static hash (user_id → bucket) with fixed salts; splits of 50/50 or 33/33/33 for A/B/C (see the sketch below).
Stratification (recommended): payer-status (new paying/re-paying/non-paying), platform, geo.
Sticky assignment: the user sees the same rule for the whole test.
SRM check (Sample Ratio Mismatch): compare the actual group shares against the expected ones daily (chi-square). SRM signals traffic leaks, faulty filtering, or bugs.
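A minimal sketch of sticky bucketing, using a common Postgres md5-to-int idiom (the salt `scoring_v2` and the 50/50 split are illustrative assumptions):

```sql
-- Deterministic bucket: hash(user_id + salt) -> 0..99; stable across sessions.
-- The salt ':scoring_v2' and the 50/50 split are illustrative.
SELECT user_id,
       CASE WHEN bucket < 50 THEN 'A' ELSE 'B' END AS "group"
FROM (
  SELECT user_id,
         ABS(('x' || SUBSTR(MD5(user_id::text || ':scoring_v2'), 1, 8))::bit(32)::int) % 100 AS bucket
  FROM users
) t;
```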
3) Metrics and the "funnel of points"
Activity and participation
Reach: share who saw the event.
Participation_gross: joined/eligible.
Participation_net: started progress/eligible.
Completion: completed/started.
Quality and money
ΔDAU/WAU and stickiness (DAU/WAU).
Avg Bets per Session, Avg Bet Size.
ARPPU (net) = ARPPU − (Prize + Bonus Cost per payer).
Avg Deposit, Paying Share.
Net Uplift: (incremental revenue) − (prizes + bonuses + operating costs + fraud leakage).
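Worked example for Net ARPPU (numbers borrowed from the mini case in §15): gross ARPPU of €47.6 with Prize + Bonus ≈ €6.4 per payer gives Net ARPPU = 47.6 − 6.4 = €41.2.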
Guardrails
Complaints/support tickets per 1,000 users, KYC rejections, abnormal betting patterns, RG flags (limits, self-exclusion).
4) Duration, seasonality and novelty
Minimum 2 full business cycles (e.g., 2 weeks to capture weekends).
Account for the novelty effect: a spike in the first 48-72 hours. Log and analyze in phases (D0-D2, D3-D7, D8+), as in the sketch below.
Do not overlap with large promos, or plan equal promo "noise" across groups.
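A sketch of the phase breakdown over `points_awarded` (phase boundaries follow the D0-D2 / D3-D7 / D8+ split above):

```sql
-- Points per user by test phase, separating novelty (D0-D2) from steady state.
SELECT a."group",
       CASE WHEN p.ts < :start + INTERVAL '3 day' THEN 'D0-D2'
            WHEN p.ts < :start + INTERVAL '8 day' THEN 'D3-D7'
            ELSE 'D8+' END AS phase,
       SUM(p.amount)::float / COUNT(DISTINCT p.user_id) AS points_per_user
FROM points_awarded p
JOIN assignments a USING (user_id)
WHERE a.test_id = :test AND p.ts BETWEEN :start AND :end
GROUP BY a."group", phase;
```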
5) Power and sample size (worked example)
Goal: detect a difference Δ in the average "points per user" (or Net ARPPU).
The two-sample t-test formula (equal group sizes):
\[
n_{\text{per group}} = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\Delta^2}
\]
Example: we want to detect Δ = 5 points, σ = 120, α = 0.05 (two-sided), power 80% (β = 0.2).
\(z_{1-\alpha/2} = 1.96\), \(z_{1-\beta} = 0.84\) → sum 2.8 → squared 7.84.
\(\sigma^2 = 14{,}400\).
\(n = \frac{2 \times 7.84 \times 14{,}400}{25} \approx \frac{225{,}792}{25} \approx 9{,}032\) per group.
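The same arithmetic, checked directly in SQL:

```sql
-- n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
SELECT CEIL(2 * POWER(1.96 + 0.84, 2) * POWER(120, 2) / POWER(5, 2)) AS n_per_group;
-- returns 9032
```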
6) Reducing variance: making the test "cheaper"
CUPED: regression adjustment on pre-test covariates (e.g., points/bets over the previous week); see the SQL sketch in §10.
Covariates: payer flag, log-transformed turnover, activity, platform, geo.
Cluster standard errors at the user level (repeated sessions within a user).
7) Interference and spillovers
The points rule can affect more than just test participants:
- Social comparison (shared leaderboard) → spillover.
- Shared jackpots/joint missions → cross-group effects.
Mitigations:
- Separate leaderboards per group, or hidden normalization of points.
- Cluster randomization by traffic/geo clusters (more expensive but cleaner).
- Intention-to-treat (ITT) analysis plus sensitivity analyses.
8) Antifraud and rule caps
Any change to points rules invites optimization: micro-bets, bot farming, "points farms."
Minimum protections:
- Cap points per minute/hour/day and per single bet.
- Minimum bet variability (ban on "perfect" sequences).
- Detection of headless browsers, repeated fingerprints, proxies.
- Delayed verification of large prizes + KYC.
- Analytics: compare the distributions of points/bet and points/min across groups; look for heavy tails (see the sketch below).
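A sketch of the tail check over `points_awarded` (the 0.999 quantile threshold is an illustrative assumption):

```sql
-- Flag users whose points-per-minute rate sits in the extreme tail.
WITH rate AS (
  SELECT user_id,
         SUM(amount) / GREATEST(EXTRACT(EPOCH FROM (MAX(ts) - MIN(ts))) / 60, 1) AS points_per_min
  FROM points_awarded
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
)
SELECT user_id, points_per_min
FROM rate
WHERE points_per_min > (
  SELECT PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY points_per_min) FROM rate
);
```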
9) Events and data schema (minimum)
Events:
- `session_start {user_id, ts, platform}`
- `event_view {user_id, event_id, ts}`
- `event_join {user_id, event_id, ts}`
- `points_awarded {user_id, event_id, rule_id, amount, source, ts}`
- `mission_progress {user_id, mission_id, step, value, ts}`
- `mission_complete {user_id, mission_id, ts}`
- `bet {user_id, game_id, bet, win, ts}`
- `deposit {user_id, amount, ts}`
- `rules {rule_id, name, params, caps_minute, caps_hour, caps_day, version}`
- `assignments {user_id, test_id, group, assigned_at}`
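A minimal DDL sketch for the two tables the queries below rely on (column types are assumptions):

```sql
CREATE TABLE assignments (
  user_id     bigint      NOT NULL,
  test_id     text        NOT NULL,
  "group"     text        NOT NULL,       -- 'A' / 'B' / 'C'
  assigned_at timestamptz NOT NULL,
  PRIMARY KEY (user_id, test_id)          -- sticky: one group per user per test
);

CREATE TABLE points_awarded (
  user_id  bigint      NOT NULL,
  event_id bigint      NOT NULL,
  rule_id  bigint      NOT NULL,
  amount   numeric     NOT NULL,
  source   text        NOT NULL,          -- e.g., bet / mission / bonus
  ts       timestamptz NOT NULL
);
```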
10) SQL sketches for analysis
SRM check (group allocation):

```sql
SELECT "group", COUNT(*) AS users
FROM assignments
WHERE test_id = :test
GROUP BY "group";
-- then chi-square against the expected fractions
```
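The chi-square statistic itself can also be computed in SQL; this sketch assumes an expected 50/50 split:

```sql
WITH obs AS (
  SELECT "group", COUNT(*)::float AS n
  FROM assignments
  WHERE test_id = :test
  GROUP BY "group"
), tot AS (
  SELECT SUM(n) AS total FROM obs
)
SELECT SUM(POWER(o.n - t.total * 0.5, 2) / (t.total * 0.5)) AS chi_square
FROM obs o CROSS JOIN tot t;
-- compare with 3.84 (chi-square, df = 1, alpha = 0.05): higher means SRM
```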
Participation/Completion by group:

```sql
WITH eligible AS (
  SELECT user_id FROM users
  WHERE last_active_at >= :start - INTERVAL '14 day'
), joined AS (
  SELECT DISTINCT user_id FROM event_join
  WHERE event_id = :event AND ts BETWEEN :start AND :end
), started AS (
  SELECT DISTINCT user_id FROM mission_progress
  WHERE ts BETWEEN :start AND :end AND mission_id IN (:missions)
), completed AS (
  SELECT DISTINCT user_id FROM mission_complete
  WHERE ts BETWEEN :start AND :end AND mission_id IN (:missions)
)
SELECT a."group",
       COUNT(DISTINCT j.user_id)::float / COUNT(DISTINCT e.user_id) AS participation_gross,
       COUNT(DISTINCT s.user_id)::float / COUNT(DISTINCT e.user_id) AS participation_net,
       COUNT(DISTINCT c.user_id)::float / NULLIF(COUNT(DISTINCT s.user_id), 0) AS completion
FROM eligible e
JOIN assignments a USING (user_id)
LEFT JOIN joined j USING (user_id)
LEFT JOIN started s USING (user_id)
LEFT JOIN completed c USING (user_id)
WHERE a.test_id = :test
GROUP BY a."group";
```
Net ARPPU and the cost of prizes/bonuses:

```sql
WITH payors AS (
  SELECT DISTINCT user_id FROM payments
  WHERE ts BETWEEN :start AND :end
), rev AS (
  SELECT user_id, SUM(ggr) AS ggr
  FROM revenue
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
), costs AS (
  SELECT user_id, SUM(prize + bonus) AS cost
  FROM promo_costs
  WHERE ts BETWEEN :start AND :end
  GROUP BY user_id
)
SELECT a."group",
       AVG(COALESCE(r.ggr, 0) - COALESCE(c.cost, 0))
         FILTER (WHERE p.user_id IS NOT NULL) AS net_arppu
FROM assignments a
LEFT JOIN payors p USING (user_id)
LEFT JOIN rev r USING (user_id)
LEFT JOIN costs c USING (user_id)
WHERE a.test_id = :test
GROUP BY a."group";
```
CUPED (example):

```sql
-- x: prepared table (user_id, value, pre_value);
-- pre_value: points/revenue before the test; value: during the test
SELECT "group", AVG(value - theta * pre_value) AS cuped_mean
FROM (
  SELECT a."group", x.user_id, x.value, x.pre_value,
         (SELECT COVAR_SAMP(value, pre_value) / VAR_SAMP(pre_value) FROM x) AS theta
  FROM assignments a
  JOIN x ON x.user_id = a.user_id
  WHERE a.test_id = :test
) t
GROUP BY "group";
```
11) Subgroup effects and heterogeneity
Check for heterogeneous effects:
- Newcomers vs core, low-value vs high-value, different platforms/geo.
- Sometimes a new points formula "lights up" the mid-core without changing whale behavior: often the desired outcome.
- Pre-register segments to avoid p-hacking (see the sketch below).
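A sketch of the group × segment breakdown (the `user_segments {user_id, payer_status}` table is hypothetical):

```sql
-- Points per user by group x pre-registered segment.
SELECT a."group", s.payer_status,
       SUM(p.amount)::float / COUNT(DISTINCT p.user_id) AS points_per_user
FROM assignments a
JOIN user_segments s USING (user_id)
JOIN points_awarded p USING (user_id)
WHERE a.test_id = :test AND p.ts BETWEEN :start AND :end
GROUP BY a."group", s.payer_status;
```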
12) Common traps
1. A shared leaderboard for all groups → interference.
2. Changing the prize structure mid-test → incomparability.
3. Micro-bet farming of points → invalid uplift.
4. SRM and drifting filters in ETL → biased estimates.
5. Relying on "dirty" ARPPU without deducting prizes/bonuses.
6. Early stopping on fluctuations without proper sequential statistics.
13) Bayesian vs frequentist and sequential decisions
Framework: a Bayesian approach works well (posterior difference in metrics, probability that B beats A), especially when monitoring over time.
Caution: bandits over points rules are appropriate after a confirmed uplift, at the operations stage, not during initial validation.
14) Responsible play and compliance
Transparent rules and caps: players must understand how they earn points.
Activity and deposit limits, "pauses" and RG prompts.
No hidden "penalties" for the style of play.
15) Mini Case (Synthetic)
Context: weekly event, A = "points for €1 bet," B = "points by win/bet multiplier, cap = 50/bet."
Size: 2 × 10,000 users, stratified by payer status. SRM check passed.
Results:
- Participation_net: A 17.3% → B 22.1% (+4.8 pp).
- Completion: A 38.9% → B 44.0% (+5.1 pp).
- Net ARPPU: A €41.2 → B €43.5 (+€2.3), with Prize + Bonus per payer ≈ €6.4 (unchanged).
- Complaints/1k: unchanged; fraud flags ↓0.3 pp thanks to caps.
- Conclusion: rule B wins; we scale it with a "long tail" of prizes and keep the caps.
16) Points A/B Launch Checklist
- Unit = user, sticky-assignment, stratification.
- Separate leaderboards/normalization to remove interference.
- Clear caps on points, anti-bot signals, KYC for major winners.
- Pre-registration of hypotheses and metrics (primary/secondary/guardrails).
- Power and duration plan, with seasonality accounted for.
- CUPED/covariates wired in, SRM alerts in the pipeline.
- Dashboard "Reach → Participation → Progress → Completion → Value".
- Report: monetary uplift net of prizes/bonuses, post-test tail effect.
A scoring rule is a behavioral lever. A correctly designed A/B test (no SRM, with antifraud protections and covariates) lets you safely grow participation, completion, and Net ARPPU while preserving player trust and campaign economics.