How to build a banner A/B testing system
One "lucky" banner doesn't make the system. The A/B testing system is a pipeline: brief → production of options → control of impressions → collection of correct metrics → statistics → solution → archive → scaling. Below is the minimum set of processes and artifacts for tests to be reproducible and profitable.
1) Goals and metrics: what we optimize
Separate pre-click and post-click metrics - otherwise you will "tune" CTR at the cost of junk traffic.
Pre-click:
- Viewability.
- vCTR = clicks / viewable impressions (the main metric for the creative; a quick calculation follows this list).
- Frequency and Reach (to control "fatigue").
- Placement-mix (platforms/formats).
Post-click:
- Landing page CTR (first action), LPV/scroll depth, CVR on key events.
- Time to first action, bounce rate, lead/order quality.
- Down-funnel (if available): deposit/purchase/repeat.
Compliance guardrails:
- No promises of a "guaranteed result"; respect Responsible/Legal requirements.
- Neutral CTAs ("View Terms," "Open Demo"), disclaimers where needed.
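To make the ratios above concrete, here is a minimal Python sketch that computes vCTR and a post-click CVR; all counts are made-up, illustrative numbers:

```python
# Illustrative counts only; in practice these come from the experiment's data mart.
viewable_impressions = 120_000
clicks = 1_560
lp_sessions = 1_400
purchases = 58

vctr = clicks / viewable_impressions   # clicks per viewable impression
cvr = purchases / lp_sessions          # conversion to the key event on the landing page

print(f"vCTR = {vctr:.2%}, CVR = {cvr:.2%}")  # vCTR = 1.30%, CVR = 4.14%
```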
2) Experimental architecture: what the system consists of
1. Hypothesis rules (template): problem → idea → expected effect (MDE) → metrics → segments → risks.
2. Naming and versioning of files/codes:
2025-10_campaignX_geoUA_format-300x250_offer-A_cta-B_visual-C_v02.webp
3. Traffic routing table: placement → A/B group → impression share → exclusions.
4. Event schema (tracking plan): impression, viewable impression, click, pageview, cta_click, form_start, form_error, submit, purchase (a minimal sketch follows this list).
5. Storage and preparation layer: raw logs → normalization (de-duplication, anti-bot filters) → data marts.
6. Dashboards: pre-click, post-click, integral report on the experiment.
7. Decision archive: hypothesis → period → sample size → p-value/confidence interval → decision → rollout.
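As a minimal illustration of item 4, the tracking plan can live in code as a single source of truth that both banner tags and the ETL validate against. The field names below are assumptions, not a prescribed schema:

```python
# Canonical event names from the tracking plan; the required fields are illustrative assumptions.
EVENTS = {
    "impression", "viewable_impression", "click", "pageview",
    "cta_click", "form_start", "form_error", "submit", "purchase",
}

REQUIRED_FIELDS = ("ts", "uid", "campaign", "creative_id", "variant", "placement", "device")

def validate_event(event: dict) -> bool:
    """Reject events with unknown names or missing required fields before they reach the raw logs."""
    return event.get("name") in EVENTS and all(field in event for field in REQUIRED_FIELDS)

# Example: a well-formed cta_click event passes validation
print(validate_event({"name": "cta_click", "ts": 1730000000, "uid": "u1",
                      "campaign": "campaignX", "creative_id": "c01",
                      "variant": "A", "placement": "p1", "device": "mobile"}))
```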
3) A/B design: rules for clean causality
Change one factor at a time (offer, or visual, or CTA).
Randomize by user, not by impression (cookie/uid), so that one person does not see both variants within a session; see the hashing sketch after these rules.
Stratify (by placement/format/device) if those dimensions strongly affect vCTR.
Run the test over full weeks to cover day-of-week seasonality.
Fix the MDE (minimum detectable effect) before starting: for example, we want to detect a +8% lift in vCTR.
Stop condition: required statistical power reached AND duration ≥ N days. Do not peek and do not stop early.
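One way to implement user-level (rather than impression-level) randomization is deterministic hashing of the user id, so the same person always lands in the same variant. A minimal sketch, assuming a cookie/uid is available at serve time:

```python
import hashlib

def assign_variant(uid: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministic split: the same uid always gets the same variant for the
    whole experiment, regardless of session or impression."""
    digest = hashlib.sha256(f"{experiment}:{uid}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point <= cumulative:
            return variant
    return variant  # guard against floating-point rounding on the last bucket

# 50/50 split keyed by user, not by impression
print(assign_variant("user-123", "2025-10_campaignX_offer-test", {"A": 0.5, "B": 0.5}))
```

Salting the hash with the experiment name keeps assignments independent across experiments.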
4) Pain-free stats
Sample size and duration: the lower the baseline vCTR/CR and the smaller the MDE, the more traffic and the longer the test (see the sample-size sketch at the end of this section).
Decision metric: for creatives, usually vCTR; but escalate the final decision to CR/CPA when post-click data is available.
Always show confidence intervals in the report; avoid conclusions after 1-2 days.
Multiple comparisons: with more than 2 variants, apply a Bonferroni/FDR correction, or test in pairs.
Sequential tests / early stopping: apply boundaries (e.g., O'Brien-Fleming) if the tool supports them.
Bandits vs A/B: bandits suit automated exploitation of a winner against a stable target; for product insights, creative analytics and the archive, classic A/B is more transparent.
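As an example of the traffic/duration trade-off, a rough sample-size calculation for a two-proportion z-test can be done with the standard library alone. The baseline vCTR below is an assumed example value:

```python
from math import ceil
from statistics import NormalDist

def impressions_per_variant(base_vctr: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate viewable impressions needed per variant to detect a relative
    lift of `relative_mde` over `base_vctr` with a two-proportion z-test."""
    p1 = base_vctr
    p2 = base_vctr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return ceil(n)

# Assumed baseline vCTR of 1.2% and the +8% MDE from section 3
print(impressions_per_variant(0.012, 0.08))  # roughly 210,000 viewable impressions per variant
```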
5) Traffic quality control
Anti-bot filters: suspiciously high speed, clicks without viewability, abnormal user agent/IP.
Brand safety: site/keyword exclusions, negative placement list.
Geo/Device: Test in segments where you plan to scale.
Frequency capping: limit impressions per user (for example, 3-5 per day), otherwise "fatigue" will distort the result.
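A minimal sketch of how the anti-bot rules above might look as a post-hoc filter on a raw click log; the column names and thresholds are assumptions, and real platforms usually supply their own invalid-traffic flags:

```python
import pandas as pd

# Toy click log; in practice this is the raw layer before normalization.
clicks = pd.DataFrame({
    "uid": ["u1", "u2", "u2", "u3"],
    "had_viewable_impression": [True, False, False, True],
    "seconds_since_impression": [4.2, 0.1, 0.2, 7.5],
})

suspicious = (
    ~clicks["had_viewable_impression"]              # click without a viewable impression
    | (clicks["seconds_since_impression"] < 0.5)    # implausibly fast click
)
clean = clicks[~suspicious]
print(f"kept {len(clean)} of {len(clicks)} clicks")  # kept 2 of 4 clicks
```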
6) Rotation and "fatigue" of creatives
Fatigue threshold: a 30-40% drop in vCTR with stable viewability and reach is a signal to rotate (a minimal check is sketched below).
Rotation calendar: review vCTR/placement trends every week; keep a pool of 6-12 variations (offer × visual × CTA matrix).
Result decomposition: store factor attributes (offer, visual, cta, color, layout) so that winners' "recipes" can be assembled over time.
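The fatigue threshold can be checked automatically. A minimal sketch, assuming a per-creative daily table with date, viewable_impressions, clicks and viewability columns (names are illustrative):

```python
import pandas as pd

def is_fatigued(daily: pd.DataFrame, drop_threshold: float = 0.3) -> bool:
    """Flag fatigue: vCTR over the last 7 days fell by `drop_threshold` or more
    versus the first 7 days while viewability stayed roughly stable."""
    daily = daily.sort_values("date")
    first, last = daily.head(7), daily.tail(7)
    vctr_first = first["clicks"].sum() / first["viewable_impressions"].sum()
    vctr_last = last["clicks"].sum() / last["viewable_impressions"].sum()
    viewability_stable = abs(last["viewability"].mean() - first["viewability"].mean()) < 0.05
    return viewability_stable and vctr_last <= vctr_first * (1 - drop_threshold)
```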
7) End-to-end process
1. Planning (Monday): Hypothesis Committee (Marketing + Design + Analyst). We select 2-4 hypotheses for a week.
2. Production (1-3 days): design packages for all formats, QA checklist (CTA contrast, weight, safe-zone, compliance).
3. Start: distribution of traffic 50/50 (or 33/33/33); fixing segments, enabling logs.
4. Monitoring: daily sanity check (without making decisions): share of impressions, viewability, bot flags.
5. Analysis (end of the week / upon reaching power): report with confidence intervals, mobile/desktop subsamples, explanations.
6. Decision: the winner goes to production, the loser to the archive; the next hypothesis is formed from the insights.
7. Archive: experiment card + creative files + SQL query for the report + summary.
8) Data and dashboards: what to store and how to watch
Mini data mart model (by day/creative/segment):
date, campaign, geo, device, placement, format, creative_id, offer, visual, cta, variant,
impressions, viewable_impressions, clicks, vctr, lp_sessions, cta_clicks, form_start, submit, purchases, bounce_rate, avg_scroll, time_to_first_action
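A sketch of how raw events could be rolled up into this daily mart with pandas; it assumes one row per event carrying the tracking-plan fields and the event names from section 2:

```python
import pandas as pd

def build_daily_mart(events: pd.DataFrame) -> pd.DataFrame:
    """Count events per day/creative/segment and derive vCTR; other derived
    columns (CR, bounce_rate, ...) follow the same pattern."""
    keys = ["date", "campaign", "geo", "device", "placement", "format",
            "creative_id", "offer", "visual", "cta", "variant"]
    mart = (
        events.groupby(keys + ["name"]).size()
              .unstack("name", fill_value=0)
              .reset_index()
    )
    mart["vctr"] = mart["click"] / mart["viewable_impression"].where(
        mart["viewable_impression"] > 0)
    return mart
```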
Dashboards:
- Pre-click: viewability, vCTR, frequency, reach, placement cards.
- Post-click: CR by funnel step, lead quality/CPA.
- Experiments: confidence-interval ladder, time to effect, segment radar ("wind rose").
9) QA and launch checklist
- Formats: 300 × 250, 336 × 280, 300 × 600, 160 × 600, 728 × 90, 970 × 250; mobile 320 × 100/50, 1:1, 4:5, 16:9, 9:16
- Weight ≤ 150-200 KB (static/HTML5), WebP/PNG, without "heavy" GIFs
- CTA contrast (WCAG), safe zones (≥24 px from edge)
- No clickbait/promises, correct disclaimers
- Tracking: viewable, click, lpview, cta_click, form_start, submit
- Randomization by user, clear A/B impression split
- Anti-bot filters enabled, placement exclusions configured
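The weight and format items in this checklist are easy to automate. A minimal sketch that scans a folder of creatives; the path, limits and extension list are assumptions to adjust per campaign:

```python
from pathlib import Path

MAX_WEIGHT_KB = 200
ALLOWED_SUFFIXES = {".webp", ".png", ".html"}  # static + HTML5, no heavy GIFs

def qa_creative(path: Path) -> list[str]:
    """Return weight/format problems for one creative file; contrast, safe zones
    and disclaimers still need a human check."""
    problems = []
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"{path.name}: disallowed format {path.suffix}")
    if path.stat().st_size > MAX_WEIGHT_KB * 1024:
        problems.append(f"{path.name}: over {MAX_WEIGHT_KB} KB")
    return problems

# Usage (hypothetical folder): for f in Path("creatives/2025-10_campaignX").iterdir(): print(qa_creative(f))
```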
10) Hypothesis Library: What to Test
Offer:- "Transparent bonus terms" vs "All terms on one page"
- "Demo without registration" vs "View interface"
- "View Terms" vs "Learn Details"
- "Open Demo" vs "Try Now"
- Scene/hero vs screen interface vs iconography
- Warm background vs neutral; outline button vs fill
- Top-left logo vs compact; CTA right vs bottom
- Trust badge at CTA vs under headline
- Smooth fade-in PTC vs pulse CTA stroke (≤12 c, 2-3 phases)
11) Decision rules
Significance threshold: p ≤ 0.05 and/or the entire confidence interval > 0, benchmarked against the MDE (a minimal test sketch follows this list).
Common-sense boundary: if vCTR wins but CR/CPA has degraded, do not roll out.
Segment winners: if the difference is significant only on mobile / in one GEO, roll out to that segment only.
Ethics: we do not accept wins achieved through manipulative copy or clickbait.
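A minimal sketch of the significance check behind the first rule: a two-proportion z-test on vCTR plus a confidence interval for the difference. The impression counts are hypothetical, chosen so the rates echo the totals card in section 14:

```python
from math import sqrt
from statistics import NormalDist

def compare_vctr(clicks_a: int, imps_a: int, clicks_b: int, imps_b: int, alpha: float = 0.05):
    """Two-proportion z-test for vCTR (B vs A) and a CI for the difference."""
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    p_value = 2 * (1 - NormalDist().cdf(abs((p_b - p_a) / se_pool)))
    se_diff = sqrt(p_a * (1 - p_a) / imps_a + p_b * (1 - p_b) / imps_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p_b - p_a - z_crit * se_diff, p_b - p_a + z_crit * se_diff)
    return p_value, ci

# Hypothetical counts giving vCTR 1.22% vs 1.34%; roll out B only if p <= 0.05 and the CI excludes 0.
p, (lo, hi) = compare_vctr(clicks_a=1_464, imps_a=120_000, clicks_b=1_608, imps_b=120_000)
print(f"p = {p:.3f}, CI for the vCTR difference = [{lo:.4%}; {hi:.4%}]")
```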
12) Anti-patterns (which breaks the system)
Many factors in one test → no conclusions.
Decisions "on schedule for 2 days."
Mixing channels (different audiences) in one experiment.
No viewability measurement → meaningless vCTR.
No experiment archive → repeated mistakes and endless reinvention of the wheel.
Impression frequency not taken into account → fake wins driven by "first attention."
13) 30/60/90-implementation plan
0-30 days - system MVP
Hypothesis template, naming, QA checklist.
Diagram of events and dashboard pre/post-click.
1-2 experiments: offer and CTA in a key format (300 × 250/320 × 100).
Enable viewability and anti-bot filters.
31-60 days - deepening
Expand to all formats and top placements; add HTML5 variants.
Implement rotation regulations and "fatigue" thresholds.
Introduce stratification by device/site, segment-level rollouts of winners.
61-90 days - maturity
Archive of experiments and factor base (offer/visual/cta).
Automated brief form + semi-standardized layouts (creative design system).
Monthly report: ROI of tests, % of winners, contribution to CR/CPA.
Bandit pilot for automated exploitation of winners in stable segments.
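For the bandit pilot, one simple option is Beta-Bernoulli Thompson sampling over the same daily counters. A minimal sketch with illustrative numbers; it exploits the current winner while still exploring:

```python
import random

def thompson_pick(stats: dict[str, dict[str, int]]) -> str:
    """Sample a plausible vCTR for each variant from Beta(clicks + 1, views - clicks + 1)
    and serve the variant with the highest draw."""
    draws = {
        variant: random.betavariate(s["clicks"] + 1, s["views"] - s["clicks"] + 1)
        for variant, s in stats.items()
    }
    return max(draws, key=draws.get)

# Illustrative counters; in production they come from the daily mart.
stats = {
    "A": {"views": 50_000, "clicks": 610},
    "B": {"views": 50_000, "clicks": 670},
}
print(thompson_pick(stats))  # mostly "B", occasionally "A"
```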
14) Mini templates (ready to copy-paste)
Hypothesis template
Problem: low vCTR on mobile in GEO {X}
Idea: replace the visual with a screen-interface scene + CTA "Open demo"
MDE: +8% to vCTR
Metrics: vCTR (primary), CR (secondary), CPA (control)
Segments: mobile, formats 320×100 / 1:1
Risks: post-click drop; verify LP events
Totals card
A: vCTR 1.22% [1.15; 1.29], CR 4.1%
B: vCTR 1.34% [1.27; 1.41], CR 4.3%, CPA ↓ 6%
Decision: B won. Rollout: mobile GEO {X}, 100%
Comment: The effect is stronger on Y/Z placements
An A/B banner testing system is not about "button color" - it is a set of disciplines: correct metrics (viewability → vCTR → post-click), clean randomization, strict QA, traffic quality control, rotation rules and transparent decisions. Build a pipeline of hypotheses, maintain an archive and a factor base, and creative will stop being a lottery: you will steadily increase advertising effectiveness and reduce CPA in predictable steps.