How to build a banner A/B testing system
One "lucky" banner doesn't make the system. The A/B testing system is a pipeline: brief → production of options → control of impressions → collection of correct metrics → statistics → solution → archive → scaling. Below is the minimum set of processes and artifacts for tests to be reproducible and profitable.
1) Goals and metrics: what we optimize
Separate pre-click and post-click metrics - otherwise you will "tune" CTR at the cost of junk traffic.
Pre-click:
- Viewability.
- vCTR = clicks / viewable impressions (the main metric for the creative; a quick calculation follows this list).
- Frequency and Reach (to control "fatigue").
- Placement-mix (platforms/formats).
Post-click:
- Landing page CTR (first action), LPV/scroll depth, CVR on key events.
- Time to first action, bounce rate, lead/order quality.
- Down-funnel (if available): deposit/purchase/repeat.
Compliance guardrails:
- No promises of a "guaranteed result"; respect Responsible/Legal requirements.
- Neutral CTAs ("View Terms," "Open Demo"), disclaimers where needed.
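To make the ratios above concrete, here is a minimal Python sketch that computes vCTR and a post-click CVR; all counts are made-up, illustrative numbers:

```python
# Illustrative counts only; in practice these come from the experiment's data mart.
viewable_impressions = 120_000
clicks = 1_560
lp_sessions = 1_400
purchases = 58

vctr = clicks / viewable_impressions   # clicks per viewable impression
cvr = purchases / lp_sessions          # conversion to the key event on the landing page

print(f"vCTR = {vctr:.2%}, CVR = {cvr:.2%}")  # vCTR = 1.30%, CVR = 4.14%
```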
2) Experimental architecture: what the system consists of
1. Hypothesis rules (template): problem → idea → expected effect (MDE) → metrics → segments → risks.
2. Naming and versioning of files/codes:
2025-10_campaignX_geoUA_format-300x250_offer-A_cta-B_visual-C_v02.webp
3. Traffic routing table: placement → A/B group → impression share → exclusions.
4. Event schema (tracking plan): impression, viewable impression, click, pageview, cta_click, form_start, form_error, submit, purchase (a minimal sketch follows this list).
5. Storage and preparation layer: raw logs → normalization (de-duplication, anti-bot filters) → data marts.
6. Dashboards: pre-click, post-click, integral report on the experiment.
7. Decision archive: hypothesis → period → sample size → p-value/confidence interval → decision → rollout.
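As a minimal illustration of item 4, the tracking plan can live in code as a single source of truth that both banner tags and the ETL validate against. The field names below are assumptions, not a prescribed schema:

```python
# Canonical event names from the tracking plan; the required fields are illustrative assumptions.
EVENTS = {
    "impression", "viewable_impression", "click", "pageview",
    "cta_click", "form_start", "form_error", "submit", "purchase",
}

REQUIRED_FIELDS = ("ts", "uid", "campaign", "creative_id", "variant", "placement", "device")

def validate_event(event: dict) -> bool:
    """Reject events with unknown names or missing required fields before they reach the raw logs."""
    return event.get("name") in EVENTS and all(field in event for field in REQUIRED_FIELDS)

# Example: a well-formed cta_click event passes validation
print(validate_event({"name": "cta_click", "ts": 1730000000, "uid": "u1",
                      "campaign": "campaignX", "creative_id": "c01",
                      "variant": "A", "placement": "p1", "device": "mobile"}))
```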
3) A/B design: rules for clean causality
Change one factor at a time (offer, or visual, or CTA).
Randomize by user, not by impression (cookie/uid), so that one person does not see both variants within a session; see the hashing sketch after these rules.
Stratify (by placement/format/device) if those dimensions strongly affect vCTR.
Run the test over full weeks to cover day-of-week seasonality.
Fix the MDE (minimum detectable effect) before starting: for example, we want to detect a +8% lift in vCTR.
Stop condition: required statistical power reached AND duration ≥ N days. Do not peek and do not stop early.
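One way to implement user-level (rather than impression-level) randomization is deterministic hashing of the user id, so the same person always lands in the same variant. A minimal sketch, assuming a cookie/uid is available at serve time:

```python
import hashlib

def assign_variant(uid: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministic split: the same uid always gets the same variant for the
    whole experiment, regardless of session or impression."""
    digest = hashlib.sha256(f"{experiment}:{uid}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point <= cumulative:
            return variant
    return variant  # guard against floating-point rounding on the last bucket

# 50/50 split keyed by user, not by impression
print(assign_variant("user-123", "2025-10_campaignX_offer-test", {"A": 0.5, "B": 0.5}))
```

Salting the hash with the experiment name keeps assignments independent across experiments.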
4) Pain-free stats
Sample size and duration: the lower the baseline vCTR/CR and the smaller the MDE, the more traffic and the longer the test (see the sample-size sketch at the end of this section).
Decision metric: for creatives, usually vCTR; but escalate the final decision to CR/CPA when post-click data is available.
Always show confidence intervals in the report; avoid conclusions after 1-2 days.
Multiple comparisons: with more than 2 variants, apply a Bonferroni/FDR correction, or test in pairs.
Sequential tests / early stopping: apply boundaries (e.g., O'Brien-Fleming) if the tool supports them.
Bandits vs A/B: bandits suit automated exploitation of a winner against a stable target; for product insights, creative analytics and the archive, classic A/B is more transparent.
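As an example of the traffic/duration trade-off, a rough sample-size calculation for a two-proportion z-test can be done with the standard library alone. The baseline vCTR below is an assumed example value:

```python
from math import ceil
from statistics import NormalDist

def impressions_per_variant(base_vctr: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate viewable impressions needed per variant to detect a relative
    lift of `relative_mde` over `base_vctr` with a two-proportion z-test."""
    p1 = base_vctr
    p2 = base_vctr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return ceil(n)

# Assumed baseline vCTR of 1.2% and the +8% MDE from section 3
print(impressions_per_variant(0.012, 0.08))  # roughly 210,000 viewable impressions per variant
```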
5) Traffic quality control
Anti-bot filters: suspiciously high speed, clicks without viewability, abnormal user agent/IP.
Brand safety: site/keyword exclusions, negative placement list.
Geo/Device: Test in segments where you plan to scale.
Frequency capping: limit impressions per user (for example, 3-5 per day), otherwise "fatigue" will distort the result.
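A minimal sketch of how the anti-bot rules above might look as a post-hoc filter on a raw click log; the column names and thresholds are assumptions, and real platforms usually supply their own invalid-traffic flags:

```python
import pandas as pd

# Toy click log; in practice this is the raw layer before normalization.
clicks = pd.DataFrame({
    "uid": ["u1", "u2", "u2", "u3"],
    "had_viewable_impression": [True, False, False, True],
    "seconds_since_impression": [4.2, 0.1, 0.2, 7.5],
})

suspicious = (
    ~clicks["had_viewable_impression"]              # click without a viewable impression
    | (clicks["seconds_since_impression"] < 0.5)    # implausibly fast click
)
clean = clicks[~suspicious]
print(f"kept {len(clean)} of {len(clicks)} clicks")  # kept 2 of 4 clicks
```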
6) Rotation and "fatigue" of creatives
Fatigue threshold: a 30-40% drop in vCTR with stable viewability and reach is a signal to rotate (a minimal check is sketched below).
Rotation calendar: review vCTR/placement trends every week; keep a pool of 6-12 variations (offer × visual × CTA matrix).
Result decomposition: store factor attributes (offer, visual, cta, color, layout) so that winners' "recipes" can be assembled over time.
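The fatigue threshold can be checked automatically. A minimal sketch, assuming a per-creative daily table with date, viewable_impressions, clicks and viewability columns (names are illustrative):

```python
import pandas as pd

def is_fatigued(daily: pd.DataFrame, drop_threshold: float = 0.3) -> bool:
    """Flag fatigue: vCTR over the last 7 days fell by `drop_threshold` or more
    versus the first 7 days while viewability stayed roughly stable."""
    daily = daily.sort_values("date")
    first, last = daily.head(7), daily.tail(7)
    vctr_first = first["clicks"].sum() / first["viewable_impressions"].sum()
    vctr_last = last["clicks"].sum() / last["viewable_impressions"].sum()
    viewability_stable = abs(last["viewability"].mean() - first["viewability"].mean()) < 0.05
    return viewability_stable and vctr_last <= vctr_first * (1 - drop_threshold)
```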
7) End-to-end process
1. Planning (Monday): Hypothesis Committee (Marketing + Design + Analyst). We select 2-4 hypotheses for a week.
2. Production (1-3 days): design packages for all formats, QA checklist (CTA contrast, weight, safe-zone, compliance).
3. Start: distribution of traffic 50/50 (or 33/33/33); fixing segments, enabling logs.
4. Monitoring: daily sanity check (without making decisions): share of impressions, viewability, bot flags.
5. Analysis (end of the week / upon reaching power): report with confidence intervals, mobile/desktop subsamples, explanations.
6. Decision: the winner goes to production, the loser to the archive; the next hypothesis is formed from the insights.
7. Archive: experiment card + creative files + SQL query for the report + summary.
8) Data and dashboards: what to store and how to watch
Mini data mart model (by day/creative/segment):
date, campaign, geo, device, placement, format, creative_id, offer, visual, cta, variant,
impressions, viewable_impressions, clicks, vctr, lp_sessions, cta_clicks, form_start, submit, purchases, bounce_rate, avg_scroll, time_to_first_action
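A sketch of how raw events could be rolled up into this daily mart with pandas; it assumes one row per event carrying the tracking-plan fields and the event names from section 2:

```python
import pandas as pd

def build_daily_mart(events: pd.DataFrame) -> pd.DataFrame:
    """Count events per day/creative/segment and derive vCTR; other derived
    columns (CR, bounce_rate, ...) follow the same pattern."""
    keys = ["date", "campaign", "geo", "device", "placement", "format",
            "creative_id", "offer", "visual", "cta", "variant"]
    mart = (
        events.groupby(keys + ["name"]).size()
              .unstack("name", fill_value=0)
              .reset_index()
    )
    mart["vctr"] = mart["click"] / mart["viewable_impression"].where(
        mart["viewable_impression"] > 0)
    return mart
```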
Dashboards:
- Pre-click: viewability, vCTR, frequency, reach, placement cards.
- Post-click: CR by funnel step, lead quality/CPA.
- Experiments: confidence-interval ladder, time to effect, segment radar ("wind rose").
9) QA and launch checklist
- Formats: 300 × 250, 336 × 280, 300 × 600, 160 × 600, 728 × 90, 970 × 250; mobile 320 × 100/50, 1:1, 4:5, 16:9, 9:16
- Weight ≤ 150-200 KB (static/HTML5), WebP/PNG, without "heavy" GIFs
- CTA contrast (WCAG), safe zones (≥24 px from edge)
- No clickbait/promises, correct disclaimers
- Tracking: viewable, click, lpview, cta_click, form_start, submit
- Randomization by user, clear A/B impression split
- Anti-bot filters enabled, placement exclusions configured
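The weight and format items in this checklist are easy to automate. A minimal sketch that scans a folder of creatives; the path, limits and extension list are assumptions to adjust per campaign:

```python
from pathlib import Path

MAX_WEIGHT_KB = 200
ALLOWED_SUFFIXES = {".webp", ".png", ".html"}  # static + HTML5, no heavy GIFs

def qa_creative(path: Path) -> list[str]:
    """Return weight/format problems for one creative file; contrast, safe zones
    and disclaimers still need a human check."""
    problems = []
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"{path.name}: disallowed format {path.suffix}")
    if path.stat().st_size > MAX_WEIGHT_KB * 1024:
        problems.append(f"{path.name}: over {MAX_WEIGHT_KB} KB")
    return problems

# Usage (hypothetical folder): for f in Path("creatives/2025-10_campaignX").iterdir(): print(qa_creative(f))
```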
10) Hypothesis Library: What to Test
Offer:- "Transparent bonus terms" vs "All terms on one page"
- "Demo without registration" vs "View interface"
- "View Terms" vs "Learn Details"
- "Open Demo" vs "Try Now"
- Scene/hero vs screen interface vs iconography
- Warm background vs neutral; outline button vs fill
- Top-left logo vs compact; CTA right vs bottom
- Trust badge at CTA vs under headline
- Smooth fade-in PTC vs pulse CTA stroke (≤12 c, 2-3 phases)
11) Decision rules
Significance threshold: p ≤ 0.05 and/or the entire confidence interval > 0, benchmarked against the MDE (a minimal test sketch follows this list).
Common-sense boundary: if vCTR wins but CR/CPA has degraded, do not roll out.
Segment winners: if the difference is significant only on mobile / in one GEO, roll out to that segment only.
Ethics: we do not accept wins achieved through manipulative copy or clickbait.
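A minimal sketch of the significance check behind the first rule: a two-proportion z-test on vCTR plus a confidence interval for the difference. The impression counts are hypothetical, chosen so the rates echo the totals card in section 14:

```python
from math import sqrt
from statistics import NormalDist

def compare_vctr(clicks_a: int, imps_a: int, clicks_b: int, imps_b: int, alpha: float = 0.05):
    """Two-proportion z-test for vCTR (B vs A) and a CI for the difference."""
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    p_value = 2 * (1 - NormalDist().cdf(abs((p_b - p_a) / se_pool)))
    se_diff = sqrt(p_a * (1 - p_a) / imps_a + p_b * (1 - p_b) / imps_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p_b - p_a - z_crit * se_diff, p_b - p_a + z_crit * se_diff)
    return p_value, ci

# Hypothetical counts giving vCTR 1.22% vs 1.34%; roll out B only if p <= 0.05 and the CI excludes 0.
p, (lo, hi) = compare_vctr(clicks_a=1_464, imps_a=120_000, clicks_b=1_608, imps_b=120_000)
print(f"p = {p:.3f}, CI for the vCTR difference = [{lo:.4%}; {hi:.4%}]")
```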
12) Anti-patterns (which breaks the system)
Many factors in one test → no conclusions.
Decisions "on schedule for 2 days."
Mixing channels (different audiences) in one experiment.
No viewability measurement → meaningless vCTR.
No experiment archive → repeated mistakes and endless reinvention of the wheel.
Impression frequency not taken into account → fake wins driven by "first attention."
13) 30/60/90-implementation plan
0-30 days - system MVP
Hypothesis template, naming, QA checklist.
Diagram of events and dashboard pre/post-click.
1-2 experiments: offer and CTA in a key format (300 × 250/320 × 100).
Enable viewability and anti-bot filters.
31-60 days - deepening
Expand to all formats and top placements; add HTML5 variants.
Implement rotation regulations and "fatigue" thresholds.
Introduce stratification by device/site, segment-level rollouts of winners.
61-90 days - maturity
Archive of experiments and factor base (offer/visual/cta).
Automated brief form + semi-standardized layouts (creative design system).
Monthly report: ROI of tests, % of winners, contribution to CR/CPA.
Bandit pilot for automated exploitation of winners in stable segments.
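For the bandit pilot, one simple option is Beta-Bernoulli Thompson sampling over the same daily counters. A minimal sketch with illustrative numbers; it exploits the current winner while still exploring:

```python
import random

def thompson_pick(stats: dict[str, dict[str, int]]) -> str:
    """Sample a plausible vCTR for each variant from Beta(clicks + 1, views - clicks + 1)
    and serve the variant with the highest draw."""
    draws = {
        variant: random.betavariate(s["clicks"] + 1, s["views"] - s["clicks"] + 1)
        for variant, s in stats.items()
    }
    return max(draws, key=draws.get)

# Illustrative counters; in production they come from the daily mart.
stats = {
    "A": {"views": 50_000, "clicks": 610},
    "B": {"views": 50_000, "clicks": 670},
}
print(thompson_pick(stats))  # mostly "B", occasionally "A"
```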
14) Mini templates (ready to copy-paste)
Hypothesis template
Problem: low vCTR on mobile in GEO {X}
Idea: replace the visual with a screen-interface scene + CTA "Open demo"
MDE: +8% to vCTR
Metrics: vCTR (primary), CR (secondary), CPA (control)
Segments: mobile, formats 320×100 / 1:1
Risks: post-click drop; verify LP events
Totals card
A: vCTR 1.22% [1.15; 1.29], CR 4.1%
B: vCTR 1.34% [1.27; 1.41], CR 4.3%, CPA ↓ 6%
Decision: B won. Rollout: mobile GEO {X}, 100%
Comment: The effect is stronger on Y/Z placements
An A/B banner testing system is not about "button color" - it is a set of disciplines: correct metrics (viewability → vCTR → post-click), clean randomization, strict QA, traffic quality control, rotation rules and transparent decisions. Build a pipeline of hypotheses, maintain an archive and a factor base, and creative will stop being a lottery: you will steadily increase advertising effectiveness and reduce CPA in predictable steps.