Read any AI-for-sales case study in the last 18 months and you'll see the same claim structure: "We added AI, conversions went up 143%." Nobody explains what the control group was, whether it even existed, or why the increase can't be explained by the fact that spring happens every year.
When we say BoaterOS delivers a 47% lift in qualified leads at Fish Tale, we need to be able to defend it in a room full of people paid to poke holes in claims. Here's exactly how we measure it, including the parts that are still hard.
The four rules
Rule 1: Randomize at session level, not page level. A buyer lands on the site. We flip a coin (control or treatment) and that coin is bound to a first-party cookie, so it persists for the entire session, reloads included. Page-level randomization sounds fine until you realize a buyer shown AI on page 1 and not-AI on page 2 is neither fish nor fowl.
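In code, session-level assignment is a few lines. Here's a sketch, not our production code (the salt and function name are illustrative): hash the session id from the first-party cookie, and the variant falls out deterministically, so a reload can't flip it.

```python
import hashlib

def assign_variant(session_id: str, salt: str = "boateros-exp-1") -> str:
    """Deterministically assign a session to control or treatment.

    Hashing the session id (stored in a first-party cookie) means the
    same session always gets the same variant, across reloads.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# The assignment is stable: every call for one session agrees.
assert assign_variant("sess-abc123") == assign_variant("sess-abc123")
```

Changing the salt per experiment re-randomizes everyone, which is exactly what you want between tests and never want within one.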
Rule 2: Segment by channel, not just globally. The buyer from a paid-search ad has different baseline intent than one from an organic blog post. Averaging across channels hides giant effect-size differences. We report lift per-channel (paid search, organic search, direct, referral, email) and separately report the weighted rollup.
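The per-channel cut is mechanically simple; the discipline is in refusing to report the rollup without it. A sketch with made-up counts (the function and the numbers are ours for illustration, not Fish Tale data):

```python
def channel_lift(control, treatment):
    """Per-channel relative lift in qualified-lead rate, plus a
    traffic-weighted rollup. Each dict maps channel -> (sessions, leads)."""
    lifts = {}
    for ch in control:
        c_sessions, c_leads = control[ch]
        t_sessions, t_leads = treatment[ch]
        c_rate = c_leads / c_sessions
        lifts[ch] = (t_leads / t_sessions - c_rate) / c_rate
    total = sum(s for s, _ in treatment.values())
    rollup = sum(lift * treatment[ch][0] / total for ch, lift in lifts.items())
    return lifts, rollup

# Hypothetical counts: paid search converts better and lifts harder.
control   = {"paid": (1000, 100), "organic": (1000, 50)}
treatment = {"paid": (1000, 150), "organic": (1000, 60)}
lifts, rollup = channel_lift(control, treatment)
# paid lift = 0.50, organic lift = 0.20, weighted rollup = 0.35
```

Note how a 50% paid lift and a 20% organic lift average to something a headline writer would round however they please; reporting all three keeps everyone honest.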
Rule 3: Track the full funnel, not just the first step. Lift on "chat opened" is useless. Lift on "qualified lead" is better. Lift on "showing scheduled" is better still. Lift on "boat sold" is the thing we actually care about, but it's the slowest to converge. We report all four and weight our decisions toward the later stages.
Rule 4: Bayesian, not frequentist. Dealers have real priors. A dealer-principal who has been selling boats for 30 years has a meaningful prior on what a reasonable lift looks like (hint: it's not 400%). We encode those priors and update them. Frequentist p-values treat every experiment like the first experiment ever. That's not how a working dealership thinks.
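Here's one way to encode that 30-year prior, assuming a Beta-Binomial model for the qualified-lead rate. The prior parameters and counts below are illustrative, and we sample with the standard library rather than a stats package so the sketch stays dependency-free:

```python
import random

def posterior_lift_samples(prior_a, prior_b, leads, sessions,
                           baseline_rate, n=20000, seed=7):
    """Draw posterior samples of the qualified-lead rate under a
    Beta(prior_a, prior_b) prior, expressed as relative lift over the
    pre-AI baseline. Conjugacy: posterior is Beta(a + leads, b + misses)."""
    rng = random.Random(seed)
    a, b = prior_a + leads, prior_b + (sessions - leads)
    return sorted((rng.betavariate(a, b) - baseline_rate) / baseline_rate
                  for _ in range(n))

# A skeptical dealer prior centered near a historical 8% rate,
# then a month of hypothetical treatment data.
samples = posterior_lift_samples(prior_a=8, prior_b=92,
                                 leads=120, sessions=1000, baseline_rate=0.08)
lo, hi = samples[1000], samples[19000]  # ~90% credible interval on lift
```

The prior's weight (a + b = 100 pseudo-sessions here) is the dial: a dealer who has seen everything gets a heavier prior, and a 400% fluke gets shrunk accordingly.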
"If your AI vendor can't show you a per-channel funnel cut of the lift, they didn't measure it — they guessed at it and hoped the number was big enough to round up."
The 83% that became 47%
When we first reported lift from a 30-day pilot at Fish Tale, the naïve comparison was +83% qualified leads vs. the pre-AI baseline. A very big number. Also a very wrong number.
Here's what the proper cut looked like: compared against a concurrent, session-randomized control group, segmented by channel, the lift came out to +47% qualified leads. Where did the other 36 percentage points go? Two sources:
- Seasonality. The pilot ran March-April. Intent was already climbing. Some of the lift was spring, not AI.
- Traffic-mix drift. Fish Tale concurrently ran a paid-search push. Paid traffic over-indexes on intent, so the treated cohort ended up being richer-intent than the baseline comparison period. Segmenting by channel neutralizes this.
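Traffic-mix drift is worth seeing with numbers. In this made-up example (ours, not Fish Tale's), every channel's lead rate is identical in both periods, yet the pooled rate jumps sharply purely because the mix shifted toward paid:

```python
# Hypothetical illustration of traffic-mix drift. Per-channel lead rates
# are identical in both periods (paid 15%, organic 5%), but the pooled
# rate rises because the mix shifted toward higher-intent paid traffic.
baseline = {"paid": (2000, 300), "organic": (8000, 400)}  # (sessions, leads)
pilot    = {"paid": (6000, 900), "organic": (4000, 200)}

def pooled_rate(period):
    sessions = sum(s for s, _ in period.values())
    leads = sum(l for _, l in period.values())
    return leads / sessions

# pooled_rate(baseline) = 0.07; pooled_rate(pilot) = 0.11.
# That's a +57% "lift" with zero change in any channel's conversion rate.
```

This is the Simpson's-paradox flavor of the problem, and it's why the per-channel cut isn't optional.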
47% is the defensible number. 83% was the hopeful number. We report the defensible number.
How we run the test in practice
At experiment start, we randomly assign 50% of new sessions to treatment (AI Companion visible) and 50% to control (AI Companion hidden behind the old "Request Info" form). The assignment lives in a first-party cookie that expires after 90 days. The experiment is blinded on the sales floor: nobody on the dealer's sales team knows which sessions were which, because the leads land in the CRM the same way either way.
Every event in the funnel (page view, chat open, qualification, showing, close) is tagged with the assignment variant. We run a nightly Bayesian update on the posterior over lift for each stage. When the lower bound of the 90% credible interval on "qualified lead" lift has been at least 10 percentage points above zero for five consecutive days, we call the test significant.
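The nightly check can be sketched in a few lines. Assumptions made explicit: flat Beta(1, 1) priors on each arm (in production the dealer prior would go here), Monte Carlo for the interval, and illustrative function names:

```python
import random

def credible_interval(ctrl_leads, ctrl_sessions, trt_leads, trt_sessions,
                      n=20000, seed=11):
    """~90% credible interval on relative lift in qualified-lead rate,
    using flat Beta(1, 1) priors and stdlib Monte Carlo sampling."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n):
        c = rng.betavariate(1 + ctrl_leads, 1 + ctrl_sessions - ctrl_leads)
        t = rng.betavariate(1 + trt_leads, 1 + trt_sessions - trt_leads)
        lifts.append((t - c) / c)
    lifts.sort()
    return lifts[n // 20], lifts[-(n // 20)]  # 5th and 95th percentiles

def significant(daily_intervals, min_lower=0.10, streak=5):
    """Call the test when the interval's lower bound clears zero by at
    least `min_lower` (10 points of lift) for `streak` consecutive days."""
    run = 0
    for lo, _hi in daily_intervals:
        run = run + 1 if lo >= min_lower else 0
        if run >= streak:
            return True
    return False
```

The consecutive-days requirement is a cheap guard against calling the test on one lucky night of traffic.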
Typical time-to-significance for a 30,000-session/month dealer: 11-14 days.
What we can't measure honestly
Some things we don't claim to measure, because the measurement is bad enough that a claim would be dishonest:
- Brand lift. Does having a sharper AI make a buyer more likely to mention you to a friend six months later? Probably yes. We can't measure it, and we don't claim to.
- Word-of-mouth. Anecdotally, every dealer-principal reports more "my buddy Bob told me to check you out" stories after rolling out BoaterOS. Anecdote isn't data.
- Close-rate lift from AI-primed buyers. Buyers who spent 12 minutes with the AI before calling close at a visibly higher rate. But the confound with buyer intent is enormous: did the AI close them, or were they already going to close? We can measure the correlation; we can't cleanly attribute it. So we don't report it.
Why we publish this
Because the bar for AI claims in our industry is on the floor, and we'd rather compete on the honest number.
If you're evaluating BoaterOS or any competitor, ask three questions: What was your control group? What was your randomization unit? Can you show me the lift by channel? If the answers are evasive, the number is a guess.
David Park leads AI at BoaterOS. He has a PhD in statistics and will talk your ear off about improper priors.