◆ AI

Measuring AI lift honestly.

Most AI case studies are marketing fiction. Here's the framework we use — randomization, channel segmentation, Bayesian priors — and the 83% lift that turned out to be 47%.

David Park
March 4, 2026 · 10 min read

Read any AI-for-sales case study in the last 18 months and you'll see the same claim structure: "We added AI, conversions went up 143%." Nobody explains what the control group was, whether it even existed, or why the increase can't be explained by the fact that spring happens every year.

When we say BoaterOS delivers a 47% lift in qualified leads at Fish Tale, we need to be able to defend it in a room full of people paid to poke holes in claims. Here's exactly how we measure it, including the parts that are still hard.

The four rules

Rule 1: Randomize at session level, not page level. A buyer lands on the site. We flip a coin — control or treatment — and that coin, stored in a cookie, persists for the entire session, including reloads. Page-level randomization sounds fine until you realize a buyer shown AI on page 1 and not-AI on page 2 is neither fish nor fowl.
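A session-level split can be sketched like this. In production the post describes a cookie; this sketch uses a deterministic hash of a session ID as a stand-in, so repeated calls for the same session always land in the same arm. The function name and experiment label are illustrative, not BoaterOS's actual code.

```python
import hashlib

def assign_variant(session_id: str, experiment: str = "ai-companion-v1") -> str:
    """Deterministically map a session to control or treatment.

    Hashing (experiment, session_id) gives a stable ~50/50 split:
    the same session always gets the same arm, which is the property
    the cookie enforces in production.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Reloads keep the same assignment because the hash is deterministic.
assert assign_variant("sess-42") == assign_variant("sess-42")
```

Salting the hash with the experiment name means a new experiment reshuffles sessions instead of reusing the previous split.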

Rule 2: Segment by channel, not just globally. The buyer from a paid-search ad has different baseline intent than one from an organic blog post. Averaging across channels hides giant effect-size differences. We report lift per-channel (paid search, organic search, direct, referral, email) and separately report the weighted rollup.

Rule 3: Track the full funnel, not just the first step. Lift on "chat opened" is useless. Lift on "qualified lead" is better. Lift on "showing scheduled" is better still. Lift on "boat sold" is the thing we actually care about, but it's the slowest to converge. We report all four and weight our decisions toward the later stages.

Rule 4: Bayesian, not frequentist. Dealers have real priors. A dealer-principal who has been selling boats for 30 years has a meaningful prior on what a reasonable lift looks like (hint: it's not 400%). We encode those priors and update them. Frequentist p-values treat every experiment like the first experiment ever. That's not how a working dealership thinks.
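One standard way to encode a prior like Rule 4 describes is a Beta-Binomial model: a Beta prior on each arm's conversion rate, updated with observed counts, with lift read off the posterior by Monte Carlo. This is a minimal sketch, not BoaterOS's pipeline; the prior parameters and the function name are illustrative assumptions.

```python
import random

def posterior_lift_samples(conv_t, n_t, conv_c, n_c,
                           prior_a=2.0, prior_b=98.0,
                           draws=20000, seed=0):
    """Monte Carlo samples from the posterior over relative lift.

    Beta(prior_a, prior_b) encodes the dealer's prior on conversion
    rate -- Beta(2, 98) says "a couple percent is plausible", which
    automatically makes a 400% lift claim fight the prior. These
    exact numbers are illustrative, not the ones used at Fish Tale.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Conjugate update: Beta(a + successes, b + failures) per arm.
        p_t = rng.betavariate(prior_a + conv_t, prior_b + n_t - conv_t)
        p_c = rng.betavariate(prior_a + conv_c, prior_b + n_c - conv_c)
        samples.append(p_t / p_c - 1.0)  # relative lift of treatment
    return samples
```

With 300 conversions on 10,000 treatment sessions vs. 200 on 10,000 control sessions, the posterior median lift comes out near 50%, and the spread of the samples gives the credible interval directly.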

"If your AI vendor can't show you a per-channel funnel cut of the lift, they didn't measure it — they guessed at it and hoped the number was big enough to round up."

The 83% that became 47%

When we first reported lift from a 30-day pilot at Fish Tale, the naïve comparison was +83% qualified leads vs. the pre-AI baseline. A very big number. Also a very wrong number.

Here's what the proper cut looked like:

| Channel | Naïve lift | Segmented lift | 95% CI |
| --- | --- | --- | --- |
| Paid search | +112% | +62% | [48%, 78%] |
| Organic search | +74% | +41% | [28%, 55%] |
| Direct | +58% | +28% | [14%, 43%] |
| Referral | +38% | +19% | [3%, 36%] |
| Weighted rollup | +83% | +47% | [38%, 55%] |
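The rollup row is just a traffic-weighted average of the per-channel lifts. A minimal sketch, using the segmented lifts from the table and a *hypothetical* channel mix (the post doesn't publish Fish Tale's actual traffic shares):

```python
def weighted_rollup(channel_lifts, channel_weights):
    """Traffic-weighted average of per-channel lifts.

    channel_weights are each channel's share of sessions and must
    sum to 1. The weights below are hypothetical, chosen only to
    illustrate the arithmetic.
    """
    assert abs(sum(channel_weights.values()) - 1.0) < 1e-9
    return sum(channel_lifts[ch] * channel_weights[ch] for ch in channel_lifts)

lifts = {"paid": 0.62, "organic": 0.41, "direct": 0.28, "referral": 0.19}
weights = {"paid": 0.45, "organic": 0.30, "direct": 0.15, "referral": 0.10}

print(f"{weighted_rollup(lifts, weights):.1%}")  # ~46% with these made-up weights
```

The point of reporting both cuts: if the channel mix shifts (say, paid spend doubles), the per-channel lifts stay interpretable while the rollup moves.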

Where did the other 36 percentage points go? Two sources:

47% is the defensible number. 83% was the hopeful number. We report the defensible number.

How we run the test in practice

At experiment start, we randomly assign 50% of new sessions to treatment (AI Companion visible) and 50% to control (AI Companion hidden behind the old "Request Info" form). The assignment lives in a first-party cookie that expires after 90 days. The experiment is blind on the sales floor: nobody on the dealer's sales team knows which sessions were which — they see the leads land in the CRM the same way either way.

Every event in the funnel — page view, chat open, qualification, showing, close — is tagged with the assignment variant. We run a nightly Bayesian update on the posterior over lift for each stage. When the 90% credible interval on "qualified lead" lift is both (a) entirely above zero and (b) its lower bound is at least 10 percentage points above zero for five consecutive days, we call the test significant.

Typical time-to-significance for a 30,000-session/month dealer: 11-14 days.

What we can't measure honestly

Some things we don't claim to measure, because the measurement is bad enough that a claim would be dishonest:

Why we publish this

Because the bar for AI claims in our industry is on the floor, and we'd rather compete on the honest number.

If you're evaluating BoaterOS or any competitor, ask three questions: What was your control group? What was your randomization unit? Can you show me the lift by channel? If the answers are evasive, the number is a guess.


David Park leads AI at BoaterOS. He has a PhD in statistics and will talk your ear off about improper priors.

◆ Next step

Run BoaterOS at your dealership.

30-min demo on your inventory. See what the AI, the CRM, and the website look like running your lot next Monday.

Book a demo Read Fish Tale's case study