
Why we run Claude Haiku 4.5 in production.

A 60-day bake-off across 18,000 real dealer conversations. Four frontier models, five dimensions, one uncomfortable finding about listing copy.

David Park
April 15, 2026 · 12 min read

Picking the production model for a consumer-facing chat product is the hardest call an applied-AI team makes. Get it wrong and you either burn money (big model, small-model problems) or leak deals (small model, big-model problems). Get it right and nobody notices — which is the goal.

We ran a 60-day bake-off from February 1 to April 1, 2026. 18,342 real conversations across 38 dealer sites. Four models in rotation, routed randomly per session. Every conversation tagged, scored, and reviewed.
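The "routed randomly per session" mechanic can be sketched as a deterministic hash split. This is a minimal sketch, not our production harness; the model labels and weights are placeholders, except Opus's 10% share, which matches the traffic cap mentioned later in the post.

```python
import hashlib

# Model arms and traffic weights. Weights are illustrative, except Opus,
# which ran at 10% of traffic per the post.
MODELS = ["haiku-4.5", "sonnet-4.6", "opus-4.6", "gpt-4.1"]
WEIGHTS = [0.30, 0.30, 0.10, 0.30]

def assign_model(session_id: str) -> str:
    """Deterministically map a session id to a model arm.

    Hashing the session id keeps every turn of a conversation on the
    same model, while the population-level split matches WEIGHTS.
    """
    digest = hashlib.sha256(session_id.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for model, weight in zip(MODELS, WEIGHTS):
        cumulative += weight
        if u < cumulative:
            return model
    return MODELS[-1]  # guard against float rounding at u ~ 1.0
```

Hash-based assignment beats a per-request coin flip for one reason: a returning session always lands on the same arm, so mid-conversation model switches never contaminate the scoring.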

Here's what we found, what it cost us, and what we shipped.

The contenders

Four models went into rotation: Claude Haiku 4.5, Claude Sonnet 4.6, and Claude Opus 4.6 from Anthropic, plus GPT-4.1 from OpenAI.

The numbers

All costs include prompt caching: we cache each dealer's inventory and system prompt per session, and the cache hit rate averaged 78%. Latency is measured to first token, client-side, from a US-East dealer.
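The blended input price under caching is just a weighted average of the cached and uncached rates. A quick sketch; the token count and per-MTok prices below are placeholders for illustration, not the figures behind the table.

```python
def blended_input_cost(tokens: int, full_price_per_mtok: float,
                       cached_price_per_mtok: float, cache_hit_rate: float) -> float:
    """Blended input cost in dollars: cached reads bill at the reduced
    rate, cache misses pay the full per-MTok price."""
    cached = tokens * cache_hit_rate * cached_price_per_mtok / 1e6
    uncached = tokens * (1 - cache_hit_rate) * full_price_per_mtok / 1e6
    return cached + uncached

# Example: a 6,000-token cached prefix (inventory + system prompt),
# hypothetical $1.00/MTok full vs $0.10/MTok cached, 78% hit rate.
cost = blended_input_cost(6_000, 1.00, 0.10, 0.78)
```

With a high hit rate, the cached rate dominates, which is why the per-conversation numbers in the table are so sensitive to that 78%.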

                          Haiku 4.5    Sonnet 4.6   Opus 4.6    GPT-4.1
                          ──────────   ──────────   ─────────   ─────────
Latency p50 (ms)          412          680          1,410       890
Latency p95 (ms)          840          1,290        3,050       1,870
Cost / 1k convos ($)      $1.84        $7.20        $34.60      $8.40
Qualification accuracy    91.2%        91.8%        92.4%       88.1%
Hallucination rate        0.6%         0.4%         0.3%        1.9%
Listing copy (blind win%) 38%          44%          45%         31%

The headline: Haiku 4.5 matched Sonnet 4.6 on lead qualification accuracy within the margin of error — at one-quarter of the cost and roughly half the latency. On hallucination, Sonnet was nominally better (0.4% vs 0.6%), but both were far below GPT-4.1 (1.9%, mostly financing-rate fabrications).
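For why we call a 0.6-point accuracy gap "within the margin of error": a rough two-proportion z-test, assuming an even split of the 18,342 conversations across the four arms (about 4,585 each; an assumption for illustration, not a published per-arm count).

```python
import math

def two_prop_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the difference between two proportions,
    using the pooled standard error."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Sonnet 91.8% vs Haiku 91.2% qualification accuracy, ~4,585 per arm.
z = two_prop_z(0.918, 4585, 0.912, 4585)
# |z| is around 1.0, well under the 1.96 needed for significance at 5%.
```

At these sample sizes the test would need roughly a 1.1-point gap to clear significance, so 0.6 points is noise.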

"Haiku 4.5 is the model that made real-time streaming to a boat buyer economically honest. At Sonnet pricing, we'd have had to rate-limit free chat. At Haiku pricing, we don't have to."

The uncomfortable finding

There was one task where Sonnet won clearly: listing copy. When we had all four models write 200-word marketing descriptions for 240 hulls and ran blind pairwise comparisons with actual buyers, Sonnet's copy was preferred 44% of the time versus 38% for Haiku.
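Turning blind pairwise judgments into the win percentages above is a simple tally: wins divided by appearances. A sketch with made-up judgment data; ties, which we dropped, are omitted here.

```python
from collections import Counter

def win_rates(judgments: list[tuple[str, str]]) -> dict[str, float]:
    """Each judgment is (winner, loser) from one blind pairwise
    comparison. Returns wins / total appearances per model."""
    wins = Counter(winner for winner, _ in judgments)
    appearances = Counter()
    for winner, loser in judgments:
        appearances[winner] += 1
        appearances[loser] += 1
    return {m: wins[m] / appearances[m] for m in appearances}
```

With four models in the pool, a win rate near 25% would mean "no better than the field"; Sonnet's 44% and Haiku's 38% both sit comfortably above that.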

That's a meaningful gap in isolation. Except: when we ran the same blind test with the dealer-principals who would be approving the copy, nobody could tell the difference. Dealer preference was a coin flip — a 51/49 split, statistically indistinguishable from chance. Buyers slightly preferred Sonnet's warmer prose; dealers didn't notice.

So we ship Haiku for listings too. Here's the honest trade-off we made: listings that 6 percentage points fewer buyers prefer, at roughly a quarter of the cost. We redirected the difference into photography credits for dealers instead. Better photos move more boats than better prose.

What we didn't test for

Be skeptical of our numbers if you're using these models for different tasks. We did not test:

What the bake-off cost

The 60-day experiment cost us $11,800 in inference (mostly the Opus runs — we rotated Opus at 10% of traffic and it still dominated the bill). Plus about 140 hours of engineer time building the eval harness, labeling the qualification ground-truth set, and running the blind listing tests.

Expensive, yes. But at our scale — roughly 900,000 conversations a month across 42 dealers — picking Sonnet over Haiku ($5.36 more per thousand conversations) would have cost us an extra $4,820/month in perpetuity. The bake-off paid for itself in the first three months.

The production setup

Here's what we actually run today:

About 4% of conversations hit the Sonnet escalation path. The other 96% stay on Haiku and nobody knows.
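A minimal sketch of that routing. The trigger names and model id strings are hypothetical placeholders; the post only commits to Haiku by default with a roughly 4% Sonnet escalation path.

```python
# Hypothetical escalation triggers; the real conditions aren't published.
ESCALATION_TRIGGERS = {"needs_long_form_copy", "low_confidence", "dealer_override"}

def pick_model(signals: set[str]) -> str:
    """Default the session to Haiku; escalate to Sonnet if any
    trigger signal fires during the conversation."""
    if signals & ESCALATION_TRIGGERS:
        return "claude-sonnet-4-6"  # ~4% of conversations
    return "claude-haiku-4-5"       # the other ~96%
```

The design point is that escalation is per-conversation, not per-message: once a session earns Sonnet, it stays there, so the buyer never sees a mid-chat tone shift.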


David Park leads AI at BoaterOS. Before that he ran the search-ranking team at Shopify for six years. He'll argue with you about evals on a whiteboard for free.
