Picking the production model for a consumer-facing chat product is the hardest call an applied-AI team makes. Get it wrong in one direction and you burn money (a big model on small-model problems); get it wrong in the other and you leak deals (a small model on big-model problems). Get it right and nobody notices, which is the goal.
We ran a 60-day bake-off from February 1 to April 1, 2026. 18,342 real conversations across 38 dealer sites. Four models in rotation, routed randomly per session. Every conversation tagged, scored, and reviewed.
Here's what we found, what it cost us, and what we shipped.
The contenders
- Claude Haiku 4.5 — Anthropic's small-tier, our incumbent since Jan 2026.
- Claude Sonnet 4.6 — Anthropic's mid-tier. The "obvious upgrade" hypothesis.
- Claude Opus 4.6 — Anthropic's flagship. The "surely this is better" test.
- GPT-4.1 — OpenAI's equivalent mid-tier. Control for vendor bias.
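Per-session rotation is usually just sticky hash bucketing. Here's a minimal sketch, not our actual router; the model ids are illustrative, and only Opus's 10% traffic share (see the cost section below) reflects our real weights. The even split across the other three is an assumption for the example.

```python
import hashlib

# Weights: the 10% Opus share is from the cost section; the even split
# of the remainder across the other three is illustrative.
ROTATION = [
    ("claude-haiku-4.5", 0.30),
    ("claude-sonnet-4.6", 0.30),
    ("gpt-4.1", 0.30),
    ("claude-opus-4.6", 0.10),
]

def assign_model(session_id: str) -> str:
    """Deterministically bucket a session so every turn of a conversation
    stays on the same contender."""
    digest = hashlib.sha256(session_id.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for model, weight in ROTATION:
        cumulative += weight
        if u < cumulative:
            return model
    return ROTATION[-1][0]  # guard against floating-point edge cases

print(assign_model("dealer-017:session-4412"))
```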
The numbers
All costs include prompt caching (we cache the dealer's inventory and system prompt per session; the cache hit rate averaged 78%). Latency is measured to first token, client-side, from a US-East dealer.
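If you haven't used prompt caching, the session setup looks roughly like this with Anthropic's cache-control blocks. The dealer strings and the model id below are placeholders, not our production values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical stand-ins for the real per-dealer assets.
system_prompt = "You are the sales chat assistant for Big Lake Marine..."
inventory = "2024 Bennington 22SXSR, pontoon, $38,950, stock #1127\n..."

response = client.messages.create(
    model="claude-haiku-4-5",  # placeholder id; confirm against your model list
    max_tokens=512,
    system=[
        {"type": "text", "text": system_prompt},
        # cache_control marks the stable prefix as cacheable; later turns
        # that resend the same prefix are billed at the cached rate.
        {"type": "text", "text": inventory, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Do you have any pontoons under $40k?"}],
)
print(response.content[0].text)
```

Every turn after the first reuses that prefix, which is where the 78% hit rate comes from.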
The headline: Haiku 4.5 matched Sonnet 4.6 on lead qualification accuracy within the margin of error — at one-quarter of the cost and roughly half the latency. On hallucination, Sonnet was nominally better (0.4% vs 0.6%), but both were far below GPT-4.1 (1.9%, mostly financing-rate fabrications).
"Haiku 4.5 is the model that made real-time streaming to a boat buyer economically honest. At Sonnet pricing, we'd have had to rate-limit free chat. At Haiku pricing, we don't have to."
The uncomfortable finding
There was one task where Sonnet won clearly: listing copy. When we had all four models write 200-word marketing descriptions for 240 hulls and ran blind pairwise comparisons with actual buyers, Sonnet's copy was preferred 44% of the time versus 38% for Haiku.
That's a meaningful gap in isolation. Except: when we ran the same blind test with the dealer-principals who would be approving the copy, nobody could tell the difference. Dealer preference was a coin flip, a 51/49 split that wasn't statistically significant at our sample size. Buyers mildly preferred Sonnet's warmer prose. The dealers didn't notice.
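For the curious, "coin flip" here is just an exact binomial test on the head-to-head votes, ties excluded. A sketch with illustrative counts (not our raw data): a 51/49 split over roughly 240 votes is nowhere near significance.

```python
from scipy.stats import binomtest

def preference_pvalue(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test: could this head-to-head split
    have come from a fair coin? Ties/no-preference votes are excluded."""
    return binomtest(wins_a, wins_a + wins_b, p=0.5).pvalue

# Illustrative counts: 240 dealer votes split 51/49.
print(preference_pvalue(122, 118))  # ~0.85: indistinguishable from chance
```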
So we ship Haiku for listings too. Here's the honest trade-off we made: 6 percentage points fewer buyers prefer our listings, at one-quarter the cost. We redirected the difference into photography credits for dealers instead. Better photos move more boats than better prose.
What we didn't test for
Be skeptical of our numbers if you're using these models for different tasks. We did not test:
- Long-form reasoning (we cap context at 8K tokens in production — nobody needs a 200K-token buyer chat)
- Multilingual performance (we serve English-speaking US/Canadian dealers, full stop)
- Code generation (none of these models write code that runs on a dealer's behalf)
- Tool-use with 12+ tools (we run a focused 4-tool setup, sketched after this list: inventory search, financing calculator, CRM write, calendar book)
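For scale, here's roughly what a 4-tool surface looks like as Anthropic-style tool definitions. The names and schemas below are illustrative stand-ins, not our actual production definitions:

```python
def tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Build an Anthropic-style tool definition."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

TOOLS = [
    tool(
        "inventory_search",
        "Search the dealer's in-stock boats by type, length, and price.",
        {"boat_type": {"type": "string"}, "max_price_usd": {"type": "number"}},
        ["boat_type"],
    ),
    tool(
        "financing_calculator",
        "Estimate a monthly payment from price, down payment, APR, and term.",
        {
            "price_usd": {"type": "number"},
            "down_payment_usd": {"type": "number"},
            "apr_percent": {"type": "number"},
            "term_months": {"type": "integer"},
        },
        ["price_usd", "apr_percent", "term_months"],
    ),
    tool(
        "crm_write",
        "Record a qualified lead in the dealer's CRM.",
        {"name": {"type": "string"}, "phone": {"type": "string"}},
        ["name"],
    ),
    tool(
        "calendar_book",
        "Book a showroom visit or sea trial.",
        {"slot_iso8601": {"type": "string"}},
        ["slot_iso8601"],
    ),
]
```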
What the bake-off cost
The 60-day experiment cost us $11,800 in inference (mostly the Opus runs — we rotated Opus at 10% of traffic and it still dominated the bill). Plus about 140 hours of engineer time building the eval harness, labeling the qualification ground-truth set, and running the blind listing tests.
Expensive, yes. But at our scale, roughly 900,000 conversations/year across 42 dealers, picking Sonnet over Haiku would have cost us an extra $4,820/month in perpetuity. Against the $11,800 inference bill, that's $11,800 / $4,820 ≈ 2.4 months to break even; the bake-off paid for itself in the first three months.
The production setup
Here's what we actually run today (routing logic sketched below):
- Default: Haiku 4.5 with prompt caching + streaming
- Escalation: Sonnet 4.6 for conversations that cross 15 turns or include "compare X to Y" across 3+ boats
- Fallback: Circuit breaker to GPT-4.1 on Anthropic outages (happened twice in 60 days, once for 11 minutes)
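A minimal sketch of that routing. The 15-turn and 3-boat triggers are real; the compare-detection regex, the breaker's threshold and cooldown, and the model id strings are stand-ins for whatever your stack actually uses:

```python
import re
import time

ESCALATION_TURNS = 15
COMPARE = re.compile(r"\bcompare\b", re.IGNORECASE)  # simplified trigger

class Router:
    """Default / escalation / fallback routing as described above."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 120.0):
        self.failure_threshold = failure_threshold  # assumption
        self.cooldown_s = cooldown_s                # assumption
        self.failures = 0
        self.tripped_at = 0.0

    def pick_model(self, turn_count: int, message: str, boats_in_play: int) -> str:
        # Circuit breaker: after repeated Anthropic errors, serve GPT-4.1
        # until the cooldown elapses, then probe Anthropic again.
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.tripped_at < self.cooldown_s:
                return "gpt-4.1"
            self.failures = 0  # half-open: give Anthropic another try

        # Escalation: the conversation crossed 15 turns, or the buyer is
        # comparing three or more boats.
        if turn_count > ESCALATION_TURNS or (
            COMPARE.search(message) and boats_in_play >= 3
        ):
            return "claude-sonnet-4.6"
        return "claude-haiku-4.5"

    def record_anthropic_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.tripped_at = time.monotonic()
```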
About 4% of conversations hit the Sonnet escalation path. The other 96% stay on Haiku and nobody knows.
David Park leads AI at BoaterOS. Before that he ran the search-ranking team at Shopify for six years. He'll argue with you about evals on a whiteboard for free.