🚤 Marine Bench

Can AI answer 25 questions about Regulator Marine boats?

The Experiment

We asked 6 top LLMs (7 configurations, counting Gemini 3 Flash with and without web search) to answer 25 specific questions from a Regulator Marine product catalog.

📊 Results

Model                      Score   Accuracy
Grok 4.1 Fast              1/25    4%
Gemini 3 Flash             3/25    12%
Claude Opus 4.5            5/25    20%
DeepSeek v3.2              5/25    20%
GPT-5.2 (Instant)          8/25    32%
Gemini 3 Flash + Search    8/25    32%
Gemini 3 Pro               11/25   44%
🎯 With RAG Context        25/25   100%
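
For anyone re-checking the numbers, the accuracy column is simply the score divided by the 25 questions. A minimal sketch in Python, with the scores copied from the table above (the names `scores` and `accuracy_percent` are illustrative and not part of the benchmark itself):

```python
# Number of correct answers out of 25, copied from the results table.
TOTAL_QUESTIONS = 25

scores = {
    "Grok 4.1 Fast": 1,
    "Gemini 3 Flash": 3,
    "Claude Opus 4.5": 5,
    "DeepSeek v3.2": 5,
    "GPT-5.2 (Instant)": 8,
    "Gemini 3 Flash + Search": 8,
    "Gemini 3 Pro": 11,
    "With RAG Context": 25,
}


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> int:
    """Return accuracy as a whole-number percentage."""
    return round(100 * correct / total)


accuracy = {name: accuracy_percent(n) for name, n in scores.items()}

# Print one aligned row per configuration.
for name, pct in accuracy.items():
    print(f"{name:<26}{scores[name]:>2}/{TOTAL_QUESTIONS}  {pct:>3}%")
```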

🤣 Best Hallucinations

💀 Grok invented a boat

  • Confidently described the "Regulator 34" — a model that doesn't exist

🚀 Claude thinks boats are rockets

  • Said 0-30 mph in 4-5 seconds (actual: 10.53 seconds)
  • Hull color option: $18-22k (actual: $5,195), roughly 3.5x to 4x too high

⚖️ DeepSeek added 50% weight

  • Dry weight: 7,200 lbs (actual: 4,850 lbs)
  • Seakeeper price: $26,500 (actual: $54,765) — half off!
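
The size of each miss above can be checked by dividing the claimed value by the actual one. A rough sketch in Python, using only the figures quoted in this section (the labels and the `misses` dictionary are made up for illustration; ranges are split into their low and high ends):

```python
# (claimed, actual) pairs quoted in the hallucination list.
misses = {
    "Claude 0-30 mph, low (s)": (4.0, 10.53),
    "Claude 0-30 mph, high (s)": (5.0, 10.53),
    "Claude hull color, low ($)": (18_000, 5_195),
    "Claude hull color, high ($)": (22_000, 5_195),
    "DeepSeek dry weight (lbs)": (7_200, 4_850),
    "DeepSeek Seakeeper price ($)": (26_500, 54_765),
}


def ratio(claimed: float, actual: float) -> float:
    """How many times larger (or smaller) the claim is than the real value."""
    return claimed / actual


ratios = {label: round(ratio(c, a), 2) for label, (c, a) in misses.items()}

for label, r in ratios.items():
    print(f"{label:<30}{r:.2f}x of actual")
```

A ratio near 1.48 for dry weight matches the "about 50% heavier" reading, and a ratio near 0.48 for the Seakeeper price matches "half off".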

💡 Key Insights

44%

Best model without context. Even Gemini 3 Pro fails more than half the questions.

100%

Any model with RAG. Context quality matters more than model choice.

32%

Web search isn't enough. Gemini 3 Flash with search still missed 17 of 25 questions. Generic web data ≠ your proprietary docs.

🎯 The Takeaway

Model choice matters less than having the right context.
Grok scored 4% on its own, yet with your docs it beats a 44% Gemini 3 Pro without them.
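
To make "having the right context" concrete, here is a minimal sketch of the retrieval step in Python. It is not the pipeline used for the benchmark, which this page does not describe. It assumes a simple keyword-overlap retriever, and the snippet wording, `CATALOG_SNIPPETS`, `retrieve`, and `build_prompt` are all hypothetical. The figures in the snippets are the "actual" values quoted above. The resulting prompt would then be sent to whichever model you use.

```python
import re

# Illustrative stand-ins for catalog passages. The numbers are the "actual"
# values quoted on this page; the wording is invented for the example.
CATALOG_SNIPPETS = [
    "Dry weight: 4,850 lbs",
    "Seakeeper option price: $54,765",
    "Hull color option price: $5,195",
    "Acceleration, 0-30 mph: 10.53 seconds",
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, snippets: list, k: int = 2) -> list:
    """Return the k snippets sharing the most tokens with the question."""
    q = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, snippets: list, k: int = 2) -> str:
    """Put the retrieved snippets in front of the question."""
    context = "\n".join(f"- {s}" for s in retrieve(question, snippets, k))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


prompt = build_prompt("How much does the Seakeeper option cost?", CATALOG_SNIPPETS)
print(prompt)
```

Keyword overlap is the simplest possible retriever. A real setup would more likely use embeddings or a search index, but the idea is the same: the model answers from the catalog text in the prompt instead of from memory.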