🧪 Benchmarking AI Models

Stress-testing open-weight LLMs on what actually matters for agent-style workloads: reliable tool calling with realistic tool schemas, coding quality, latency under concurrency, and true $/turn cost. A lot of leaderboard hype doesn't survive contact with production.
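"Latency under concurrency" in practice means firing many requests at once and looking at the latency distribution, not the average. A minimal sketch of that harness, where `task` is a stand-in for whatever chat-completions call you're benchmarking (the percentile math is a simple nearest-rank approximation):

```typescript
// Minimal latency-under-concurrency harness (sketch).
type AsyncTask = () => Promise<void>;

// Time a single request in milliseconds.
async function timeTask(task: AsyncTask): Promise<number> {
  const start = Date.now();
  await task();
  return Date.now() - start;
}

// Fire `concurrency` requests simultaneously, collect per-request latencies.
async function benchConcurrent(
  task: AsyncTask,
  concurrency: number
): Promise<number[]> {
  const runs = Array.from({ length: concurrency }, () => timeTask(task));
  return Promise.all(runs);
}

// Nearest-rank percentile: p50/p95 tell you far more than the mean,
// because tail latency is what users actually feel in a chat window.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}
```

Run it at concurrency 1, 8, 32 and compare p95: a model that looks fast single-stream often falls apart once requests queue on the host.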

📊 Rough Results

| Model | Tool calls | Coding | Latency | Verdict |
| --- | --- | --- | --- | --- |
| Qwen3-Next-80B-A3B | ✅ clean | ✅ concise | ~780 ms | Best snappy all-rounder |
| MiniMax-M2.7 (230B/10B active) | ✅ 9/10 | ✅ thorough | ~3.7 s | Always reasoning — heavy work |
| Qwen3.6-35B-A3B (thinking off) | ✅ | ✅ | ~1.4 s | Solid and newer |
| Gemma 4 31B (dense) | ✅ | ✅ cleanest TypeScript | ~1.4 s | Careful coder |
| Gemma 4 26B-A4B | ✅ | ⚠️ messy | ~960 ms | Cheap mass ops |
| Llama 4 Scout | ❌ mangled args | — | ~1 s | Host parser broken |

Tested against realistic agent request shapes (many tools, long context). Hosting providers intentionally omitted.
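"Mangled args" is the failure mode worth testing for explicitly: the model emits a tool call, but the arguments don't parse or drop required fields. A sketch of the check used to score that column, where `ToolSchema` is a deliberately simplified stand-in for a real JSON Schema object:

```typescript
// Simplified tool schema: just a name and its required argument keys.
interface ToolSchema {
  name: string;
  required: string[];
}

interface ToolCallResult {
  ok: boolean;
  reason?: string;
}

// Score one tool call's raw argument string against its schema.
function checkToolCall(schema: ToolSchema, rawArgs: string): ToolCallResult {
  // Truncated or mangled output usually fails right here, at JSON.parse.
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "arguments are not a JSON object" };
  }
  const args = parsed as Record<string, unknown>;
  const missing = schema.required.filter((k) => !(k in args));
  return missing.length > 0
    ? { ok: false, reason: `missing required args: ${missing.join(", ")}` }
    : { ok: true };
}
```

The "9/10" style scores above are just the pass rate of this check across repeated runs of the same tool-heavy prompt.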

🧠 Takeaways

  • "Function calling: Yes" on a spec sheet doesn't mean it works in production. One widely-praised model truncated arguments at 8 tokens on its main host.
  • Reasoning-on-by-default models feel brilliant on benchmarks and sluggish in a chat window.
  • For agent workloads, input cost dominates โ€” pick on input price, not output.
  • Mixed-model strategy beats single-model. Snappy tasks on a fast cheap model, deep reasoning on a slower one โ€” each role gets the right tool.
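The input-cost point falls out of simple arithmetic: an agent turn resends the whole prompt (tool schemas plus conversation history) while generating a short reply. A sketch with illustrative placeholder prices, not any provider's real rates:

```typescript
// $ per 1M tokens, input vs output. Placeholder numbers for illustration.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// True cost of one agent turn: prompt in, reply out.
function costPerTurn(
  p: Pricing,
  inputTokens: number,
  outputTokens: number
): number {
  return (
    (inputTokens / 1e6) * p.inputPerMTok +
    (outputTokens / 1e6) * p.outputPerMTok
  );
}

// A typical agent turn: ~12k tokens of tools + history in, ~300 tokens out.
const turn = costPerTurn({ inputPerMTok: 0.2, outputPerMTok: 0.8 }, 12_000, 300);
// Input side: $0.0024. Output side: $0.00024. Even with output priced 4x
// higher per token, input is ~10x of the bill for this shape of request.
```

That ratio only gets more lopsided as tool count and history grow, which is why the takeaway says to pick on input price.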

🔭 Coming Soon: Open Model Benchmark

Building a public benchmark page — reliable tool calling, long context under load, true cost per turn, and how models behave with 30+ tool schemas in play. Open methodology, reproducible scripts. Bookmark — it's coming.