# 🧪 Benchmarking AI Models
Stress-testing open-weight LLMs on what actually matters for agent-style workloads: reliable tool calling with realistic tool schemas, coding quality, latency under concurrency, and true $/turn cost. A lot of leaderboard hype doesn't survive contact with production.
## 📊 Rough Results
| Model | Tool calls | Coding | Latency | Verdict |
|---|---|---|---|---|
| Qwen3-Next-80B-A3B | ✅ clean | ✅ concise | ~780 ms | Best snappy all-rounder |
| MiniMax-M2.7 (230B/10B active) | ✅ 9/10 | ✅ thorough | ~3.7 s | Always reasoning → heavy work |
| Qwen3.6-35B-A3B (thinking off) | ✅ | ✅ | ~1.4 s | Solid and newer |
| Gemma 4 31B (dense) | ✅ | ✅ cleanest TypeScript | ~1.4 s | Careful coder |
| Gemma 4 26B-A4B | ✅ | ⚠️ messy | ~960 ms | Cheap mass ops |
| Llama 4 Scout | ❌ mangled args | ❌ | ~1 s | Host parser broken |
Tested against realistic agent request shapes (many tools, long context). Hosting providers intentionally omitted.
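The latency column comes from timing requests under concurrency rather than one-off calls. A minimal sketch of that kind of harness; the endpoint call itself is stubbed out (`call` is whatever client invocation your provider uses):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bench(call, n_requests=32, concurrency=8):
    """Fire n_requests through `call` with bounded concurrency;
    return per-request wall-clock latencies in milliseconds."""
    def timed():
        t0 = time.perf_counter()
        call()  # e.g. one chat completion with your real tool schemas attached
        return (time.perf_counter() - t0) * 1000.0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed) for _ in range(n_requests)]
        return [f.result() for f in futures]

def percentile(samples, q):
    """Nearest-rank percentile; no numpy needed."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]
```

Report p50 and p95 rather than the mean; under concurrency the tail is what a user actually feels.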
## 🧠 Takeaways
- "Function calling: Yes" on a spec sheet doesn't mean it works in production. One widely-praised model truncated arguments at 8 tokens on its main host.
- Reasoning-on-by-default models feel brilliant on benchmarks and sluggish in a chat window.
- For agent workloads, input cost dominates: pick on input price, not output price.
- A mixed-model strategy beats a single model. Snappy tasks go to a fast, cheap model, deep reasoning to a slower one; each role gets the right tool.
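The truncated-arguments failure mode is easy to screen for automatically: decode the raw arguments string the model emitted and check it against the tool's required keys. A minimal sketch (function and key names are illustrative, not from any specific host's API):

```python
import json

def check_tool_call(raw_args, required_keys):
    """Classify a model's raw tool-call arguments string.
    A host that truncates arguments shows up as invalid JSON;
    a lazy model shows up as valid JSON with required keys missing."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return False, "invalid/truncated JSON"
    if not isinstance(args, dict):
        return False, "arguments not an object"
    missing = set(required_keys) - args.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

Running every benchmark response through a check like this is how "Function calling: Yes" gets turned into an actual pass/fail rate per host.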
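The input-cost claim is simple arithmetic: with dozens of tool schemas plus conversation history resent on every request, an agent turn sends far more tokens in than it gets out. A worked example with hypothetical numbers (12K input tokens, 300 output tokens, $0.20/M input, $0.80/M output; not any specific provider's pricing):

```python
def cost_per_turn(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one agent turn at given per-million-token prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Hypothetical agent turn: tool schemas + history dominate the input side.
turn = cost_per_turn(12_000, 300, 0.20, 0.80)
input_share = (12_000 / 1e6 * 0.20) / turn
```

Even with output priced 4x higher per token, input is roughly 90% of the bill in this scenario, which is why the takeaway says to shop on input price.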
## 🔭 Coming Soon: Open Model Benchmark
Building a public benchmark page: reliable tool calling, long context under load, true cost per turn, and how models behave with 30+ tool schemas in play. Open methodology, reproducible scripts. Bookmark it; it's coming.