🧪 Benchmarking AI Models

Stress-testing open-weight LLMs on what actually matters for agent-style workloads: reliable tool calling with realistic tool schemas, coding quality, latency under concurrency, and true $/turn cost. A lot of leaderboard hype doesn't survive contact with production.
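"Latency under concurrency" in practice means firing many requests at once and looking at the latency distribution, not the average. A minimal sketch of that harness, where `task` is a stand-in for whatever chat-completions call you're benchmarking (the percentile math is a simple nearest-rank approximation):

```typescript
// Minimal latency-under-concurrency harness (sketch).
type AsyncTask = () => Promise<void>;

// Time a single request in milliseconds.
async function timeTask(task: AsyncTask): Promise<number> {
  const start = Date.now();
  await task();
  return Date.now() - start;
}

// Fire `concurrency` requests simultaneously, collect per-request latencies.
async function benchConcurrent(
  task: AsyncTask,
  concurrency: number
): Promise<number[]> {
  const runs = Array.from({ length: concurrency }, () => timeTask(task));
  return Promise.all(runs);
}

// Nearest-rank percentile: p50/p95 tell you far more than the mean,
// because tail latency is what users actually feel in a chat window.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}
```

Run it at concurrency 1, 8, 32 and compare p95: a model that looks fast single-stream often falls apart once requests queue on the host.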

📊 Rough Results

| Model | Tool calls | Coding | Latency | Verdict |
| --- | --- | --- | --- | --- |
| Qwen3-Next-80B-A3B | ✅ clean | ✅ concise | ~780 ms | Best snappy all-rounder |
| MiniMax-M2.7 (230B/10B active) | ✅ 9/10 | ✅ thorough | ~3.7 s | Always reasoning — heavy work |
| Qwen3.6-35B-A3B (thinking off) | ✅ | ✅ | ~1.4 s | Solid and newer |
| Gemma 4 31B (dense) | ✅ | ✅ cleanest TypeScript | ~1.4 s | Careful coder |
| Gemma 4 26B-A4B | ✅ | ⚠️ messy | ~960 ms | Cheap mass ops |
| Llama 4 Scout | ❌ mangled args | — | ~1 s | Host parser broken |

Tested against realistic agent request shapes (many tools, long context). Hosting providers intentionally omitted.
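"Mangled args" is the failure mode worth testing for explicitly: the model emits a tool call, but the arguments don't parse or drop required fields. A sketch of the check used to score that column, where `ToolSchema` is a deliberately simplified stand-in for a real JSON Schema object:

```typescript
// Simplified tool schema: just a name and its required argument keys.
interface ToolSchema {
  name: string;
  required: string[];
}

interface ToolCallResult {
  ok: boolean;
  reason?: string;
}

// Score one tool call's raw argument string against its schema.
function checkToolCall(schema: ToolSchema, rawArgs: string): ToolCallResult {
  // Truncated or mangled output usually fails right here, at JSON.parse.
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "arguments are not a JSON object" };
  }
  const args = parsed as Record<string, unknown>;
  const missing = schema.required.filter((k) => !(k in args));
  return missing.length > 0
    ? { ok: false, reason: `missing required args: ${missing.join(", ")}` }
    : { ok: true };
}
```

The "9/10" style scores above are just the pass rate of this check across repeated runs of the same tool-heavy prompt.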

🧠 Takeaways

  • "Function calling: Yes" on a spec sheet doesn't mean it works in production. One widely-praised model truncated arguments at 8 tokens on its main host.
  • Reasoning-on-by-default models feel brilliant on benchmarks and sluggish in a chat window.
  • For agent workloads, input cost dominates โ€” pick on input price, not output.
  • Mixed-model strategy beats single-model. Snappy tasks on a fast cheap model, deep reasoning on a slower one โ€” each role gets the right tool.
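The input-cost point falls out of simple arithmetic: an agent turn resends the whole prompt (tool schemas plus conversation history) while generating a short reply. A sketch with illustrative placeholder prices, not any provider's real rates:

```typescript
// $ per 1M tokens, input vs output. Placeholder numbers for illustration.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// True cost of one agent turn: prompt in, reply out.
function costPerTurn(
  p: Pricing,
  inputTokens: number,
  outputTokens: number
): number {
  return (
    (inputTokens / 1e6) * p.inputPerMTok +
    (outputTokens / 1e6) * p.outputPerMTok
  );
}

// A typical agent turn: ~12k tokens of tools + history in, ~300 tokens out.
const turn = costPerTurn({ inputPerMTok: 0.2, outputPerMTok: 0.8 }, 12_000, 300);
// Input side: $0.0024. Output side: $0.00024. Even with output priced 4x
// higher per token, input is ~10x of the bill for this shape of request.
```

That ratio only gets more lopsided as tool count and history grow, which is why the takeaway says to pick on input price.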

🔭 Coming Soon: Open Model Benchmark

Building a public benchmark page — reliable tool calling, long context under load, true cost per turn, and how models behave with 30+ tool schemas in play. Open methodology, reproducible scripts. Bookmark — it's coming.