Testing LLMs is hard, doubly hard when the test plan and code are vibecoded

Part 4/4

May 2026 — co-authored with Gemma 4

We moved from Exam V2 to V3 after discovering that the V2 harness produced invalid scores, which masked model instability and quantization failures.

Setup

Hardware: Framework 13 (Ryzen AI 9 HX 370, 64GB DDR5).
Inference: llama-swap via Vulkan.
KV Cache: q8_0 (fixed across all runs).
Environment: Go-based scraper resilience task (buffering, eviction, background flush).

The Problem: V2 Harness Flaws

The Exam V2 harness relied on shell scripts and grep-based evaluation. This “vibecoded” approach introduced three critical failures:

Denominator Inflation: When a model’s code panicked and crashed the go test suite, the harness only counted the tests that had successfully run. This allowed models that crashed early to achieve 100% scores (e.g., 2/2 instead of 2/10).
Masked Instability: The loose scoring made it impossible to distinguish between a “bad” model and an “unstable” one. We saw high variance in Qwen and GPT-OSS across seeds that the harness failed to flag as high-uncertainty.
Fragile Parsing: Using grep to check for behavioral markers was prone to false positives/negatives, especially when models hallucinated markers or failed to follow formatting.
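The denominator inflation bug is easiest to see as arithmetic. The following is a minimal sketch, not the V2 harness code: `runResult`, `inflatedScore`, and `fixedScore` are hypothetical names introduced here for illustration, and the numbers are the 2/2 versus 2/10 example from above.

```go
package main

import "fmt"

// runResult is a hypothetical summary of one `go test` run.
type runResult struct {
	Passed int // tests that reported a pass
	Ran    int // tests that produced any result before the process exited
}

// inflatedScore mimics the V2 mistake: the denominator is whatever ran.
func inflatedScore(r runResult) float64 {
	if r.Ran == 0 {
		return 0
	}
	return float64(r.Passed) / float64(r.Ran)
}

// fixedScore divides by the size of the full suite, so tests that
// never ran count as failures.
func fixedScore(r runResult, suiteSize int) float64 {
	if suiteSize <= 0 {
		return 0
	}
	return float64(r.Passed) / float64(suiteSize)
}

func main() {
	// A panic after two passing tests in a ten-test suite.
	crashed := runResult{Passed: 2, Ran: 2}
	fmt.Printf("inflated: %.0f%%\n", 100*inflatedScore(crashed)) // inflated: 100%
	fmt.Printf("fixed:    %.0f%%\n", 100*fixedScore(crashed, 10)) // fixed:    20%
}
```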

Trigger for rebuild: High-tier models (Qwen 3.6) were producing lower scores than lower-tier models, indicating the measuring stick was broken.

The Solution: Exam V3

V3 is a complete rewrite of the evaluation logic, moving from shell-based orchestration to an in-process Go grader.

In-process execution: The grader runs the tests directly, eliminating shell-parsing errors.
Deterministic scoring: Results are read from the structured output of go test -race -json (parsed with jq) instead of grep matches on free-form text.
Fixed Denominator: Every run is measured against the full test suite (13 tests), regardless of whether the process crashes.
Performance: Evaluation time per submission dropped from ~60s to ~4s.
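One way to combine the JSON output with a fixed denominator is sketched below. This is an illustration under assumptions, not the actual V3 grader: it reads only the `Action` and `Test` fields that `go test -json` emits, the test names are hypothetical, and `score` and `scoreString` are names made up for this example. The key point is that the total comes from the list of expected tests, never from what happened to run.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// event holds the two fields of a `go test -json` record this sketch reads.
type event struct {
	Action string
	Test   string
}

// score returns how many of the expected tests reported "pass".
// The total is always len(expected), so tests that never ran because
// the binary panicked or failed to compile count as failures.
func score(r io.Reader, expected []string) (passed, total int, err error) {
	total = len(expected)
	seen := make(map[string]bool, total)
	for _, name := range expected {
		seen[name] = false
	}
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long output lines
	for sc.Scan() {
		var ev event
		if json.Unmarshal(sc.Bytes(), &ev) != nil {
			continue // not a JSON event (for example a compiler error line)
		}
		if _, ok := seen[ev.Test]; ok && ev.Action == "pass" {
			seen[ev.Test] = true
		}
	}
	if scanErr := sc.Err(); scanErr != nil {
		return 0, total, scanErr
	}
	for _, ok := range seen {
		if ok {
			passed++
		}
	}
	return passed, total, nil
}

// scoreString is a convenience wrapper for in-memory input.
func scoreString(s string, expected []string) (int, int, error) {
	return score(strings.NewReader(s), expected)
}

func main() {
	expected := []string{"TestBuffer", "TestEvict", "TestFlush"} // hypothetical names
	// TestBuffer passes, TestEvict panics, TestFlush never starts.
	stream := `{"Action":"run","Test":"TestBuffer"}
{"Action":"pass","Test":"TestBuffer"}
{"Action":"run","Test":"TestEvict"}
{"Action":"output","Test":"TestEvict","Output":"panic: nil map\n"}
`
	passed, total, err := scoreString(stream, expected)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("%d/%d\n", passed, total) // prints 1/3
}
```

Lines that are not valid JSON are skipped rather than treated as fatal, so a compile failure simply yields zero passes against the full denominator.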

The harness is exam-driver.go — a generic runner that submits the prompt to llama-swap over HTTP, saves the response, invokes the per-exam evaluator, and collects the score. The exam itself lives in blogpost/benchmarking_llms/bench/exam_v3/: the prompt given to the model, the scraper.go file it has to modify, the grader_test.go suite it’s scored against, and the eval.sh scorer wrapping the grader. A reference solution is in scraper_solution.go.
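The submit step of such a driver could look roughly like the sketch below. It is not taken from exam-driver.go. It assumes llama-swap is reachable through an OpenAI-compatible /v1/chat/completions endpoint and that the backend accepts a `seed` field; the `LLAMA_SWAP_URL` environment variable, the model name, the output file name, and all function names are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
	Seed     int       `json:"seed"` // assumption: the backend honors this field
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

// buildRequest encodes a single-turn chat request for one model and seed.
func buildRequest(model, prompt string, seed int) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
		Seed:     seed,
	})
}

// submit posts the request and returns the text of the first choice.
func submit(ctx context.Context, baseURL string, body []byte) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("response contained no choices")
	}
	return out.Choices[0].Message.Content, nil
}

func run() error {
	body, err := buildRequest("example-model", "PROMPT TEXT GOES HERE", 1)
	if err != nil {
		return err
	}
	baseURL := os.Getenv("LLAMA_SWAP_URL") // hypothetical variable name
	if baseURL == "" {
		fmt.Println(string(body)) // no server configured: show the payload only
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	answer, err := submit(ctx, baseURL, body)
	if err != nil {
		return err
	}
	// Save the raw response so the evaluator can be re-run without the model.
	return os.WriteFile("response.txt", []byte(answer), 0o644)
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, "driver failed:", err)
		os.Exit(1)
	}
}
```

Saving the response before grading keeps generation and evaluation separate, so a grader fix does not require regenerating model output.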

Results (Clean Rerun)

Note: All models were re-run to ensure a clean baseline on the new harness.

Model	Best seed (passed/13)	Mean over 3 seeds (passed, of 13)	Notes
Gemma 4 26B (MXFP4 + KV:Q8)	11/13	7.33	Most stable; consistent across seeds.
GPT-OSS 20B (MXFP4 + KV:Q8)	11/13	5.33	High variance: `5 / 0 / 11`.
Qwen 3.6 35B (Q5_K_M + KV:Q8)	7/13	2.33	Strongest Qwen; highly seed-dependent.
Qwen 3.5 35B (Q6_K + KV:Q8)	6/13	2.00	Significant performance drop.
Qwen 3.5 35B (MXFP4 + KV:Q8)	0/13	0.00	Consistent compile failure.

Key Findings

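Quantization Instability: Qwen 3.5 MXFP4 failed to compile on every seed (0/13), while the Q6_K build of the same model reached 6/13 at best. That points at the current Unsloth MXFP4 GGUF (the quantization or its packaging) rather than the model itself, though we have not yet confirmed which.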
Seed Variance is Real: GPT-OSS scored 5, 0, and 11 on three seeds, so a single-seed run could have reported anything from 0/13 to 11/13. Single-seed benchmarks are useless for assessing coding reliability.
In-process is mandatory: The speed and accuracy gains of V3 make the shell-based V2 approach untenable for scaling.
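Reporting spread alongside the mean makes the seed variance visible. The sketch below is illustrative only: `summary` and `summarize` are hypothetical names, and the sample standard deviation is one reasonable choice of spread measure, not something the harness is known to compute. The input is the GPT-OSS 20B per-seed scores from the results table.

```go
package main

import (
	"fmt"
	"math"
)

// summary describes the per-seed scores of one model.
type summary struct {
	Mean, StdDev float64
	Min, Max     int
}

// summarize computes the mean, sample standard deviation, and range.
func summarize(scores []int) summary {
	if len(scores) == 0 {
		return summary{}
	}
	s := summary{Min: scores[0], Max: scores[0]}
	sum := 0
	for _, v := range scores {
		sum += v
		if v < s.Min {
			s.Min = v
		}
		if v > s.Max {
			s.Max = v
		}
	}
	s.Mean = float64(sum) / float64(len(scores))
	if len(scores) > 1 {
		var sq float64
		for _, v := range scores {
			d := float64(v) - s.Mean
			sq += d * d
		}
		// Divide by n-1 because the seeds are a sample, not the population.
		s.StdDev = math.Sqrt(sq / float64(len(scores)-1))
	}
	return s
}

func main() {
	s := summarize([]int{5, 0, 11}) // GPT-OSS 20B, three seeds
	fmt.Printf("mean %.2f, stddev %.2f, range %d..%d of 13\n",
		s.Mean, s.StdDev, s.Min, s.Max)
}
```

For these three scores the mean is 5.33, matching the table, while the standard deviation is about as large as the mean itself, which is the signal a single-seed run cannot show.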

What we got wrong

We assumed a shell-script wrapper was sufficient for complex behavioral testing. We prioritized ease of implementation over deterministic measurement, which led to the “denominator inflation” bug.

Next Steps

Provider Audit: Test alternative Qwen GGUF providers to see if the MXFP4 failure is a model issue or a packaging/quantization issue.
Phase 2 (Agentic Harness): Move away from synthetic Go tasks toward evaluating models on real-world tasks (analyzing actual repos, logs, and manifests).

Footnote:

All models were run with reasoning disabled. Worth a follow-up exam with reasoning enabled to see how much it moves the needle — especially for Qwen 3.5 / 3.6, which lean heavily on chain-of-thought.
Still surprised Qwen 3.6 didn’t score better; Gemma 4 totally kicked butt on this test.
vibecoded on hermes + gemma4 models