Testing LLMs is hard, doubly hard when the testplan and code are vibecoded

7 May 2026

Part 4/4 (Part 4 ← Part 3 ← Part 2 ← Part 1). May 2026, co-authored with Gemma 4. The transition from Exam V2 to V3 was forced by the discovery that the V2 harness produced invalid scoring data, masking model instability and quantization failures. Setup. Hardware: Framework 13 (Ryzen AI 370HX, 64GB DDR5). Inference: llama-swap via Vulkan. KV cache: q8_0 (fixed across all runs). Environment: Go-based scraper resilience task (buffering, eviction, background flush).

WHY Are Local LLMs So Slow On My Framework 13 AMD Strix Point

10 April 2026

February 2026, co-authored with Claude Opus 4.6. Part 2/4 (Part 4 ← Part 3 ← Part 2 ← Part 1). Yes, the title is clickbaity :>. Veritasium has a great video about why clickbait is unreasonably effective, and I’ve been dying to try it on a technical post. The irony is that the actual content is the opposite of clickbait: every claim is backed by a shell command, every number derived from first principles.

Gemma 4 vs Qwen3.5: benchmarking quantized local LLMs on Go coding

10 April 2026

April 2026. Part 3/4 (Part 4 ← Part 3 ← Part 2 ← Part 1). In episode 1, three models tied at 13/15. The test was too easy: it couldn’t separate a good model from a mediocre one having a lucky run. Since then Qwen3.5, Gemma 4, and Qwen3-Coder dropped. They’d all tie too. We needed a harder exam and better methodology. We also suspected (correctly) that single-seed results were noise and that our grep-based scoring was garbage, so we planned for multi-seed runs and real test execution from the start.

I benchmarked 8 local LLMs writing Go on my Framework 13 AMD Strix Point

10 April 2026

Part 1/4 (Part 4 ← Part 3 ← Part 2 ← Part 1). Feb 2025. I have a Framework 13 with a Ryzen AI 370HX and a bunch of GGUF models accumulating in ~/.cache/llama.cpp/. I wanted to know whether any of them can actually write Go that compiles and runs. Not vibes, not leaderboard numbers: go build says yes or no. The goal was to get a sense of where local models stand in practical capability, given their limited size and the available RAM and compute.

How we’ve improved Dune API using DuckDB

1 August 2024

At Dune, we value our customers’ feedback and are committed to continuously improving our services. This is the story of how a simple, prioritized feature request for the Dune API (supporting query result pagination for larger results) evolved into a comprehensive improvement involving the adoption of DuckDB at Dune. We’ve learned a lot during this journey and are excited to share our experiences and the new functionalities we’ve been building. Motivation & Context: The journey began with user feedback and a repeated feature request: “Dune API doesn’t support pagination, and the maximum size of query results is limited (~1GB).

Miguel Filipe