The Gap Between Claude and Local: Can a Self-Hosted Coding Agent Compete?

Posted by John
Tags: LLMs, Claude, Qwen, Local Inference, LM Studio, Agentic Coding

I set out to find how big the gap between a Claude subscription and a self-hosted setup actually is, and whether a local coding agent is viable for real work. The experiment ran in two phases. First, five arms (four local open-weight variants chosen to span the capability/context/speed tradeoff, against Claude Opus 4.7 as the cloud baseline) each designed a complete Playwright E2E test suite for a real Laravel + Livewire app from scratch. Then the best plan was handed back out to be built: Claude Code against the strongest local arm, head-to-head on the same plan. This post is as much a practical guide to getting real work out of a 24 GB card as it is a contest. It covers how to choose the model, the quant, and the context window so a long agent run has room to finish without compacting; the agentic-coding tips I use and teach at work; and one structural advantage Claude has that doesn't show up in any model card.

This is not meant to be a fair fight. Quant levels and context windows are chosen to fit 24 GB of VRAM under realistic conditions, not to give each model its theoretical best precision. The point isn't “who wins a level playing field”; it's “given the constraints anyone would actually face, what does the gap look like for real work?”

Who Is This For?

Let's be honest about the audience. This is for tinkerers and people with strict privacy requirements. It is not a cost play.

An RTX 4090 costs ~$1,600–2,000. A Claude Max subscription starts at $100/month. That's 16–20 months of Claude for the price of the GPU alone, and that's before the rest of the rig, the electricity, or your time configuring llama.cpp settings. And if you already own the hardware, the experience gap between Claude Code and local models is significant enough that the subscription is likely worth it. If you can afford a $2,000 GPU, you can afford $100/month.

Running locally costs you two kinds of parallelism. The first is the human kind: with a subscription you can keep one Claude session writing code in this repo, another helping debug an unrelated bug, and a third drafting an email. Three simultaneous conversations, each with its own state, none of them fighting over your local hardware. With one 24 GB GPU and one loaded model, you have one conversation at a time. Period.

The second is the agent kind, and it's the one that matters more. Claude Code can spawn subagents that explore in parallel: three reads of three different directories, each running concurrently in its own fresh context and returning a summary. The obvious benefit is wall-clock speed. The bigger one is context efficiency, because the main conversation never sees the raw subagent transcripts, only the distilled summaries. A local single-conversation agent has to do all of that exploration in the same thread it'll later use for planning and writing code, so every grep, every file read, and every dead end becomes permanent context bloat.

The real reasons to go local:

  • Privacy: your code never leaves your machine. No API calls, no telemetry, no third-party data processing.
  • Tinkering: you want to understand how local inference works, optimize VRAM budgets, and experiment with open models.
  • Availability: no rate limits, no outages, no dependency on someone else's infrastructure.
  • Offline use: works without internet, on a plane, in an air-gapped environment.

The Task

The test is a real coding task on a real codebase: design a complete Playwright E2E test suite for gotflashes (pinned at commit 22d2155), an open-source Laravel 12 + Livewire v3 sailing-activity tracker. The app has 186 PHPUnit tests covering its data layer but zero browser coverage. The gap is everything that only matters in a browser: flatpickr multi-date pickers, TomSelect dropdowns, Livewire morph.updated race conditions, toast notifications, DaisyUI modals, and HTML quality across public, authenticated, and admin pages. A real Playwright suite has to complement PHPUnit, not duplicate it.

Each agent gets a single prompt asking for a complete seven-section plan (a test inventory through to HTML validation) covering seven feature areas (auth through admin), five named JavaScript modules, and the cross-feature flows. What separates the plans is three load-bearing constraints the prompt foregrounds: dynamic dates (the app derives allowed ranges from now()), Laravel-seeded fixtures with storageState auth, and observable-state waits (not waitForTimeout) for Livewire DOM transitions. A good plan folds those in as premises; a weak one treats them as checkboxes.

Five model arms run that prompt: four locally hosted configurations through OpenCode (the open-source coding agent), and Claude Opus 4.7 through Claude Code with the 1M-token context window enabled and reasoning effort set to extra-high, against my existing Claude Max subscription.

The task is deliberately multi-skill. Codebase comprehension across 40-odd files. Specification writing with a prescribed structure. Instruction-following (exactly seven sections, no extras, no matter how much the model wants to add a section on accessibility or performance). And constraint awareness for the three load-bearing concerns above. A plan that's right on structure but wrong on selectors is half a plan; a plan that names the right selectors but adds three unrequested subsections is also half a plan. The grading reflects that.

Why two harnesses?

A purist would run all five arms through a single coding-agent wrapper (same tool-call protocol, same UI, same conventions) to keep everything controlled but the model. I didn't. Claude ran in Claude Code (Anthropic's official CLI, against my Claude Max subscription), and the local models ran in OpenCode (open source). Three reasons:

  1. Cost. Running Claude inside OpenCode means paying the Anthropic API per token, on top of the Claude subscription I already pay for. The subscription's value proposition is “all you can eat”; switching to metered API billing for the same model defeats it. Most people who'd reach for Claude on a real coding task are paying for the subscription, not the API.
  2. The reverse doesn't work well. Using third-party models inside Claude Code exists as a workaround, but it isn't well supported: you can't set context size, sampler parameters aren't first-class, and the OpenAI-compatible endpoint conventions are second-class citizens. Losing the context-size knob is the worst of those: the agent needs to know its real window to manage compaction at the right threshold, and compaction is a capability event, not just a workflow event (see Fitting the Local Arms below). If Claude Code assumes a 200k window while the local model is actually loaded at 163k, the compaction trigger fires too late (hitting hard context-limit errors mid-run) or never at all (truncation begins silently and the agent stops noticing what it can no longer see). On 24 GB of VRAM context is also the binding hardware constraint, so you need that knob for two reasons, not one.
  3. It fits the proprietary-vs-open-source story. Claude Code is closed; OpenCode is open. Each agent is being used the way someone who'd committed to that side of the divide would actually use it. “Claude through OpenCode” or “Qwen through Claude Code” are exotic configurations; “Claude Code with Claude” and “OpenCode with local models” are what people run.

The cost is that some differences between the runs aren't purely about the underlying model. Claude Code spawns subagents; OpenCode (currently) doesn't. Claude Code has a dedicated Plan Mode artifact; OpenCode's plan mode lives in a chat message. I call those out where they affect the results; they're named confounds, not hidden ones.

Why I judge on plans

Agentic-coding tip: always start in plan mode. A plan is fast to read, and reviewing it does three jobs at once: it confirms the model actually understood the task, it exposes the model's biases before they become code (the assumptions that steer it somewhere you didn't want to go), and it makes the code review that follows fast. Once you've signed off on the plan, the diff is mostly what you expected, so you're skimming for deviations instead of reverse-engineering intent. Catch the wrong direction in a two-minute read, not a debug cycle per file.

Evaluating that plan is the core of this experiment. The plan is the artifact worth grading on its own.

A plan is the model's first-pass thinking made visible: which files it read, what it's committing to, and, most usefully, where its biases are pointing it. A model that's decided the app uses /flashes routes, or that waitForTimeout is a fine way to wait on Livewire, tells you so in the plan, before that assumption has metastasized into forty spec files. Those biases don't disappear at implementation time; they become bugs and rework. Plan quality is a leading indicator of what the implementation looks like: bad plan, noisy implementation; good plan, clean one.

The plan is also independently easier to measure. Plans take minutes where implementations take hours. They fit on a few pages and read side-by-side; implementations are tens of thousands of lines you'd never compare manually. If you want to answer “is this model good enough for real coding work” without committing a day to each candidate, the plan answers most of the question.

The implementation experiment comes later in the post, and it produces one more useful data point that emerges naturally from the design. Once I name a winning plan, the obvious move is to hand it to a local model and see what falls out. That's the hybrid pattern the vendors themselves recommend (Anthropic suggests Opus to plan, Sonnet to execute), and it answers whether the cloud is really needed for the whole pipeline or only the design phase.

The rest goes in three parts: first the local-inference setup and the dial-in surprises (the ones that cost me a weekend), then the five-arm plan comparison, then the Claude-vs-local implementation head-to-head. If you're here for the verdict and not the VRAM math, skip ahead to How Each Plan Performed.

Hardware & Software: A Dev Machine, Not a Dedicated Inference Box

This is important context for everything that follows: these models are running on my personal workstation in its standard/everyday configuration. Multiple monitors are plugged into the RTX 4090. Browsers, Slack, IDE, Discord: they're all running while LM Studio loads a 17 GB model. The Desktop Window Manager (DWM) is compositing windows on the same GPU that's doing inference. This isn't a sterile lab setup with displays routed to an iGPU; it's a daily-driver workstation that occasionally hosts an LLM.

That matters because DWM and other GPU clients consume 1.5–4 GB of VRAM, and that number drifts as you work. Every context-size and KV-quant choice in this post is downstream of that constraint. I considered moving displays to the AMD iGPU to reclaim that VRAM. I decided not to: it'd mean tearing down a dev environment to validate a tradeoff most readers won't make either. The numbers in this post are what you get when local LLM inference shares the GPU with your day-to-day work. If that's your actual scenario, the findings transfer cleanly. If you're running headless or on a dedicated inference rig, treat these numbers as conservative.

Component | Detail
GPU | NVIDIA RTX 4090 (24,564 MiB VRAM)
Display setup | Multiple monitors plugged into the 4090 (not iGPU-routed)
CPU | AMD Ryzen 9 9950X3D
Inference runtime | LM Studio 0.4.12 (Build 1), llama.cpp CUDA 12 runtime v2.14.0
Coding agent | OpenCode (models registered with the context limits from The Models table below)
OS | Windows 11 (WSL2 for OpenCode)
Observed DWM + apps baseline | ~1.5 GB at clean boot, drifting to ~3–4 GB during normal use

The Models

The question this experiment asks is narrower than “what's the best open-weight coding model?” It's this: what's the best you can do with open weights on a flagship consumer GPU? The unconstrained leaderboards (Aider, SWE-bench, LiveCodeBench) are topped by large mixture-of-experts models (DeepSeek, Kimi, and the like) that need far more than 24 GB and never get to run here. Draw the box at a single 24 GB card and the field thins fast; the champion of that bracket is Qwen, the strongest open-weight family that fits, usually named alongside Mistral's Devstral for local work. So reaching for a Qwen variant isn't what's being tested; it's the premise. What's being tested is whether the best answer to that constrained question lands close enough to the closed-weight leader to be worth running locally at all.

Scope note: “flagship consumer GPU” means GPU VRAM: the 4090's 24 GB. I'm deliberately not counting the other local route, where you offload a much larger model to CPU or run it on a Mac's unified memory at a handful of tokens per second. That works, and plenty of people do it, but an interactive coding agent generating at 3 tok/s isn't a workflow I'd use. Throughput is part of the question here, not just whether the weights fit.

The experiment tests four local configurations because model size forces a three-way tradeoff: capability (larger models reason better), context window (larger models leave less room for context), and speed (larger models generate fewer tokens per second). A 9B model fits 262k context with 8-bit weights; a 27B model at 4-bit weights fits under 100k. The real question is which point on the capability/context/speed curve produces the best plan on 24 GB of VRAM.

Five arms total: four local Qwen configurations plus Claude Opus 4.7 as the cloud baseline. The local models come from Unsloth's GGUF releases. Qwen 3.6 ships in two sizes only (27B dense, 35B-A3B MoE), so the small/long-context slot stays on Qwen 3.5-9B.

Arm | Source | Quant | File size | Active params/tok | Context (loaded) | Gen tok/s
Qwen 3.5-9B | unsloth/Qwen3.5-9B-GGUF | UD-Q8_K_XL | 12.7 GB | 9B (dense) | 262,144 | 52-57
Qwen 3.6-27B (Q4 arm) | unsloth/Qwen3.6-27B-GGUF | UD-Q4_K_XL | 17.6 GB | 27B (dense) | 98,304 | 43 → 22 (throttled)
Qwen 3.6-27B (Q3 follow-up) | unsloth/Qwen3.6-27B-GGUF | UD-Q3_K_XL | 16.3 GB | 27B (dense) | 163,840 | 50.8
Qwen 3.6-35B-A3B | unsloth/Qwen3.6-35B-A3B-GGUF | UD-Q4_K_XL | 23 GB | 3B (MoE, 8+1 of 256 experts) | 131,072 | 45-78
Claude Opus 4.7 (cloud baseline) | Anthropic API | n/a | n/a | n/a | 1,000,000 (effective: higher via subagents) | ~60-100

KV cache: why Q8, not FP16 and not Q5_1

KV cache: the keys and values each attention layer computes per token, held in VRAM so they aren't recomputed on every following token. It grows linearly with context length (a longer conversation costs more memory), which is why, on a fixed 24 GB, context size and VRAM are the same knob.

Reading the precision: FP16 stores each cached value in a full 16 bits; Q8 and Q5_1 pack it into 8 or ~5. Same bit-width idea as the model-weight quants (UD-Q4_K_XL and friends) in the arms table above, just applied to the cache, not the weights.

All arms use Q8/Q8 KV cache. FP16 (llama.cpp's default) doesn't fit: the 27B at 163k context needs ~10.8 GB just for KV, blowing past 24 GB once you add the model itself. Q8 halves that cost for a quality loss small enough that broad consensus treats Q8 KV as effectively lossless, and its unpack is nearly free: an 8-bit value is a whole byte, so dequant is one scale-multiply on a contiguous read, and it stays on the fast attention kernel. It's the free downward step.

Going lower backfires. Q5_1 saves another ~1.3 GB of VRAM on paper, but 5-bit values aren't byte-aligned: each one has to be bit-unpacked, off the fast kernel path, on every attention pass, every layer, every step. On a 24 GB GPU with limited compute headroom, that overhead came directly out of throughput: gen dropped from 43 → 18 tok/s. Worse, at high context fill the model started emitting completions with zero content tokens: pure thinking-mode reasoning that hit max_tokens before producing a final answer. Not gibberish; not a wrong answer; an empty response. I went back to Q8 across the board.
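To make the cache-precision tradeoff concrete, here is a rough sizing sketch. It assumes the KV cache scales linearly with context (as described above) and takes the ~10.8 GB FP16 figure at 163,840 tokens as its only anchor. The bits-per-value numbers come from llama.cpp's 32-value block layouts, where each block also stores its own scale. Treat the output as an estimate: it will not necessarily reproduce the measured savings quoted in this post, which depend on the context each test was loaded at.

```python
# Bits stored per cached value, including per-block overhead.
# llama.cpp block layouts (32 values per block):
#   Q8_0: 32 int8 values + one fp16 scale              -> 34 bytes -> 8.5 bits/value
#   Q5_1: 32 5-bit values + fp16 scale + fp16 minimum  -> 24 bytes -> 6.0 bits/value
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0}

# Anchor taken from the post: ~10.8 GB of FP16 KV at 163,840 tokens (27B model).
ANCHOR_GB_F16 = 10.8
ANCHOR_CONTEXT = 163_840


def kv_cache_gb(context_tokens: int, cache_type: str) -> float:
    """Estimate KV cache size by scaling the FP16 anchor linearly."""
    per_token_gb_f16 = ANCHOR_GB_F16 / ANCHOR_CONTEXT
    scale = BITS_PER_VALUE[cache_type] / BITS_PER_VALUE["f16"]
    return per_token_gb_f16 * context_tokens * scale


if __name__ == "__main__":
    for ctx in (98_304, 163_840):
        row = ", ".join(
            f"{t}={kv_cache_gb(ctx, t):.2f} GB" for t in BITS_PER_VALUE
        )
        print(f"{ctx:>7} tokens: {row}")
```

The per-block scale is why Q8_0 lands slightly above half of FP16 rather than at exactly half.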

Inference Settings

LM Studio Load tab (applies to all arms)

  • Max Concurrent Predictions: 1
  • Evaluation Batch Size: 512
  • Flash Attention: ON
  • Unified KV Cache: ON
  • K Cache / V Cache: Q8_0 / Q8_0
  • Try mmap(): ON
  • Keep Model in Memory: ON
  • Offload KV Cache to GPU Memory: ON
  • Limit model offload to dedicated GPU memory: OFF

[Figure: LM Studio load tab settings for the Qwen 3.6 27B model]

Qwen Configuration

Per Unsloth's Qwen 3.6 documentation for “Precise Coding Tasks” in thinking mode:

  • Temperature: 0.6
  • Top-K: 20
  • Top-P: 0.95
  • Min-P: 0.0
  • Repeat penalty: 1.0
  • Presence penalty: 0.0
  • Thinking: ON

(3.6's non-thinking coding profile uses presence_penalty=1.5, but Unsloth explicitly recommends thinking mode for code-focused work on 3.6. I followed that across all Qwen 3.6 arms. The Qwen 3.5-9B used the same precise-coding profile with thinking disabled by default.)

[Figure: LM Studio inference settings for the Qwen 3.6 models]

Fitting the Local Arms on 24 GB

What matters for local inference speed isn't whether the model fits at load time, but how much VRAM is free during generation, for the dynamic compute buffers attention needs. The two numbers diverge sharply: the 27B at 110k showed 3.9 GB free right after loading but only ~265 MB free under nvidia-smi mid-generation. And the gap is unforgiving: once it forces KV cache to spill to CPU, generation craters from 33 to 5–10 tok/s, because KV is re-read by every attention head every token. The column that matters in the budget below isn't “free at load,” it's live free during gen.
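One way to capture the "live free during gen" number is to poll nvidia-smi while a long generation is running and keep the lowest value seen. The sketch below is a minimal version of that idea, not the exact procedure behind the numbers in this post. The function names are mine; the `--query-gpu` and `--format` flags are standard nvidia-smi options, and the values come back in MiB.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, free' CSV line (values in MiB) from nvidia-smi."""
    used, free = (int(field.strip()) for field in line.split(","))
    return used, free


def sample_free_mib(gpu_index: int = 0) -> int:
    """Return the free VRAM in MiB for one GPU right now."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    lines = [ln for ln in out.splitlines() if ln.strip()]
    return parse_memory_line(lines[gpu_index])[1]


def min_free_during(seconds: float, interval: float = 0.5) -> int:
    """Poll for `seconds` and return the lowest free MiB observed."""
    lowest = sample_free_mib()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(interval)
        lowest = min(lowest, sample_free_mib())
    return lowest
```

Start a generation that fills a good share of the context, then call `min_free_during(60)` from another terminal. Compare the result with the figure right after load.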

Arm | Context (KV quant) | Free at load | Live free during gen | Stability
Qwen 3.5-9B | 262k Q8 | ~8,200 MiB | ~4,500 MiB | Bulletproof
Qwen 3.6-27B (Q4_K_XL) | 98k Q8 | 3,270 MiB | 220 MiB | VRAM-sensitive
Qwen 3.6-27B (Q3_K_XL) | 163k Q8 | 4,088 MiB | 676 MiB | Healthy
Qwen 3.6-35B-A3B (5 CPU layers) | 131k Q8 | 3,321 MiB | 95-179 MiB across fill | OK

Full per-arm MiB breakdown (model and KV-cache footprints) is in docs/dialed-in-settings.md.

The four arms are really three models, and three different answers to one question: what do you give up to fit a capable coder in 24 GB?

[Figure: four radar charts, one per arm (27B · Q4, 27B · Q3, 35B-A3B, 9B), each with five axes: Context, Speed, Precision, Headroom, Plan.]
Each axis 0–10, normalized from the tables above (context is log-scaled; precision is quant fidelity, not output quality). Shapes are for comparing tradeoff silhouettes, not exact values; the receipts are in the tables. Note the 9B fills every hardware axis yet collapses on Plan: all the headroom in the world, weakest output.

27B: trade precision for context

The dense 27B at UD-Q4_K_XL is the obvious “quality” pick, Unsloth's highest-quality 4-bit quant, imatrix-calibrated to protect the attention layers that carry meaning. Unsloth's docs put its recommended minimum context at 110k, which is also where it dials in on clean VRAM. But on a daily-driver desktop I could only fit 98k before live-free VRAM dropped into the thrashing zone, already below that recommended floor, and at 98k the planning task wanted more state than that, so the run hit OpenCode's compaction trigger twice.

Agentic-coding tip: compaction degrades capability, not just speed. When OpenCode hits its compaction trigger (around 70% of the context window), it summarizes prior context to free room. The summary preserves what the agent remembers it knew, not the raw evidence. If the dropped content was load-bearing (route definitions, real selectors, schema columns), the agent confidently invents replacements for what it can no longer see. Compaction reads as a workflow inconvenience but it's actually a quality event: a model that hits compaction at the wrong moment loses fidelity in ways no benchmark tracks.
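The trigger itself is simple arithmetic, and it is worth running for your own context sizes before a long session. This sketch assumes a flat 70% threshold, which is the approximate figure quoted above rather than a documented OpenCode constant. The function names are hypothetical, and the context sizes are the loaded values from The Models table.

```python
# Approximate trigger from the post, expressed as a whole percentage so the
# arithmetic stays in integers. Not a documented OpenCode constant.
COMPACTION_PERCENT = 70

LOADED_CONTEXT = {
    "Qwen 3.5-9B": 262_144,
    "Qwen 3.6-27B Q4": 98_304,
    "Qwen 3.6-27B Q3": 163_840,
    "Qwen 3.6-35B-A3B": 131_072,
}


def compaction_threshold(window_tokens: int, percent: int = COMPACTION_PERCENT) -> int:
    """Token count at which the harness would start summarizing context."""
    return window_tokens * percent // 100


def should_compact(used_tokens: int, window_tokens: int) -> bool:
    """True once the conversation has reached the compaction threshold."""
    return used_tokens >= compaction_threshold(window_tokens)


if __name__ == "__main__":
    for arm, window in LOADED_CONTEXT.items():
        print(f"{arm}: compacts near {compaction_threshold(window):,} of {window:,} tokens")
```

The practical reading: the usable window is the threshold, not the loaded context, so that is the number to size a task against.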

Rather than cut context further (which just triggers compaction earlier), I dropped one bit of precision instead. UD-Q3_K_XL is only ~1.3 GB smaller on disk (16.3 vs 17.6 GB in the arms table) but ~3 GB smaller on the GPU (13,110 vs 16,104 MiB; Q3 packs denser than the file-size delta suggests), and that headroom bought 163k context (more than the Q4 dial-in target) and lifted clean-probe generation from 43 to 51 tok/s. The lesson cuts against instinct: on VRAM-bound hardware, don't reach for the highest-precision quant that fits; reach for the highest-precision quant that fits with room for a healthy compute buffer.
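That rule can be written down as a budget check. The sketch below is illustrative only. The on-GPU weight sizes and total VRAM are the figures quoted in this post. The Q8 KV cost per token is scaled from the ~10.8 GB FP16 figure (read as decimal GB) at 8.5 bits per value. The 2,000 MiB desktop overhead and 3,500 MiB reserve are placeholder assumptions you would replace with your own measurements.

```python
from typing import Optional

TOTAL_VRAM_MIB = 24_564  # RTX 4090, from the hardware table

# Weights resident on the GPU, from the post. Highest precision listed first.
WEIGHTS_MIB = {"UD-Q4_K_XL": 16_104, "UD-Q3_K_XL": 13_110}

# Assumption: Q8_0 KV cost per token, scaled from ~10.8 GB of FP16 KV at
# 163,840 tokens (decimal GB), at 8.5 bits per value instead of 16.
KV_MIB_PER_TOKEN = 10.8e9 * (8.5 / 16) / 163_840 / 2**20


def free_at_load_mib(quant: str, context_tokens: int, desktop_mib: float) -> float:
    """VRAM left after weights, KV cache, and everything else on the GPU."""
    kv_mib = KV_MIB_PER_TOKEN * context_tokens
    return TOTAL_VRAM_MIB - desktop_mib - WEIGHTS_MIB[quant] - kv_mib


def pick_quant(
    context_tokens: int,
    desktop_mib: float = 2_000,   # hypothetical DWM + apps usage
    reserve_mib: float = 3_500,   # hypothetical headroom for compute buffers
) -> Optional[str]:
    """Return the highest-precision quant that still leaves the reserve free."""
    for quant in WEIGHTS_MIB:  # dicts preserve insertion order
        if free_at_load_mib(quant, context_tokens, desktop_mib) >= reserve_mib:
            return quant
    return None


if __name__ == "__main__":
    for ctx in (65_536, 163_840, 262_144):
        print(f"{ctx:>7} tokens -> {pick_quant(ctx)}")
```

The reserve is the parameter that matters: set it to zero and the check collapses back into "does it fit at load," which is the mistake the paragraph above warns against.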

35B-A3B: trade density for cheap offload

The 35B is a mixture-of-experts model: 35B total params, but only ~3B active per token (8 routed experts + 1 shared). That rewrites the offload math. The 23 GB file won't fit whole, but because each token only touches the active experts, pushing layers to CPU is cheap (it costs roughly 1.5 tok/s per offloaded layer, versus 2–3 tok/s for a dense model), and the model keeps running on a sliver of live-free VRAM (95–179 MiB) where the dense 27B would tank. MoE is the one architecture here that degrades gracefully when it spills. (Don't force the expert count down to save memory, though: Qwen 3.6 is trained for its 9 active experts, and overriding that just costs quality.)

9B: trade scale for headroom

The 9B is the counterpoint on the other axis: far smaller in total than the 27B, but every one of its params is active each token, roughly three times the active compute of the 35B MoE. Small total plus near-full (8-bit) precision is why it fits 262k context with ~4.5 GB to spare and never feels VRAM pressure. It's the control arm: when something with no fitting constraints still does something interesting, you know it's the model, not the hardware.

Agentic-coding tip: long sessions get slower, then dumber. Every token an agent generates attends to the whole conversation so far, so generation slows as context fills, and every harness eventually compacts to stay under its limit. The same factors are in play on the cloud; the Claude API's vast resources just mask them, so you feel it as latency and cost rather than a visible collapse. On a fixed 24 GB you feel it sharply: the growing KV cache craters throughput and forces compaction at a far lower ceiling (163k here vs the 1M Claude ran with). Either way, scope each session to one task and start fresh rather than letting one balloon.

Experiment Results: Five-Arm Plan Comparison

I ran the same planning prompt across all five arms against the gotflashes codebase: ~50k tokens of code, templates, and migrations to read before writing a line of plan.

Arm | Context window | Peak used | Tokens generated | Wall-clock | Compactions | Operator pauses
Claude Opus 4.7 (xhigh, via Claude Code) | 1M | ~103k | ~92k | 8:14 | 0 (3 subagents) | 0
Qwen 3.5-9B (UD-Q8_K_XL) | 262k | ~75k | ~15k | 4:29 | 0 | 0
Qwen 3.6-35B-A3B (UD-Q4_K_XL) | 131k | ~105k | ~13k | ~6:00 | 1 | 2
Qwen 3.6-27B (UD-Q4_K_XL) | 98k | ~70k | ~25k | 22:36 | 2 | 4
Qwen 3.6-27B (UD-Q3_K_XL) | 163k | ~105k | ~12k | 7:15 | 1 | 3

The gap between window and peak-used is the tell: OpenCode compacts near 70% of the window, so the local arms never reach their ceilings. The Q4 27B topped out around 70k against a task that wanted ~100k of state (which is why it compacted twice and thrashed), while the 9B coasted at 75k of its 262k and Claude used ~103k of its 1M without coming close.

The tokens-generated column tells the other half: Claude spent multiples more producing its plan (~92k tokens against 12–25k for each local arm), and that figure is mostly extended reasoning (Claude ran at extra-high effort). It's also main-thread only: the three subagents it ran in parallel generated roughly 26k more on top, putting Claude's true total near 118k. More inference thrown at the same prompt, for the top-ranked plan. Whether that's worth it is the rest of the post.

The two tokenize differently (Claude's tokenizer runs denser, especially on code), so read the multiple as approximate, not a clean ratio. Even discounted for that, the bulk of the gap is real reasoning, not accounting.
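For a sense of scale, the multiples can be worked out directly from the tokens-generated column. This is plain arithmetic on the rounded figures above, so the tokenizer caveat applies in full: read the ratios as ballpark, not as measurements.

```python
# Rounded figures from the results table and the paragraph above.
CLAUDE_MAIN_THREAD = 92_000
CLAUDE_SUBAGENTS = 26_000

LOCAL_TOKENS = {
    "Qwen 3.5-9B": 15_000,
    "Qwen 3.6-35B-A3B": 13_000,
    "Qwen 3.6-27B Q4": 25_000,
    "Qwen 3.6-27B Q3": 12_000,
}


def generation_multiples(claude_tokens: int, local: dict[str, int]) -> dict[str, float]:
    """How many times more tokens Claude generated than each local arm."""
    return {arm: claude_tokens / tokens for arm, tokens in local.items()}


claude_total = CLAUDE_MAIN_THREAD + CLAUDE_SUBAGENTS

if __name__ == "__main__":
    print(f"Claude total including subagents: {claude_total:,}")
    for arm, ratio in generation_multiples(CLAUDE_MAIN_THREAD, LOCAL_TOKENS).items():
        print(f"{arm}: main thread alone is {ratio:.1f}x")
```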

One operational wrinkle: OpenCode's Plan Mode sometimes pauses to ask the operator a clarifying question; Claude Code never did. Pauses tracked compactions: once compaction dropped load-bearing context, the model asked rather than guessed. I answered minimally to keep human input uniform.

How each plan performed

The numbers above are what each run cost; whether the plan that came out is any good is another matter. Here's what that comes down to, using the same feature area (Authentication) from Claude's plan and the Q4 27B's:

Claude

  • it('registers a new user with district + fleet and lands on /logbook') — Pre: anon, districts/fleets seeded. Verifies: navigation to /logbook, success toast, new user row in DB.
  • it('shows a validation error when password and confirmation differ') — Verifies: Livewire re-renders with .input-error, no navigation.
  • it('logs in with valid credentials and redirects to /logbook')

Qwen 3.6-27B Q4

  • A-01 /register — valid data creates a user and logs in.
  • A-08 /login — valid credentials redirect to /flashes.
  • A-10 /login — redirect to /login when hitting /flashes unauthenticated.

Same feature area, two plans, lightly condensed. Claude commits to the real route (/logbook) with per-test preconditions and assertions; Q4 routes everything through the hallucinated /flashes and never gets past a one-line description. A wrong route in the plan becomes a 404 in every spec built from it.

That difference (real routes and assertions versus made-up routes and one-liners) is the kind of thing the rubric has to score. I graded all five on six dimensions (Structure / Constraints / Specificity / Coverage / Grounding / Conciseness), 1–10 each, unweighted mean, best to worst left to right:

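Dimension | Claude | 35B-A3B | 27B Q3 | 27B Q4 | 9B
Structure | 10 | 8 | 7 | 5 | 5
Constraints | 10 | 7 | 7 | 5 | 5
Specificity | 10 | 6 | 6 | 4 | 4
Coverage | 10 | 8 | 5 | 5 | 8
Grounding | 10 | 4 | 6 | 3 | 3
Conciseness | 9 | 6 | 7 | 7 | 3
Mean | 9.8 | 6.5 | 6.3 | 4.8 | 4.7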
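If you regrade the plans yourself, the Mean row is just the unweighted average of the six dimension scores, rounded to one decimal. A small sketch that recomputes it from the table (the dictionary layout is mine; the scores are the ones listed above):

```python
from statistics import mean

DIMENSIONS = ("Structure", "Constraints", "Specificity", "Coverage", "Grounding", "Conciseness")

# Scores per arm, in the same order as DIMENSIONS.
GRADES = {
    "Claude": (10, 10, 10, 10, 10, 9),
    "35B-A3B": (8, 7, 6, 8, 4, 6),
    "27B Q3": (7, 7, 6, 5, 6, 7),
    "27B Q4": (5, 5, 4, 5, 3, 7),
    "9B": (5, 5, 4, 8, 3, 3),
}


def unweighted_means(grades: dict[str, tuple[int, ...]]) -> dict[str, float]:
    """Average the dimension scores for each arm, rounded to one decimal."""
    return {arm: round(mean(scores), 1) for arm, scores in grades.items()}


if __name__ == "__main__":
    for arm, avg in unweighted_means(GRADES).items():
        print(f"{arm}: {avg}")
```

Swapping in a weighted mean (for example, counting Grounding double) is a one-line change if you think confabulated selectors should cost more than the rubric charges for them.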

Full disclosure: the grading was done by Claude (Opus 4.7), with heavy input from me. This would be a conflict of interest if models had feelings about winning; they don't. It does muddle the methodology, and I won't pretend otherwise; this is the level of rigor a side project gets once it's months past the weekend I budgeted for it. The rubric and all five plans are in the repo: grade them yourself if you'd rather not take a robot's word for it.

Claude Opus 4.7 9.8: implementable end-to-end

The only plan that picks up the prompt's three foregrounded constraints (dynamic dates via DateRangeService, Laravel-seeded fixtures + storageState auth, and JS-timing-as-the-risk-surface for Livewire morphs) and uses them as load-bearing premises rather than checklist items. Section 6 (the JavaScript integration testing section) uses observable-state waits everywhere (_flatpickr.selectedDates.length, .has-entry class transitions, morph.added/morph.updated event sequences) instead of the waitForTimeout anti-pattern the prompt explicitly warned against. Selectors match the real codebase: #date-picker / #date-picker-single, #activity_type / _edit, #district-select / #fleet-select, the actual <div class="alert alert-${type}"> toast structure. Proposes a concrete app/Console/Commands/E2eSeedCommand.php with named scenarios (base, fresh-user, many-flashes, admin, tiered, leaderboard), implementable as-is.

Minor blemishes: a TODO leak (“relax via count later if needed”), a process.env.PW_PROJECT reference to a variable Playwright doesn't actually use (real: PLAYWRIGHT_PROJECT or test.info().project.name), and a year-dropdown test gated on January && currentYear > START_YEAR that won't run in CI 11 months of the year. Real bugs, but trivially fixable.

Reads like a senior engineer who has run a Playwright + Livewire suite before, knew where Livewire bites, and pre-empted the bites.

Qwen 3.6-35B-A3B 6.5: looks polished, won't run as written

Structurally faithful: all 7 sections in order, A-G feature areas mapped 1:1 to the prompt, describe/it naming, covers all 5 JS modules. Adds a sensible “Progress & Awards” sub-grouping that maps cleanly to real ProgressCard behavior.

But the selectors are extensively confabulated. The date picker is consistently called #flash-date-picker (real: #date-picker). Toast selectors are wrong: .alert-title, .alert-body, .toast-icon. The real toast has no title/body split and no .toast-icon class. The TomSelect option selector is given as .ts-dropdown option, but real options carry class .option and aren't <option> tags. Sailor-logs filter IDs are off by a prefix.

Section 7 also expands beyond the prompt's “exactly 7 sections” constraint with unrequested “CSS/visual quality checks” and “Accessibility quality checks” subsections. The seeding section is internally contradictory: claims storageState but the regularUser fixture logs in via UI every time, and FLASH_SEEDING_STRATEGY hardcodes dates contradicting the dynamic-date constraint the same plan acknowledges elsewhere.

The pattern is: wide coverage, confidently expressed, mostly wrong about the actual DOM contracts of the libraries the app uses. A senior engineer reading this plan would say “I'd have to fix the selectors on basically every spec before any of them runs.”

Qwen 3.6-27B Q3_K_XL 6.3: narrower but better-grounded

Selector accuracy is meaningfully better than 35B in the places that matter: gets #date-picker / #date-picker-single, #activity_type / _edit, #sailing_type / _edit, and, crucially, the correct sailor-logs filter IDs (#sailor-logs-district-select / #sailor-logs-fleet-select). Admin routes correct. Calls out the requestAnimationFrame cycle and morph.added reinit explicitly. Storage-state pattern described correctly. Tiered-user fixtures (tier1User, tier2User, tier3User, capUser, overCapUser) are well-thought-out for the milestone and non-sailing-cap tests.

But two of the prompt's seven feature areas are missing: Profile Management and Cross-feature Flows. The A-G framing was renamed and these two were dropped on the floor. A handful of profile tests are sprinkled into the Authentication describe-block (“shows pending email alert in profile after email change”), but there's no district/fleet edit coverage, no profile personal-info edit coverage, no register→profile flow, no log→leaderboard flow, no milestone-cross-feature tests.

Some selectors are still off (toast .toast-icon / .toast-message / .toast-close are invented; verification banner is referenced as .verification-banner class when it's actually #verification-banner ID; TomSelect dropdown classes are wrong). Still uses waitForTimeout(100)/waitForTimeout(200) in JS sections.

My read matches the grader's: of the four Qwen plans, Q3 is the strongest candidate for actually implementing. The missing feature areas are a structural gap that an implementation agent can recover from by re-reading the codebase. Confabulated selectors are a grounding failure that costs a debug cycle per spec. Q3 has more of the former and less of the latter. And, notably, Q3 also caught real Livewire-event-driven behaviors Claude missed (sailor-logs mutual filter clearing, URL→state on tab switching), so it's not purely a damage-minimization pick.

Qwen 3.6-27B Q4_K_XL 4.8: confidently wrong about URLs

Shortest plan (416 lines) and the most damaging failure mode: route hallucinations throughout. /flashes for the logbook (real: /logbook), /flashes (profile section) for the profile page (real: /profile is its own route), /admin/awards and /admin/logs for admin (real: /admin/fulfillment and /admin/sailor-logs). Every spec navigating to these routes would 404 on the first call. Q4 also invents a whole date-of-birth-validator.js module that doesn't exist (DOB logic lives in user-profile-form.js).

Section 7 adds unrequested Accessibility and Performance Baselines subsections despite “no additional sections” in the prompt. Profile management is folded into a single tab with seven tests, all addressed via the wrong route.

The underlying VRAM/compaction story is in Fitting the Local Arms above; this critique sticks to what's in the resulting plan.

Qwen 3.5-9B 4.7: quantity over precision

Highest raw test count (~100 enumerated test cases across A-G) and includes /export/user-data in HTML validation that other plans missed. At 1,327 lines, the plan looks impressive on volume alone.

But:

  • Uses snake_case test_user_can_visit_home_page-style names throughout Section 1 instead of the requested describe/it naming, a structural prompt-adherence failure
  • Hardcoded 2026-05-15 dates in FLASH_SEEDING_STRATEGY despite a separate DATE_STRATEGY block correctly computing dynamic dates, directly contradicting the prompt's date-handling constraint
  • Invents Livewire action names (wire:click="open-edit-modal"; the real method dispatches as openEditModal($flashId))
  • 1,327 lines mostly consisting of repetitive code snippets that re-explain the same Playwright pattern
  • Adds a closing “Summary” section duplicating the seven sections in bullet form

The 9B's volume is the failure mode. Many specs that need rework, many that wouldn't run because of the date and wire:click hallucinations. Mostly redeemed by sheer enumerative breadth: when you list every conceivable test case, you happen to catch ones the bigger models missed (rate-limit tests, audit-log checks, empty-export blocking). But a senior engineer trying to use this as a starting point would spend most of the time deleting.

Cross-plan patterns that recurred

Four findings from the grade that show up in multiple Qwen plans and never in Claude's:

  1. waitForTimeout as a default wait strategy. All four Qwen plans use it after Livewire interactions. The prompt's Section 6 framing was a direct invitation to design observable-state-based waits. Only Claude picked up the signal.
  2. Hallucinated selectors cluster around third-party libraries. Toast structure and TomSelect classes get confabulated across multiple plans. These are libraries whose documented class names are similar across versions and easy to invent from training data; the smaller and lower-quant the model, the more aggressive the confabulation.
  3. Section 7 attracts unrequested bloat. Three of four Qwen plans added Accessibility / Performance / CSS-quality subsections despite the prompt's explicit “no additional sections” line. Claude's Section 7 stays inside the requested scope.
  4. Plan size is not plan quality. The longest plan (9B at 1,327 lines) ranked last. The shortest local plan (Q4 at 416 lines) ranked fourth. Claude's 782-line plan ranked first. The shape of the work (what claims it makes, how grounded each claim is, whether instruction-following held) dominates the line count.

There's also an inverse pattern worth flagging: although no Qwen plan approached Claude's overall quality, each one caught at least a few real test cases Claude omitted. The 9B's exhaustive enumeration surfaced registration rate-limiting, admin audit-logging, and explicit non-sailing-day counting semantics. Q3 caught the sailor-logs mutual filter clearing and the URL→state direction on tab switching. 35B-A3B caught the yacht_club profile field and broke tie-breaking into discrete tests. About a dozen cases in total: not enough to change the ranking, but enough that the honest takeaway is “Claude's structure plus the Qwen-caught cases,” not “use Claude's plan as-is and discard the rest.”

Agentic-coding tip: for discovery work, run it more than once. For enumeration tasks (test cases, edge cases, failure modes), the variance that hurts when you want one right answer helps when you want a complete one. Even the four weaker plans here each caught real cases Claude missed. Run the same prompt across several models, or just several times against the same one, and each pass surfaces something the others didn't; union the results and hand them to your best model to dedup. You're not picking a winner; you're harvesting coverage.
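A minimal sketch of the union step, assuming each run's output has already been parsed into a flat list of test-case titles. The `RunOutput` shape, the `normalize` rule, and the sample titles are illustrative, not taken from the experiment. Exact-match merging like this only removes trivial duplicates; the semantic dedup is still the part you hand to your best model.

```typescript
// Hypothetical shape: one list of test-case titles per run or per model.
type RunOutput = { source: string; cases: string[] };
type MergedCase = { title: string; foundBy: string[] };

// Collapse case, punctuation, and whitespace so trivial rewordings compare equal.
function normalize(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Union all runs, keeping the first wording seen and recording which runs found each case.
function unionCases(runs: RunOutput[]): Map<string, MergedCase> {
  const merged = new Map<string, MergedCase>();
  for (const run of runs) {
    for (const title of run.cases) {
      const key = normalize(title);
      const entry = merged.get(key);
      if (entry === undefined) {
        merged.set(key, { title, foundBy: [run.source] });
      } else if (entry.foundBy.indexOf(run.source) === -1) {
        entry.foundBy.push(run.source);
      }
    }
  }
  return merged;
}

const merged = unionCases([
  { source: "run-a", cases: ["Rejects future dates", "Exports CSV"] },
  { source: "run-b", cases: ["rejects future dates.", "Rate-limits registration"] },
]);
console.log(merged.size); // 3
```

The `foundBy` list is worth keeping: a case only one run surfaced is exactly the kind of coverage a single pass would have missed.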


Phase 2: Implementation, Claude Code vs Qwen 3.6-27B Q3 on the Same Plan

With all five plans written and graded, it's finally time to write some code. Phase 1 showed a clear frontier edge in planning: Claude's plan out-scored every local arm by a wide margin. Phase 2 asks whether that extra capability is still needed once a plan is in hand, or whether the design was the hard part and a capable-enough local model can take it from there. It also partially tests the leading-indicator claim from earlier, that a model's plan quality foreshadows how it implements: if the stronger planner also builds more from the same plan, that's a point in its favor.

To answer that, both arms work from a single combined plan, the synthesis plan (~830 lines), built on the winning Opus plan as its base and augmented with the dozen or so real cases Claude missed that the local arms caught (the inverse pattern from the last section). Assembling it this way puts that section's advice into practice: aggregate several runs into one result more comprehensive than any single arm produced.

Two models then implement that plan head-to-head, on the same setup as Phase 1: Claude Code on Opus 4.7 (1M context, extra-high reasoning) and Qwen 3.6-27B Q3_K_XL via OpenCode on the same 24 GB GPU. Both get the same implementation prompt, the same gotflashes codebase, and one hard constraint: test code only; the app stays read-only (bar the one artisan seed command the plan itself calls for), so neither can “fix” the app to make a failing test pass. Claude building the plan is the all-cloud baseline; Q3 building it is the cloud-plan / local-execute hybrid, worth flagging on its own: going local isn't all-or-nothing. Keep the cloud for the design it's best at and run the bulk of the work on your own hardware.

Agentic-coding tip: an agent that operates a real app does real things; cap its blast radius. Phase 1 was read-only: agents read the codebase and wrote plans. Implementation is different: the agent runs your app, so anything that app does in production (emails, webhooks, database writes, external API calls), it'll trigger for real, often as a side effect of making a test pass. The risk is sharpest on a hobby setup, where there's usually one environment and the agent inherits whatever real credentials your dev config holds. A safe override only protects you if every entry point loads it: Laravel ships an .env.testing, but php artisan serve won't pick it up unless you boot with APP_ENV=testing, exactly the kind of gap an agent wanders straight through. So prompts asking for safe behavior aren't enough, and neither is a safe config the agent can sidestep; only enforced, infrastructure-layer guards are. Those working in an enterprise setting should have an isolated environment where the agent never sees a production secret. For a hobby project you may need to get more creative: fake credentials, a fail-safe allowlist (the harness I wrote for exactly this), and a check before each run.
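As a rough sketch of the "check before each run" idea: a guard that refuses to start unless the environment the agent will inherit looks safe. This is not the harness mentioned above. `APP_ENV` comes from the text, `DB_HOST` and `MAIL_HOST` are the usual Laravel env keys, and the allowlisted values are placeholders I picked for illustration.

```typescript
// Hosts the agent is allowed to reach. Values are placeholders for illustration.
const ALLOWED_HOSTS: Record<string, string[]> = {
  DB_HOST: ["127.0.0.1", "localhost"],
  MAIL_HOST: ["127.0.0.1", "localhost"],
};

// Returns a list of violations; an empty list means the run may start.
// Fail-safe: a missing key counts as a violation rather than a pass.
function checkEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  const appEnv = env["APP_ENV"];
  if (appEnv !== "testing") {
    problems.push(`APP_ENV is "${appEnv ?? "unset"}", expected "testing"`);
  }
  for (const key of Object.keys(ALLOWED_HOSTS)) {
    const allowed = ALLOWED_HOSTS[key] ?? [];
    const value = env[key];
    if (value === undefined || allowed.indexOf(value) === -1) {
      problems.push(`${key} is "${value ?? "unset"}", not on the allowlist`);
    }
  }
  return problems;
}

const problems = checkEnv({
  APP_ENV: "local",
  DB_HOST: "127.0.0.1",
  MAIL_HOST: "smtp.example.com",
});
console.log(problems.length); // 2 (APP_ENV and MAIL_HOST)
```

The point of the design is where it runs: in whatever launches the agent, outside anything the agent can edit, so a non-empty list aborts the run before the app ever boots.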

An alternative design would have had each model implement its own plan. I rejected that: pairing a weak plan with a weak implementer compounds the two failures and tells you little about implementation ability in isolation. Holding the plan fixed (the same strong plan for both) is what isolates the part Phase 2 is trying to measure.

Why Q3 as the local arm, and not the 35B-A3B that edged it in Phase 1? It was a near-tie (6.3 vs 6.5), and the rubric scored plan-writing; implementation rewards different things. It rewards grounding (real selectors and routes), which was Q3's edge (6 to 4 over the 35B); the 35B's edge was coverage, but a hallucinated #flash-date-picker costs a debug cycle in every spec, where a missing feature area can be recovered by re-reading the code. It rewards headroom: Q3 loads at 163k context, the 35B only 131k on a sliver of live-free memory, and a multi-hour run is where that gap bites. And it rewards reasoning per token: Q3 is the dense 27B with every parameter active, where the 35B fires just ~3B of its experts. Q3 is also the config the setup dialed in, the local arm a real user would run.

Headline results

| Metric | Claude Code (Opus 4.7, 1M ctx) | Qwen 3.6-27B Q3 (OpenCode, 163k ctx) |
| --- | --- | --- |
| Total wall-clock (first prompt → last response) | 3 h 47 min | 3 h 5 min |
| Operator-wait time (idle waiting on me) | 4 min 10 sec | 26 min 37 sec |
| Agent-only time (compute and tool use) | 3 h 43 min | 2 h 39 min |
| Operator interventions | 1 (a single “continue”) | 7 (6 unblock-stalls + 1 explicit nudge) |
| Context compactions during the run | 0 | 4 (~70% trigger, 163k window) |
| Peak context used (single turn) | 322,672 tokens (1M ctx) | ~134k (per OpenCode TUI at end) |
| Median per-turn context | 222,555 tokens | (compaction-clamped) |
| Spec files written (*.spec.ts) | 39 | 39 |
| Total test files (specs + helpers) and lines | 52 files / 2,868 lines | 53 files / 2,708 lines |
| Tests authored | 203 | 140 |
| Passing (final run) | 114 | 74 |
| Failing | 52 | 52 |
| Skipped | 37 (project-gated HTML checks) | 14 |
| test.fixme() calls | 17 | 9 |
| test.skip() calls | 8 | 16 |
| waitForTimeout calls (prompt forbids) | 0 | 32 across 15 files |
| App-code touches | 0 (clean separation) | 0 (clean separation) |
| KNOWN-APP-ISSUES.md written? | Yes: 5 blockers documented | No |

The local model wasn't the slow one

Q3 already finished first on the wall clock (3 h 5 min against 3 h 47 min). Strip out the time each agent spent idle waiting for me to type a nudge and its lead widens: Claude needed 3 h 43 min of compute and tool-use time, Q3 needed 2 h 39 min, so Q3 was 1 h 4 min faster in pure agent-time. Q3 ran at ~50 tok/s on local hardware; Claude ran on the Opus 4.7 API with extra-high reasoning, generating a lot more thinking tokens for a lot more tests. So the local model isn't slower in compute; the gap is in what gets produced during that time.

That difference shows up first in coverage. The synthesis plan enumerated 109 test cases across 7 feature areas (auth, logbook, leaderboard, profile, admin, export, cross-feature flows). Both arms had the same plan. Static counts of test(...) calls per feature folder:

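| Feature area | Plan | Claude | Q3 | Claude / Plan | Q3 / Plan |
| --- | --- | --- | --- | --- | --- |
| A. auth | 19 | 19 | 9 | 100% | 47% |
| B. logbook | 22 | 22 | 19 | 100% | 86% |
| C. leaderboard | 16 | 17 | 13 | 106% | 81% |
| D. profile | 15 | 16 | 12 | 107% | 80% |
| E. admin | 29 | 30 | 20 | 103% | 69% |
| F. export | 2 | 2 | 2 | 100% | 100% |
| G. flows | 6 | 6 | 7 | 100% | 117% |
| TOTAL (A–G) | 109 | 112 | 82 | 103% | 75% |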

Claude wrote every plan-specified case and added three more (one each in leaderboard, profile, admin). Q3 stopped 27 short of the plan, with the gap concentrated in two feature areas: auth (10 missing test cases, more than half the section) and admin, the largest section (9 missing). These are entire test cases the plan named and Q3 simply didn't implement. A pass-rate table alone doesn't surface this; an agent could pass 100% of an implementation that covers half the plan, and the rubric should care about both.
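For anyone who wants to reproduce the coverage column, here is a small sketch of the arithmetic using Q3's static counts from the table above. The function names and the 75% floor are my own choices for illustration. Note that this is a ratio of counts, not a one-to-one match against plan cases, which is why an area can exceed 100%.

```typescript
type AreaCounts = { area: string; planned: number; written: number };

// Ratio of written tests to planned cases, rounded to a whole percent.
function coveragePct(planned: number, written: number): number {
  return Math.round((written / planned) * 100);
}

// Q3's static test(...) counts per feature folder, from the table above.
const q3Counts: AreaCounts[] = [
  { area: "auth", planned: 19, written: 9 },
  { area: "logbook", planned: 22, written: 19 },
  { area: "leaderboard", planned: 16, written: 13 },
  { area: "profile", planned: 15, written: 12 },
  { area: "admin", planned: 29, written: 20 },
  { area: "export", planned: 2, written: 2 },
  { area: "flows", planned: 6, written: 7 },
];

function totals(rows: AreaCounts[]): { planned: number; written: number } {
  let planned = 0;
  let written = 0;
  for (const row of rows) {
    planned += row.planned;
    written += row.written;
  }
  return { planned, written };
}

// Areas below a chosen floor are the ones a pass rate alone would hide.
function underCovered(rows: AreaCounts[], floorPct: number): string[] {
  return rows
    .filter((row) => coveragePct(row.planned, row.written) < floorPct)
    .map((row) => row.area);
}

const q3Totals = totals(q3Counts);
console.log(coveragePct(q3Totals.planned, q3Totals.written)); // 75
console.log(underCovered(q3Counts, 75).join(", "));           // auth, admin
```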

What gets produced: Claude wrote 45% more tests (203 vs 140), needed one unblock prompt over the whole run, and never hit context compaction thanks to the 1M-token context carrying the long task. Q3 wrote a narrower suite, hit compaction four times, needed seven operator interventions, and ended at ~65% of Claude's passing-test count.

The asymmetry that matters isn't speed; it's how often the run drags you back in. Q3 needed seven hands-on moments to Claude's one. The idle minutes between them only measure how long I happened to be elsewhere; the real cost is each interruption itself: the context-switch back into a run you'd meant to fire and forget. That's the experience gap.

Compaction was the dividing line

Back in Phase 1, the 27B showed compaction degrading capability by dropping load-bearing context into a lossy summary. Phase 2 made the same point with a different lever. Claude's 1M-token run never approached the compaction trigger over a 3 h 47 min stretch of file reads, test writing, and npx playwright test runs. Q3 at 163k context hit compaction at the 18-minute, 60-minute, 95-minute, and 135-minute marks, and three of those four compactions were immediately followed by an operator “continue” message, because the post-compaction agent stalled instead of resuming.

The 1M ceiling wasn't decorative: Claude's peak per-turn input context across the implementation run was 322,672 tokens (median 222k), well past the 200k Opus standard window. At standard 200k context, Claude would have compacted multiple times; by my count, the median turn alone would have crossed a 70% trigger. The 1M-token ceiling is load-bearing specifically for long-horizon implementation, not for medium-horizon planning. If you're picking where to spend an upgraded-context budget, that's the answer: impl needs it; planning doesn't.
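The arithmetic behind that claim, as a small sketch. The 70% trigger is the figure quoted in this post; applying it to a 200k window for Claude is the hypothetical being checked, and the function names are mine.

```typescript
// Token count at which compaction would fire, given a window size and a trigger percentage.
function compactionThreshold(windowTokens: number, triggerPercent: number): number {
  return Math.floor((windowTokens * triggerPercent) / 100);
}

// True when a single turn's input context reaches the compaction threshold.
function wouldCompact(turnTokens: number, windowTokens: number, triggerPercent: number): boolean {
  return turnTokens >= compactionThreshold(windowTokens, triggerPercent);
}

const MEDIAN_TURN = 222_555; // Claude's median per-turn context in this run
const PEAK_TURN = 322_672;   // Claude's peak per-turn context in this run

console.log(compactionThreshold(200_000, 70));       // 140000
console.log(wouldCompact(MEDIAN_TURN, 200_000, 70)); // true
console.log(wouldCompact(PEAK_TURN, 1_000_000, 70)); // false
```

The median turn clears the hypothetical 200k threshold by more than 80k tokens, while the peak turn sits at roughly a third of the 1M window.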

This is the strongest piece of evidence in the experiment that context window size is a capability-level feature, not a comfort feature, for long-horizon implementation tasks.

Test-code quality, dimension by dimension

Pass rate and plan coverage are the loud numbers. The quiet ones are the per-test quality signals: how each agent chose to write the tests they did write. Six dimensions, programmatic counts, mixed picture:

| Dimension | Claude | Q3 | Winner |
| --- | --- | --- | --- |
| Selector strategy: getByRole(...) calls | 9 | 79 | Q3 |
| Selector strategy: class-chain locator('.xxx') (lower is better) | 105 | 26 | Q3 |
| Domain helpers: utils/livewire.ts waitForLivewireIdle uses | 1 | 50 | Q3 |
| Assertions per test (expect(...) / test block) | 1.3 | 1.0 | Claude |
| Strong-assertion ratio (toBe/toEqual/toHaveText/etc. vs total) | 24% | 22% | tied |
| test.beforeEach blocks (per-test setup discipline) | 19 | 5 | Claude |
| Test names < 30 chars (lower is more descriptive) | 11 | 18 | Claude |
| : any / as any usages (lower is better TypeScript hygiene) | 7 | 15 | Claude |

Selectors and helpers (Q3 wins, real). getByRole is Playwright's recommended primary locator strategy: it queries the accessibility tree, which is stable across CSS class renames. Q3 wrote 79 getByRole(...) calls; Claude wrote 9. Conversely, Claude reached for class-chain selectors (locator('.daisy-ui-class-name')) 105 times; Q3 used them 26 times. Q3 also invested in utils/livewire.ts (a 92-line helper with a waitForLivewireIdle function used 50 times across specs); Claude's equivalent helper exists but is used once. Q3 wrote the more Playwright-idiomatic test code.

Assertion density (Claude wins). 1.3 expect(...) calls per test vs Q3's 1.0. Concretely: Claude tests average a state-change action plus a multi-property verification; Q3's average is closer to “do the thing, check one observable.”

Test isolation (Claude wins, materially). Claude wrote 19 test.beforeEach blocks for explicit per-test state setup; Q3 wrote 5. Claude's 3.8× more frequent beforeEach usage means more tests own their own preconditions.

Net read. If I were assembling the final suite, I'd graft Q3's getByRole habit and its waitForLivewireIdle helper onto Claude's stronger isolation, denser assertions, and tighter types: mine the local arm for style, keep the cloud arm for discipline. Claude isn't uniformly better; it's that where Q3 wins it's a code-style call, and where Claude wins it's discipline.

Agentic-coding tip: a repeated mismatch belongs in CLAUDE.md / AGENTS.md, not a per-file correction. Most of these dimensions are preferences: getByRole over class chains, a beforeEach per spec, no : any. When a model keeps making the same call you wouldn't, that's not a bug to fix test-by-test; it's a missing line in your agent-instructions file. Write the rule down once (CLAUDE.md for Claude Code, AGENTS.md for OpenCode and most others) and it steers every file the agent touches afterward. My rule of thumb: a divergence you see once is a correction; the same one three times is a documentation gap.

Constraint compliance: the prompt asked for no waitForTimeout

Claude used waitForTimeout zero times across its 39 spec files. Q3 used it 32 times across 15 files, including in its own utils/livewire.ts helper, so the pattern is baked into Q3's wait strategy, not isolated to a handful of tests it gave up on. Those 32 waitForTimeout calls are the flake-source in Q3's suite that doesn't show up in the pass-rate table. On a slower or more heavily loaded CI runner, where a fixed sleep stops being long enough, Q3's tests would start failing in patterns Claude's wouldn't, even before any code change.

The clearest illustration is that same utils/livewire.ts helper: the one the quality table just credited Q3 for leaning on 50×, and whose whole job is to replace waitForTimeout with observable-state waiting. Both arms wrote one, and both stub the actual idle-detection (Livewire 3 exposes no idle flag); the fallback is the tell.

Claude: utils/livewire.ts

export async function waitForLivewireIdle(page: Page, timeout = 10_000): Promise<void> {
  await page.waitForFunction(
    () => {
      const wire = (window as any).Livewire;
      if (!wire) return true;
      return true;
    },
    { timeout },
  );
  await page.waitForLoadState('networkidle');
}

Q3: utils/livewire.ts

export async function waitForLivewireIdle(page: Page, timeout = 15_000): Promise<void> {
  const livewireDone = page.waitForFunction(
    () => {
      const lw = (window as any).Livewire;
      if (typeof lw === 'undefined') return true;
      const contexts = lw.initial || lw.numpy;
      if (contexts) {
        return true;
      }
      return true;
    },
    { timeout: 2000 }
  );
  await Promise.all([
    livewireDone.catch(() => {}), // Ignore timeout
    page.waitForTimeout(1500),    // Give Livewire time to process
  ]);
  await page.waitForTimeout(500);
}

Claude falls back to waitForLoadState('networkidle'). Q3 hardcodes two fixed sleeps into the very helper meant to avoid them, and reaches for an lw.numpy property that doesn't exist on Livewire (every branch returns true regardless, so the “state check” is decorative). That helper is called 50× across the specs; it's why the waitForTimeout count is structural, not a few stragglers.
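For contrast, here is a sketch of what "observable-state waiting" means in the abstract. Neither helper quoted above works this way, and this is plain TypeScript rather than Playwright code. `waitUntil` and `InFlightTracker` are hypothetical names, and the in-flight counter stands in for whatever signal the app can actually expose. In a real Playwright suite the same idea is usually expressed with web-first assertions on a locator or with page.waitForFunction instead of a hand-rolled poll.

```typescript
// Poll a condition until it holds or the deadline passes.
// Unlike a fixed sleep, this returns as soon as the state is observed
// and fails loudly if it never is.
async function waitUntil(
  condition: () => boolean,
  timeoutMs = 2000,
  intervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical stand-in for an app-side signal: a counter of in-flight requests.
class InFlightTracker {
  private count = 0;
  begin(): void {
    this.count += 1;
  }
  end(): void {
    this.count = Math.max(0, this.count - 1);
  }
  isIdle(): boolean {
    return this.count === 0;
  }
}

async function demo(): Promise<string> {
  const tracker = new InFlightTracker();
  tracker.begin();
  setTimeout(() => tracker.end(), 50); // simulated server round-trip
  await waitUntil(() => tracker.isIdle());
  return "idle";
}

demo().then((state) => console.log(state)); // idle
```

The difference from a sleep is in the failure mode: a slow environment makes this wait longer, up to the timeout, instead of silently racing ahead.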

This is the gap between “the agent compiles and runs” and “the agent reads the prompt carefully.” Q3 produced tests; Claude produced tests that follow the constraints the operator set.

Which sharpens the tip above, and points at something deeper than style. Documenting a preference in CLAUDE.md or AGENTS.md only helps with a model that honors its instructions, and that's exactly why Claude's gaps are fixable: it follows the rules it's given, so a new rule lands. Q3 ignored an explicit, prompt-level constraint 32 times; one more line in AGENTS.md wouldn't have changed that. Claude's misses are preference gaps a rule closes; Q3's are rule-following gaps no rule reaches. The cloud model's behavior is steerable in a way the local one's isn't.

What Q3's pass/fail board actually tells you

74 passing, 52 failing reads like an ordinary mid-run result. Look closely at both columns, though, and neither means quite what the number suggests: the red overstates how many distinct problems there are, and the green overstates how much actually works.

The red. The two arms' 52 failures cluster very differently:

| Claude's 52 failures | Q3's 52 failures |
| --- | --- |
| 21: admin dashboard (wire:model.live selector timing) | 18: date picker + Livewire idle timing (across 6 specs) |
| 15: SESSION_SECURE_COOKIE blocks fresh contexts (documented, fixme'd) | 11: profile pages (personal-info 5, district-fleet 3, banner 1, etc.) |
| 6: profile/form selectors (regression in last in-session run) | 10: JS-integration (TomSelect 6, toast 4) |
| 5: calendar date exhaustion (seed fills most allowed dates) | 13: HTML-validation + assorted (HTML 5, nav 2, password-toggle 2, etc.) |

Claude's failures cluster around specific, named technical blockers, four of which it documented in KNOWN-APP-ISSUES.md (next section). Q3's cluster around a single root cause repeated across surfaces: the date-picker + Livewire idle-timing pattern (a waitForTimeout/wait problem) accounts for 18 of its 52 failures across 6 spec files. So Q3's 52 isn't 52 problems; it's largely one fix it found late and never generalized. Claude's failures look like a mature suite hitting environmental edges; Q3's look like an apprentice that discovered the fix and didn't go back and apply it.

The green, and this is the part that should worry you. Q3 wrote tests that pass on the day they were written, not in general. Several of its passes aren't evidence the tests are right; they're latent date-value bugs (a different failure class from the idle-timing cluster above) that just didn't happen to fire on the day the suite ran. Both arms got the same warning about dynamic dates and the same domain quirks (January grace period; future dates beyond today+1 rejected), and both exported the same date helpers (todayISO, isoDaysAgo, isoDaysAhead). Only Claude used them everywhere. Three places Q3 didn't:

1. logbook/edit.spec.ts and logbook/delete.spec.ts reach past their own helpers. Q3 wrote a private logFlash(page, activityType, dayOfMonth) function that constructs dates as new Date(now.getFullYear(), now.getMonth(), dayOfMonth) directly, bypassing the helper module that owns date computation. It is called with dayOfMonth values of 2, 3, and 6. Run these tests in the first few days of any month (the 1st through the 4th, when day 6 is still later than today+1) and the seeded date is too far in the future for the app to accept.

2. E2eSeedCommand::seedManyFlashes silently creates 0 or 1 flashes instead of 20, every month of the year. The loop is while (count($dates) < 20 && $d->lte($minDate)) with $d starting at “Jan 1 of current year” and the bound being $minDate (the earliest allowed date). The fix is one word ($maxDate instead of $minDate). The pagination tests that depend on this scenario can't possibly hit their threshold.

3. Grace-period test handling diverges. The plan called for a “returns 403 when posting an edit for an out-of-range flash” test, which requires seeding a flash with a date the app would normally reject. Claude marked this test.fixme() and added an entry to KNOWN-APP-ISSUES.md explaining why test code can't bypass the DateRangeService validator. Q3 wrote the test, but it uses waitForTimeout(1000) and the same wire.set('dates', ...) routing pattern that broke 44 times against the wrong Livewire component.
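To make the first of those concrete, here is a sketch of relative-date helpers next to the fixed day-of-month pattern. The names todayISO, isoDaysAgo, and isoDaysAhead come from the suites; the implementations below are my assumption of what such helpers look like, not the code either agent wrote, and the check ignores the January grace period. The year is arbitrary.

```typescript
// Format a Date as YYYY-MM-DD in local time.
function toISODate(d: Date): string {
  const mm = String(d.getMonth() + 1);
  const dd = String(d.getDate());
  return `${d.getFullYear()}-${mm.length < 2 ? "0" + mm : mm}-${dd.length < 2 ? "0" + dd : dd}`;
}

// Date relative to `now`; setDate handles month and year rollover.
function isoDaysFrom(now: Date, offsetDays: number): string {
  const d = new Date(now.getFullYear(), now.getMonth(), now.getDate());
  d.setDate(d.getDate() + offsetDays);
  return toISODate(d);
}

const todayISO = (now: Date = new Date()): string => isoDaysFrom(now, 0);
const isoDaysAgo = (n: number, now: Date = new Date()): string => isoDaysFrom(now, -n);
const isoDaysAhead = (n: number, now: Date = new Date()): string => isoDaysFrom(now, n);

// The fragile pattern: a fixed day of the current month.
function fixedDayOfMonth(now: Date, dayOfMonth: number): string {
  return toISODate(new Date(now.getFullYear(), now.getMonth(), dayOfMonth));
}

// Rule from the prompt: dates beyond today+1 are rejected.
// YYYY-MM-DD strings sort chronologically, so a string comparison is enough.
function isBeyondTomorrow(iso: string, now: Date): boolean {
  return iso > isoDaysAhead(1, now);
}

const may26 = new Date(2026, 4, 26);
const june2 = new Date(2026, 5, 2);
console.log(isBeyondTomorrow(fixedDayOfMonth(may26, 6), may26)); // false: day 6 is in the past
console.log(isBeyondTomorrow(fixedDayOfMonth(june2, 6), june2)); // true: day 6 is 4 days ahead
console.log(isBeyondTomorrow(isoDaysAgo(3, june2), june2));      // false on any run date
```

A relative offset is valid whenever the suite runs; a fixed day of the month is only valid once the calendar has caught up with it.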

None of this dents the 74-passing count. Bug 1 passes on May 26 and fails in the first week of any month; Bug 2 fails every day, but Q3 logged its symptoms as generic “leaderboard” failures and never traced them. So both columns mislead in the same direction: the red is mostly one un-generalized fix wearing 18 disguises, and the green is a snapshot of one day's luck. Claude's board is the opposite: its red is a labeled to-do list, and its green holds up next month because it ran dates through the helpers everywhere and wrote down what it couldn't clear. That last habit (documenting a dead end instead of leaving it to surface later) is the next section.

The KNOWN-APP-ISSUES.md story

[Figure: Claude Code's delivery summary at the end of the implementation run: 55 files committed across 12 feature areas, with per-area file and test counts (including tests marked test.fixme) and a categorised breakdown of the 52 failing tests into four named clusters.]

Claude wrote a 5-entry KNOWN-APP-ISSUES.md documenting limitations it ran into: hashed password-reset tokens that test code can't extract, rate-limiter state with no test-only reset hook, grace-period validation that rejects seeded historical dates, SESSION_SECURE_COOKIE=true blocking HTTP fresh-context logins (affecting 15 tests), email-change verification tokens not accessible from tests. Each entry names the affected spec, the underlying issue, the suggested fix, and links the tests that use test.fixme() to point back to the entry. One entry, verbatim:

## 4. SESSION_SECURE_COOKIE Blocks Fresh Context Login Over HTTP

**Files:** `tests/e2e/flows/milestone-progress.spec.ts`, `tests/e2e/flows/register-and-profile.spec.ts`, `tests/e2e/logbook/empty-state.spec.ts`, `tests/e2e/export/user-data-csv.spec.ts`
**Issue:** The `.env` has `SESSION_SECURE_COOKIE=true`, which sets the `Secure` flag on session cookies.
The E2E test server runs on HTTP (port 8001), so browsers in fresh `browser.newContext()` contexts silently drop the session cookie after login — the POST succeeds but no session is established for subsequent requests.
Tests that need to log in as a different user (tiered users, fresh user) in a fresh context are blocked by this.
**Workaround:** Tests that need different users should pre-cache their storageState in `auth.setup.ts`, or the test environment should set `SESSION_SECURE_COOKIE=false`.
**Status:** `test.fixme()` for affected tests

One of five entries. Full file: KNOWN-APP-ISSUES.md.

Q3 didn't create the file at all. When Q3 hit blockers, it retried until the operator stepped in, covered in detail in Q3's self-direction problem below.

Postscript: structured failure reporting led me to a solution Claude didn't see. When I sat down to plan closing out the remaining failures, I read through KNOWN-APP-ISSUES.md carefully (the password-reset token hashing entry, the rate-limit-state entry, the SESSION_SECURE_COOKIE entry), and partway through the list, it clicked: most of these are Laravel-side test affordances that Pest 4's new browser-testing primitives already provide. The conclusion didn't come from Claude. Claude didn't know about Pest 4 specifically; it didn't propose a framework migration as the fix. What it did do was lay the constraints out in a form clean enough that a better path became visible to me as a reader. The shape of Claude's report did more analytical work than its content. This is the underrated half of “good failure documentation pays you back”: sometimes the payoff isn't a fix the agent could write, it's a fix you can see because the agent wrote the constraints down well.

Q3's self-direction problem: looping, then stopping short

The back half of Q3's run exposed a self-monitoring gap: the agent couldn't tell when it had stopped making progress. It surfaced twice: once mid-run as a blind retry loop, and once at the very end as a premature stop.

Mid-run (around step 83, about 90 minutes in), Q3 wrote a test for the multi-date logbook page that called comp.set('dates', [d]) directly via page.evaluate. The /logbook route renders multiple Livewire components on the same page, and Q3's call routed to the wrong one (email-verification-banner), which has no $dates property: a clean HTTP 500 with a PublicPropertyNotFoundException in storage/logs/laravel.log.

Q3 re-ran the failing test. Same error. Re-ran it again. Same error. The structured Laravel log accumulated the same exception 44 times. The agent's main loop went idle at step 83 with no fix applied and no recognition that it was looping.

Notably, this isn't the model being incapable: Q3 found the right diagnosis the moment I handed it a hint. The failure is self-monitoring: Q3 treated each retry as fresh, reading the error, trying a tweak, failing, then treating the next attempt as a new problem. Nothing in its loop recognized “I've seen this exact exception ten times in a row; my approach isn't converging.” On a hands-off run, that meant I had to step in with the actual diagnosis (three sentences, one architectural hint), which unlocked another ~50 minutes of useful work before the next stall. So the operator-burden gap is bigger than the bare intervention count suggests: Q3's nudges weren't attention-cost “continue”s, they were real diagnostic work, reading logs, spotting the wrong-component routing, handing back the fix.
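The missing check is small enough to sketch. This is not something OpenCode or Q3 has, as far as this run showed; it is a hypothetical guard a harness could wrap around a retry loop. The class name, the signature rule, and the sample messages are all illustrative.

```typescript
// Reduce an error message to a stable signature by masking numbers
// and collapsing whitespace, so retries of the same failure compare equal.
function errorSignature(message: string): string {
  return message.replace(/\d+/g, "N").replace(/\s+/g, " ").trim();
}

class StallDetector {
  private lastSignature = "";
  private streak = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record one failed attempt; returns true once the same signature
  // has been seen `limit` times in a row.
  record(message: string): boolean {
    const sig = errorSignature(message);
    this.streak = sig === this.lastSignature ? this.streak + 1 : 1;
    this.lastSignature = sig;
    return this.streak >= this.limit;
  }
}

const detector = new StallDetector(3);
const attempts = [
  "PublicPropertyNotFoundException (attempt 1)",
  "PublicPropertyNotFoundException (attempt 2)",
  "PublicPropertyNotFoundException (attempt 3)",
];
for (const message of attempts) {
  if (detector.record(message)) {
    console.log("stalled: same failure 3 times in a row, stop retrying and escalate");
  }
}
```

With a limit of 3, the loop above prints once, on the third attempt. A guard like this doesn't supply the diagnosis, but it turns 44 silent retries into an early, explicit request for help.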

The second instance showed up at the very end, and it wasn't obvious until I read the session export carefully.

[Figure: Q3's OpenCode TUI at session end: 134.3k / 163k context (~82% full), with a final summary categorising the remaining failures (date-picker, TomSelect, profile, etc.) and the “I should wrap up” framing the operator accepted as the stop signal.]

After the seventh and final nudge, Q3's last 50 minutes looked like this:

  • Tightened its own Playwright timeout from 60 s to 30 s (a config change, not a fix).
  • Re-ran the suite. New numbers: 74 passing, 52 failing, slightly worse than the in-session baseline of 75 / 51 it had ten minutes earlier. The shorter timeout pushed one passing test into the “timeout” bucket.
  • Diagnosed the issue accurately: “tests pass in isolation but time out in the full suite due to session expiry from the global-setup storage state becoming stale during the ~1.5 minute run.”
  • For the 16 date-picker tests: noted “need the same pickDates + waitForLivewireIdle fix applied.” Explicitly identified the fix, knew where to apply it, did not apply it.
  • Wrote a status summary and stopped.

That is not “Q3 completed the run.” Q3 made the suite marginally worse, identified what would fix its largest failure cluster, declined to apply that fix, and self-declared finished. The 74/52/14 result is Q3's self-chosen stopping point, not its best effort on the plan.

Claude's run had no analogue. Its single operator intervention was a bare “continue” 2 h 14 m in: an unblock for a paused-but-not-stuck state, not a diagnosis for a misunderstood error. It ran longer, hit no compaction, committed 55 files with a delivery message, and documented in its 5-blocker KNOWN-APP-ISSUES.md exactly why each deferred test was deferred. Q3 looped, stopped short, and left no such record.

What plan-driven testing can't reach: three real bugs neither suite caught

I know three production bugs in this app that neither agent's tests would catch, and they sort into two kinds, the second of which plan-driven testing structurally can't reach.

Bug 1: the automated database backup doesn't work. The app ships a db:backup artisan command and a Laravel scheduler entry. Backup is broken in production. grep -rniE 'backup' tests/ on either tree returns zero hits inside spec files.

Bug 2: logging in after a long idle returns a “Page Expired” (HTTP 419) instead of authenticating cleanly. Laravel's default session and CSRF-token lifetime is 120 minutes; a login form left open longer than that has a stale _token value that Laravel rejects before evaluating the credentials. Neither arm tested the idle-expiry path.

Bug 3: the CSV export and the leaderboard can credit the same year's flashes to different fleets. ExportController resolves a flash's club affiliation by exact-year match (no carry-forward); the leaderboard and User::membershipForYear() carry forward the most recent prior membership when a year has no row of its own. So for any sailor with such a gap, the same flashes can land in different fleets/districts on the export than on the leaderboard. Each path is individually “correct”; they just encode different answers to “what fleet was this sailor in that year?”

All three are real defects; what differs is why the suites missed them. For Bugs 1 and 2 the miss is a documentation-chain gap, not an agent-capability gap: each agent tested what the plan named, and the plan was derived from the PRD. The backup command exists in the codebase but never made it into docs/prd.md or CLAUDE.md, so nothing pointed the plan at it; the idle-login “Page Expired” is a corner of a documented feature the PRD covers only at a high level. Neither miss is the model being dim: it's the model faithfully covering the inputs it was handed, and the bug wasn't in them. Bug 3 is a different kind of miss: a cross-feature consistency bug. Each feature behaves as documented in isolation; the defect lives in the seam between them, where two code paths answer the same question differently. No per-feature plan catches it, because no single feature is wrong: you'd only find it by asserting that export and leaderboard agree, which takes a human who suspects they might not.
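Here is the seam in Bug 3 reduced to a sketch. The two functions restate the rules as described above (exact-year for the export, carry-forward for the leaderboard); they are not the app's PHP code, the data is invented, and returning null when the exact-year lookup finds nothing is my assumption about the export path.

```typescript
type Membership = { year: number; fleet: string };

// Exact-year rule (the export path as described): no row for that year, no fleet.
function fleetExactYear(rows: Membership[], year: number): string | null {
  for (const row of rows) {
    if (row.year === year) return row.fleet;
  }
  return null;
}

// Carry-forward rule (the leaderboard path as described): latest row at or before the year.
function fleetCarryForward(rows: Membership[], year: number): string | null {
  let best: Membership | null = null;
  for (const row of rows) {
    if (row.year <= year && (best === null || row.year > best.year)) {
      best = row;
    }
  }
  return best ? best.fleet : null;
}

// The seam check: years where the two rules give different answers.
function disagreeingYears(rows: Membership[], years: number[]): number[] {
  return years.filter((y) => fleetExactYear(rows, y) !== fleetCarryForward(rows, y));
}

// A sailor with a gap: rows for 2023 and 2025, nothing for 2024.
const memberships: Membership[] = [
  { year: 2023, fleet: "Fleet A" },
  { year: 2025, fleet: "Fleet B" },
];
console.log(disagreeingYears(memberships, [2023, 2024, 2025]).join(", ")); // 2024
```

Each function passes its own unit tests. Only the third one, which no feature owns, can fail.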

The broader takeaway: the chain “plan → tests → coverage” is only as strong as its weakest link, and that link is upstream of any model choice. Bugs 1 and 2 trace to a documentation chain that didn't carry the feature forward; Bug 3 shows the ceiling is lower still: some bugs live in the seams between features that are each individually correct and individually documented, and no plan built feature-by-feature thinks to test the seam. Phase 2 held the plan fixed for Claude and Q3, so both arms inherited the same holes. The model swap fixes none of it.


The Structural Gap: Subagents and Parallelism

The most under-appreciated difference between Claude Code and local agents isn't model quality; it's task delegation. During this experiment's Playwright planning task, the Claude Code (Opus 4.7) run spawned three subagents to research different parts of the codebase. The Qwen arms running through OpenCode could not do this; everything was one linear conversation in one context window. Two practical consequences:

1. Subagents are effectively “free” context. When Claude spawns a subagent to investigate a directory or summarize a set of files, the subagent runs in its own fresh context window. Only the summary it returns counts against the main conversation. A planning task that pulls in 200k tokens of file content via three parallel subagent reads might leave the main context window holding only 5–10k tokens of summary. The user's effective working context is multiplied without changing the model or hardware.

Locally, every file read goes into the same context budget. I watched OpenCode hit context compaction once on the 35B-A3B at 131k context after reading the same set of files. Compaction summarizes prior context away, and detail is lost with it. With Claude Code's subagents, the same exploration would barely move the needle on the main context.

2. Parallel task execution is real wall-clock savings. Three subagent researches in parallel finish in roughly the time of the slowest one, not the sum. On broad codebase exploration this can be a 3x speedup with no model change. A local rig can't even spin up a second OpenCode session without doubling VRAM, which a 24 GB GPU can't do for these model sizes.
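Both consequences come down to simple accounting, which the following Python sketch illustrates. The per-subagent token counts and the durations are made-up stand-ins (chosen to match the 200k-read, 5–10k-summary shape described above, with short sleeps in place of real research), not measurements from the experiment.

```python
import asyncio
import time
from typing import List

# Hypothetical figures: three subagents each read part of the codebase
# and hand a short summary back to the main conversation.
TOKENS_READ = [70_000, 70_000, 60_000]   # per subagent, 200k total
TOKENS_RETURNED = [3_000, 2_500, 2_500]  # summaries, 8k total


def main_context_cost(delegated: bool) -> int:
    """Tokens that land in the main context window for the same exploration."""
    return sum(TOKENS_RETURNED) if delegated else sum(TOKENS_READ)


async def research(seconds: float) -> None:
    """Stand-in for one research task; sleeping simulates the work."""
    await asyncio.sleep(seconds)


async def timed_parallel(durations: List[float]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(research(d) for d in durations))
    return time.perf_counter() - start


async def timed_sequential(durations: List[float]) -> float:
    start = time.perf_counter()
    for d in durations:
        await research(d)
    return time.perf_counter() - start


durations = [0.10, 0.20, 0.15]  # stand-in task lengths, in seconds
parallel_s = asyncio.run(timed_parallel(durations))
sequential_s = asyncio.run(timed_sequential(durations))

print("main context, one linear session:", main_context_cost(delegated=False))
print("main context, with subagents:    ", main_context_cost(delegated=True))
print(f"parallel ~{parallel_s:.2f}s (slowest task), sequential ~{sequential_s:.2f}s (sum)")
```

The speedup approaches 3x only when the three tasks take similar time; with one slow task it shrinks toward 1x.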

This is the gap that doesn't show up in benchmark charts. Claude isn't just smarter per token; it's a fundamentally different agent architecture. For workflows that can be decomposed into parallel subtasks (exploration, multi-file refactors, comparative research), the cloud lead is structural, not marginal. Local LLMs are competing on raw model capability with a hand tied behind their back.

Not every cloud edge is a capability edge, though. Prompt caching, for instance, lets the cloud re-read its growing context at a fraction of the input price each turn, so long sessions stay cheap to run: a cost advantage, not a quality one. Locally it's moot: you don't pay per token, and LM Studio reuses the KV prefix for the same prefill benefit. Worth separating out, because unlike the subagent gap this one says nothing about which model writes the better plan.

Agentic-coding tip: parallel agents need a parallel dev setup. The flip side of all that parallelism: the moment you run two agents at once (two sessions, or a cloud arm and a local arm side by side), they collide on the single-tenant assumptions your dev box was built on. Here, both implementers defaulted their Playwright webServer to php artisan serve on 127.0.0.1:8000; the second silently failed to bind, every test failed on navigation, and the agent, with no way to see the real cause, started “fixing” a Playwright config that was never wrong. Give each agent its own port, database, and scratch directory up front (I now pin --port=8001 / 8002 in the prompt). Anything two agents share, they will eventually fight over.
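One way to make that isolation explicit is a small preflight step that hands each agent its own port, database name, and scratch directory before any session starts. This is a rough Python sketch, not what I ran: the AgentSandbox type, the app_test_ database naming, and the allocate helper are all hypothetical, and the port check is only a snapshot, since another process could still grab the port before the server binds it.

```python
import socket
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class AgentSandbox:
    name: str
    port: int
    database: str       # hypothetical naming scheme, one database per agent
    scratch_dir: Path

    @property
    def serve_command(self) -> str:
        # Same command the agents defaulted to, with the port pinned.
        return f"php artisan serve --port={self.port}"


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """True if host:port can be bound right now (a snapshot, not a reservation)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def allocate(names: List[str], base_port: int = 8001,
             root: Optional[Path] = None) -> List[AgentSandbox]:
    """Give every agent a distinct free port, database name, and scratch directory."""
    root = root or Path(tempfile.mkdtemp(prefix="agents-"))
    sandboxes: List[AgentSandbox] = []
    port = base_port
    for name in names:
        while port <= 65535 and not port_is_free(port):
            port += 1
        if port > 65535:
            raise RuntimeError("no free port available")
        scratch = root / name
        scratch.mkdir(parents=True, exist_ok=True)
        sandboxes.append(AgentSandbox(name, port, f"app_test_{name}", scratch))
        port += 1  # never hand the same port to two agents
    return sandboxes


for sandbox in allocate(["claude", "local"]):
    print(sandbox.name, sandbox.serve_command, sandbox.database, sandbox.scratch_dir)
```

Failing loudly here is the point: a port that is already taken becomes a visible setup error instead of a silent bind failure the agent then tries to debug.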

Should you go local?

[Radar chart: Local (Qwen 27B Q3) vs. Claude Code on six axes: Quality, Parallelism, Speed, Privacy, Availability, Control]
The actual decision, as a shape: Claude owns quality, parallelism and speed; local owns privacy, availability (offline / no rate limits) and control (own the stack, tinker, no external dependency). Unlike the grading radars above, these axes cross. The privacy / availability / control side is genuinely “which corner you value,” but Quality and Parallelism aren't a wash: there it's plainly “who's better,” and it's Claude. Opinionated and qualitative, not measured. Speed folds in operator burden; cost cuts both ways depending on how much you run, so it isn't a clean axis here.

Read the shape honestly: going local means accepting a real step down in coding capability (and the hard structural ceiling of one model, one session, no subagents) in exchange for privacy, control, and independence from anyone else's infrastructure. With that trade on the table:

Go local if:

  • You have strict privacy or air-gap requirements where code-leaves-machine is non-negotiable
  • You enjoy tinkering with inference optimization (the VRAM-vs-context-vs-quant trilemma is genuinely interesting)
  • You want zero dependency on external services (rate limits, outages, billing)
  • You're comfortable with the structural gap: no parallel subagents, no parallel sessions, one model at a time, one conversation at a time
  • The Q3-style “best Qwen at the task” outcome (~6.3 on a 10-point rubric, second-tier failure modes) is acceptable for your work

Stick with Claude Code if:

  • You want the best coding quality available: the grade gap is real, not marginal
  • You value your time. The hour spent debugging VRAM pressure is gone forever
  • You need concurrent sessions, parallel subagents, or any kind of context-multiplication via delegation
  • You're working on multi-file refactors where the agent needs to hold multiple files in mind simultaneously: Claude's subagent architecture handles this; locals can't

One note on cost, since the radar leaves it out. Priced at Anthropic's published Opus rates (the 1M-context window bills this experiment's many 200K+ token turns at the long-context premium), the deduplicated token usage (via ccusage) puts the Claude side at roughly $300 of API spend, or about $175 before that premium, nearly all of it cache reads re-processing the growing context. (ccusage's own total comes out far lower, ~$55, because it has no list price for this model and falls back; $300 is the figure at Opus's actual rates.) That's about three months of the $100/month subscription I pay, so for anything past very occasional use the subscription is the cheaper path, and metered API only wins if you'd run this a couple of times a year. Either way, that's no reason for an individual to buy a $2,000 GPU; Claude is cheap at this cadence, and local was never the cost play. The economics only flip at the other end: enterprises don't get the Max plan, so a team is on metered API or an enterprise contract where that per-run cost is live and compounds with volume. A privacy- or compliance-bound org already eyeing self-hosting is the one the $0-per-run rig genuinely pencils out for.
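The arithmetic behind those comparisons is short enough to write out. A back-of-the-envelope Python sketch using only the round figures quoted in this post; it treats every run as costing the same as this experiment and ignores electricity, the rest of the rig, and everything else a subscription gets used for, so read it as an illustration rather than a pricing model.

```python
SUBSCRIPTION_PER_MONTH = 100     # USD, the Max plan price I pay
API_COST_PER_RUN = 300           # USD, this experiment at Opus list rates (approx.)
API_COST_BEFORE_PREMIUM = 175    # USD, same usage without the long-context premium
GPU_PRICE_RANGE = (1_600, 2_000)  # USD, RTX 4090 alone


def months_of_subscription(usd: float) -> float:
    """How many months of the subscription a given spend would cover."""
    return usd / SUBSCRIPTION_PER_MONTH


def breakeven_runs_per_year() -> float:
    """Runs per year at which metered API spend equals a year of subscription."""
    return 12 * SUBSCRIPTION_PER_MONTH / API_COST_PER_RUN


gpu_months = tuple(months_of_subscription(p) for p in GPU_PRICE_RANGE)

print("one run in subscription-months:", months_of_subscription(API_COST_PER_RUN))
print("long-context premium per run:  ", API_COST_PER_RUN - API_COST_BEFORE_PREMIUM)
print("break-even runs per year:      ", breakeven_runs_per_year())
print("GPU price in subscription-months:", gpu_months)
```

At these numbers, metered API is only cheaper below four experiment-sized runs a year, and the GPU alone equals 16 to 20 months of subscription.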

The practical hybrid workflow for those with the hardware: run Claude Code for serious agentic work (planning, multi-file refactors, agentic test generation), and reach for local when you specifically need offline / privacy / no-cloud-dependency on a focused single-task session. For that local arm, the rule that falls out of this experiment is simple: run the biggest dense model you can fit while still leaving enough context to finish the task without compacting. Density is what buys reasoning (every parameter active on every token), and the context headroom is what keeps a long run from tripping the compaction that quietly costs you capability. On 24 GB that points straight at the dense 27B at a quant that clears ~160k context, the Q3 config this experiment kept coming back to. The 35B-A3B MoE is for fast scratch work only (wide context, lower per-spec accuracy), and the 9B is a pure long-context fallback you'll discard a lot of.

Conclusion: impressive, not yet a daily driver

The hypothesis I opened with (plan quality predicts implementation quality) held: the model that wrote the best plan wrote the cleanest implementation, and Q3, the strongest of the local plans, was the strongest of the local implementers. Phase 1's cheap signal was a real one, and the cheapest practical takeaway in the experiment falls out of it: to size up a model for your own codebase, read the plan it writes before committing hours to an implementation that will inherit the same flaws. But the headline is the bigger question the whole experiment was chasing, and the honest answer is two-sided: what a single 24 GB card buys you for agentic coding is genuinely impressive, and not yet a daily driver against the cloud.

The impressive half first. Q3 took Claude's plan and turned it into a working Playwright suite (74 passing tests) in 2 h 39 min of agent-only compute, on the same GPU that was driving my monitors. And its output isn't a worse-shaped Claude implementation; it's a different shape: more Playwright-idiomatic in its selectors, more invested in domain helpers, just lower in coverage and looser on discipline. A year ago I wouldn't have bet on a local model getting this close on a real codebase.

The “not yet” lives in four things the post built up. Constraint compliance: Q3 ignored an explicit, prompt-level rule 32 times, and that's the un-fixable kind, because a line in AGENTS.md only steers a model that follows the rules it's already given. Trust: some of its passing tests pass on the day they were written, not in general: latent date bugs, and fixes it found but never generalized. Self-direction: it looped on one error 44 times, stopped short of a fix it had already identified, and cost seven operator interventions getting there. And the structural gap: no subagents, no parallel sessions, so it competes on raw model capability with a hand tied behind its back.

That's why the hybrid (cloud to plan, local to execute) only half works. You get a suite at the end, real coverage for a few hours of local compute; you don't get something you'd ship without a full review pass. And the parts that are broken can take longer to hunt down than they'd have taken to write right. It's a real workflow for a focused, private, single-task session, not a hands-off one.

And if there's one transferable lesson from running five arms at the same task, it's that the artifact worth carrying forward was never any single plan: it's the synthesis. When you compare models on real work, the deliverable is rarely the best single output; it's what you get from stitching the wins together.

I opened the whole thing by saying this isn't a cost play, and the numbers bear that out: the gap to Claude is real, and on a dev machine you pay for it in VRAM math and debugging time. But that was never the point for me: I wanted to know whether the self-hosted setup I enjoy tinkering with is viable for real work, and the two-sided answer above is a better one than I'd have bet on a year ago. The trend line is the encouraging part. The whole experiment (the five plans, both test trees, the grading, the raw session logs, warts and all) is in the companion repo if you want to run your own.

Say Hello! If you found this helpful, spotted something that could be improved, or just want to say thanks, I'd love to hear from you. Shoot me an email at [email protected]. Because this is a self-hosted blog, I don't have any good readership metrics, so hearing from real people is the best way to know this content is reaching someone. If enough people are interested, I'll get a comment system going.
