Small Brains, Big Test: Can Small LLMs Pass the CISSP?
I've been meaning to prep for and take the CISSP exam for a while now. This work does not advance that goal, but it's easy to pretend it does. Sixteen small open-weight models, 1,303 practice questions, 64 configurations. The results reveal a clean capability ceiling, a contamination pattern hiding in plain sight, and a handful of questions that stumped every single model, including the ones that probably cheated.
Why Small Models Matter
Frontier models require an estimated 20x–50x the computation of an 8B model that runs locally. That's not a cost problem you can throw money at. For workloads like real-time log analysis or high-throughput classification, the math doesn't work regardless of budget. Faster hardware helps, but a 20x compute multiplier doesn't shrink with a better GPU; it shrinks with a smaller model.
Most enterprises aren't there yet. The current priority is adoption and enablement, and frontier API calls are the path of least resistance. But the workloads that actually need AI at scale (private deployments, edge compute, high-volume inference) will require smaller, purpose-built models running on local hardware.
Here I benchmark sixteen popular small open-weight models, assess their capabilities, and evaluate how they hold up when quantized for constrained hardware. All of it runs on my personal desktop.
Why the CISSP?
The CISSP (Certified Information Systems Security Professional) is a legitimate benchmark. It's a notoriously broad and difficult exam: eight domains spanning risk management, cryptography, identity systems, secure development, and more. I wouldn't go as far as to say passing the CISSP qualifies a model for a security task, but it's a meaningful indicator in that direction.
This post establishes the baseline: how well do popular small open-weight models perform on CISSP questions, zero-shot, with no fine-tuning? The answer will tell us which models are worth investing fine-tuning compute in, and how much room there is to improve.
Model Selection
I limited my search to models fitting the following criteria:
- Open-weight
- 8 billion parameters or fewer
- Popular first-party base models (no community fine-tunes)
- Available officially through Ollama (for convenience)
An exception was made for SmolLM3, which was imported from HuggingFace GGUF. As the only open-data model on the list, it felt important to include.
An exception was also made for DeepSeek R1 8B, which is a fine-tuned variant of Qwen3 rather than an independent base model. It was included because it is a highly-popular officially released DeepSeek model available through Ollama.
The hard cap of 8B parameters was set with future fine-tuning and NPU deployment in mind. Models up to 12B would likely fit on my 4090 at full precision, but would be difficult or unworkable for those future plans.
After filtering, sixteen models spanning eight families and three size classes (1–2B, 3–4B, 7–8B) made the cut.
| Model | Params | Provider | License | Training Cutoff | Released | Data Risk† |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | Meta | Llama 3.1 | Dec 2023 | Jul 2024 | Clean |
| Llama 3.2 3B Instruct | 3B | Meta | Llama 3.2 | Dec 2023 | Sep 2024 | Clean |
| Llama 3.2 1B Instruct | 1B | Meta | Llama 3.2 | Dec 2023 | Sep 2024 | Clean |
| Gemma 3 4B IT | 4B | Google | Gemma | Aug 2024 | Mar 2025 | Possible |
| Gemma 3 1B IT | 1B | Google | Gemma | Aug 2024 | Mar 2025 | Possible |
| Gemma 3n E4B IT | 8B/4B* | Google | Gemma | Jun 2024 | Jun 2025 | Borderline |
| Gemma 3n E2B IT | 6B/2B* | Google | Gemma | Jun 2024 | Jun 2025 | Borderline |
| Phi-4 Mini | 3.8B | Microsoft | MIT | Jun 2024 | Feb 2025 | Borderline |
| Mistral 7B v0.3 Instruct | 7B | Mistral AI | Apache 2.0 | Undisclosed | May 2024 | Clean |
| Ministral 3 8B Instruct | 8B | Mistral AI | Apache 2.0 | Undisclosed | Dec 2025 | Unknown |
| Ministral 3 3B Instruct | 3B | Mistral AI | Apache 2.0 | Undisclosed | Dec 2025 | Unknown |
| DeepSeek R1 8B (Qwen3) | 8B | DeepSeek | MIT | Undisclosed | May 2025 | Unknown |
| Qwen3 8B | 8B | Alibaba | Apache 2.0 | Undisclosed | Apr 2025 | Unknown |
| Qwen3 4B | 4B | Alibaba | Apache 2.0 | Undisclosed | Apr 2025 | Unknown |
| Qwen3 1.7B | 1.7B | Alibaba | Apache 2.0 | Undisclosed | Apr 2025 | Unknown |
| SmolLM3 3B | 3B | HuggingFace | Apache 2.0 | Jun 2025 | Jul 2025 | Documented |
*Gemma 3n uses selective parameter activation (MatFormer): total parameters are 8B (E4B) and 6B (E2B), but only 4B/2B are active at runtime. Memory footprint reflects the full parameter count.
†Data risk relative to ISC2 Practice Tests publication (June 2024), see The Contamination Question.
Notable Exclusions
Two candidates were excluded for producing unstructured output, a problem that would follow them into any inference pipeline. Phi-4 Mini Reasoning ignores `think: false` and spends its entire token budget on chain-of-thought before producing an answer letter. The older DeepSeek R1 Qwen distills (7B and 1.5B, January 2025) emit verbose multi-paragraph explanations rather than concise answers, averaging 54–98 tokens per question versus ~2 for well-behaved models, with the 1.5B hitting the output cap on 29% of questions. The newer R1 8B (May 2025, Qwen3-based) follows concise-output instructions correctly and is included.
Quantization Levels
Quantization: reducing model weight precision from 16-bit floats (FP16) to lower bit-widths (e.g., Q8_0) to shrink memory footprint and improve inference speed, at some cost to accuracy. `_0` denotes simple uniform quantization. `_K` denotes K-quants, importance-aware quantization that selectively preserves higher precision for the weights that matter most. The trailing `_M` (medium) and `_S` (small) indicate the variant mix: `_M` keeps more tensors at higher precision, preserving slightly more accuracy at a marginally larger file size.
Every model was evaluated at FP16, Q8_0, and Q4_K_M. Llama 3.x and Mistral 7B also have Q6_K, Q5_K_M, Q5_K_S, and Q4_K_S available in Ollama's library, allowing a finer-grained view of the accuracy-vs-compression curve.
All models use publisher-provided GGUF files converted from BF16, a training-optimized representation, to FP16 weights.
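For back-of-envelope planning, weight memory scales linearly with bits per weight. A rough sketch (the bits-per-weight figures are approximate averages I've assumed here; K-quants mix precisions internally, and real usage adds KV cache and runtime overhead on top):

```python
# Approximate average bits per weight for common GGUF quantization levels.
# These are ballpark figures, not exact per-model values.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 8-bit weights plus a per-block scale
    "Q6_K": 6.56,
    "Q5_K_M": 5.69,
    "Q4_K_M": 4.85,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB for a model at a given quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for q in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"8B @ {q}: ~{weight_gb(8.0, q):.1f} GB")
```

This is why an 8B model at FP16 (~16 GB) nearly fills a 24GB card once KV cache is added, while the same model at Q4_K_M (~5 GB) leaves room to spare.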
Security and Alignment Warning: Qwen and DeepSeek
Both families feature robust censorship aligned with the Chinese Communist Party baked directly into the model weights. Any deployment should consider the direct negative impact of this censorship as well as the risk of additional unknown alignment directives.
The DeepSeek iOS app is effectively spyware, providing ample justification to treat all of their products as poisoned. While hiding backdoors or malware directly in model weights would be difficult, it is not impossible for a sophisticated state-sponsored adversary.
Both models are evaluated here on locally-hosted GGUF weights via Ollama. Their results are reported honestly, including the high accuracy scores, which may owe more to training data contamination than architectural merit. They are included for completeness of the capability comparison. Again, I strongly recommend these models not be used for production deployments, especially in security-adjacent workloads.
Methodology
Extraction of the evaluation set and creation of the test harness were ~~vibe coded~~ engineered with Claude Code.
ISC2 Official CISSP Practice Tests (2024)
The evaluation set is the 2024 ISC2 Official CISSP Practice Tests: 1,306 questions covering all eight CISSP domains.
Preparing the dataset required significant cleanup. 53 image-dependent questions were recovered by incorporating visual content as ASCII diagrams, Mermaid flowcharts, or inline text. Errors touching roughly 30% of the dataset required manual or AI-augmented cleanup. The final dataset was verified through multiple independent passes: automated mismatch detection, source cross-referencing, and post-evaluation analysis where questions that every model answered incorrectly were investigated individually to distinguish genuinely hard questions from data errors. Even with Claude's help automating the detection and cross-referencing, verification took over a full work day.
After filtering out 3 remaining image-dependent questions, I arrive at 1,303 evaluation questions:
- 803 domain-tagged questions from chapters covering individual domains (Domains 1–8)
- 500 practice test questions from mixed-domain practice exams (chapters 9–12), tagged as domain "Unspecified"
This includes all three question types: 1,246 single-answer, 44 multi-select, and 16 matching questions. The matching questions, which ask the model to pair numbered items with lettered descriptions, are included in scoring and reveal an interesting failure mode discussed below.
| Domain | Exam Weight | Questions |
|---|---|---|
| 1. Security and Risk Management | 15% | 100 |
| 2. Asset Security | 10% | 105 |
| 3. Security Architecture and Engineering | 13% | 101 |
| 4. Communication and Network Security | 13% | 100 |
| 5. Identity and Access Management | 13% | 100 |
| 6. Security Assessment and Testing | 12% | 100 |
| 7. Security Operations | 13% | 100 |
| 8. Software Development Security | 11% | 100 |
| Practice Tests (mixed) | — | 500 |
| Total (after filtering 3 image-dependent) | — | 1,303 |
Harness & Scoring
Each model was prompted with a standardized format. For single-answer questions (95% of the set):
You are taking the CISSP certification exam. Answer the following multiple-choice question.
QUESTION:
[question text]
OPTIONS:
A. [option text]
B. [option text]
C. [option text]
D. [option text]
INSTRUCTIONS:
- This question has ONE correct answer
- Respond with ONLY the letter (A, B, C, or D)
- Do NOT include explanations, reasoning, or any other text
- Format: A
Example response: C
YOUR ANSWER:
Multi-select questions used adapted instructions asking for comma-separated letters.
Key evaluation parameters:
- Ollama API with `temperature=0.0` (deterministic) and `num_predict=150` (short answers)
- Answer extraction via regex with multiple fallback patterns, confirmed with LLM-as-judge validation and human review of disagreements
- Scoring: exact match for single-answer, exact set match for multi-select (no partial credit)
- VRAM monitoring via `nvidia-smi` polling during inference
- Hardware: NVIDIA RTX 4090 (24GB), Ollama running natively on Windows, harness calling `localhost:11434`
Results below are from a single trial run. A second full trial confirmed that accuracy is perfectly reproducible at temperature 0: every model returned identical scores across both runs, with the sole exception of the Ministral family, which showed minor non-determinism. Performance metrics (tok/s, wall time) varied slightly between runs depending on background system load; I report numbers from the cleaner of the two runs. The two trials totaled roughly 7 hours of GPU compute time.
Results: The FP16 Baseline
The primary benchmark runs every model at FP16 with no quantization to establish the baseline. This measures each model's true knowledge ceiling before compression artifacts enter the picture. Use the precision selector to compare Q8_0 and Q4_K_M results in the same table; the tier analysis below uses FP16 as the reference.
| Model | Params | Accuracy | Correct / 1,303 | VRAM (GB) | Tokens/sec | Wall (min) |
|---|---|---|---|---|---|---|
Random guessing on four-option multiple choice gives 25%. With that baseline in mind, the results split into clear tiers:
Tier 1: 75–82% (undisclosed training cutoffs). Ministral 3 8B leads at 81.9%, followed closely by Qwen3 8B at 80.8%. DeepSeek R1 8B and Qwen3 4B round out this tier at 76–77%. These are the models with undisclosed training cutoffs released in 2025; more on this in a moment.
Tier 2: 70–75%. Llama 3.1 8B is the clean anchor at 72.8% with a definitively pre-test training cutoff (December 2023). Ministral 3 3B outscores it at 74.5% but carries the same undisclosed-cutoff caveat as its 8B sibling. Gemma 3n E4B (74.3%) and E2B (71.0%) are borderline: their June 2024 cutoff lands right around the book's publication date. Gemma 3 4B rounds out the tier in the low 70s.
Tier 3: 59–69%. Qwen3 1.7B is the standout here: 69.0% from just 1.7B parameters is remarkable. Phi-4 Mini (65.1%) and Llama 3.2 3B (63.5%) round out the viable middle. SmolLM3 3B (63.5%) lands at the bottom of this tier, consistent with expectations for a 3B model trained on open datasets (FineWeb-Edu) rather than domain-specific technical material.
Tier 4: Below 55%. Mistral 7B v0.3 at 49.6% is the biggest disappointment in the lineup: a 7B model that can't crack 50% and gets outperformed by Qwen3 1.7B, a model with roughly 4x fewer parameters. It's hard to justify 7B of VRAM for performance that a 1.7B model beats by 20 points. Gemma 3 1B at 45.8% is respectable for 1B parameters but not useful out of the box.
Below viable. Llama 3.2 1B (25.0%) is statistically indistinguishable from random guessing.
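A quick way to check the "indistinguishable from random" claim is a normal-approximation confidence interval on the observed accuracy. A sketch, where 326 is my assumed correct count implied by exactly 25.0% of 1,303:

```python
import math

def accuracy_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an observed accuracy."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Llama 3.2 1B: 25.0% on 1,303 questions implies ~326 correct (assumed).
lo, hi = accuracy_ci(326, 1303)
print(f"95% CI: {lo:.1%} to {hi:.1%}")
```

The interval comfortably contains the 25% random-guessing baseline, so the model's score carries no evidence of knowledge.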
The Contamination Question
Remember those training cutoff dates? The ISC2 Official Practice Tests were published in June 2024. That date defines the Data Risk column in the model table above:
- Clean: cutoff definitively predates the book
- Borderline: cutoff at or near publication date
- Possible: cutoff after publication
- Documented: post-publication but full training recipe published
- Unknown: cutoff undisclosed
The pattern is hard to ignore. Controlling for size class, models with confirmed pre-June-2024 cutoffs top out at 72.8% (Llama 3.1 8B) and 74.3% (Gemma 3n E4B). Models with undisclosed cutoffs in the same size class jump to 77–82%.
SmolLM3 is an interesting control. Its June 2025 cutoff is well after the book's publication, but HuggingFace documents the full training mixture: FineWeb-Edu (70%), Stack-Edu-Python (20%), and FineMath (10%): curated web content scored for educational quality, not book corpora. CISSP study material from websites, forums, and Quizlet pages could easily appear in FineWeb-Edu, but the questions themselves may not. SmolLM3's 63.5% at 3B matches Llama 3.2 3B (63.5%, Dec 2023 cutoff) almost exactly, reinforcing that genuine 3B capability without contamination lands in the low 60s.
The cleanest high score is Gemma 3n E4B at 74.3% with a June 2024 cutoff, where training data collection almost certainly predates the book's publication date. Llama 3.1 8B at 72.8% with a December 2023 cutoff is definitively clean. These two models, from different providers and architectures, landing within 1.5 points of each other suggests that ~73–74% is the genuine capability ceiling for 4–8B models on this material without contamination.
Advancements in this field happen rapidly, and newer architectures do tend to score higher across the board. But a consistent 8-point jump that tracks perfectly with undisclosed training cutoffs and late release dates is still a large delta to explain by architecture alone.
This isn't just a CISSP finding: it's a concrete example of how training data contamination inflates LLM benchmark scores when the benchmark is a published book. The training cutoff dates turned this benchmark into an inadvertent contamination study.
For my purposes, it doesn't change the plan. I'm tinkering, not publishing a leaderboard. But it does mean that fine-tuning improvements should be measured against the clean baselines (Llama 3.1 8B and Gemma 3n E4B), not against potentially contaminated scores.
Domain Breakdown
Not all CISSP domains are created equal. Here's how the selected models perform across the eight domains, plus the mixed practice tests (all FP16):
A few things stand out. Ministral 8B and Qwen3 8B are strong across the board, with no domain below 76%. Ministral peaks on Domain 4 (Network Security) at 86.0%, Qwen3 peaks on Domain 3 (Architecture) at 86.9%.
Gemma 3n E4B and Llama 3.1 8B are remarkably consistent: their per-domain scores track within 1–7 points of each other across all eight domains, despite completely different architectures and providers. This is consistent with both models operating at the genuine capability ceiling for their size class. Llama's strongest domain is D6 (Security Assessment) at 79.0%, covering vulnerability scanning, pen testing, and audit techniques that appear frequently in general cybersecurity training data. Practice test scores track closely with domain-specific scores: there's no significant difficulty gap between domain-tagged questions and the mixed practice exams.
Comparing model accuracy against how human candidates rank domain difficulty reveals an interesting inversion. Domain 4 (Network Security), which humans find relatively approachable, is also where models score highest on average (68.4%). The inversion is Domain 1 (Risk Management), consistently rated the hardest for humans because it requires thinking like a manager and memorizing frameworks like NIST and ISO 27001. Models score nearly as well there (68.2%), likely because those frameworks are heavily represented in training data.
The harder story is D6 (Security Assessment) and D7 (Security Operations), which are the 2nd and 3rd hardest domains for humans and also the weakest for models (65.2% and 65.3% respectively). Humans struggle there because those domains reward hands-on experience; models struggle because the questions test applied reasoning that doesn't reduce to pattern matching. The overall domain range for models is only about 3 points (65–68%), far flatter than the variance human candidates show depending on their professional background.
How Many Bits Do You Actually Need?
When it comes to running LLMs locally there's a single dominating question: how much VRAM do I need? Quantization is a popular method to make these large models quite a bit smaller, but at what cost? Across 64 configurations, here's what the data shows.
FP16 vs. Q8 vs. Q4
| Model | Q8 Acc Δ | Q4 Acc Δ | Q8 Tok/s Δ | Q4 Tok/s Δ | Q8 VRAM Δ | Q4 VRAM Δ |
|---|---|---|---|---|---|---|
Q8 is essentially transparent; Q4 is where the field splits. For knowledge-retrieval tasks like this benchmark, no model lost more than 0.6 points at Q8, and several scored marginally higher. At Q4 the picture diverges: Ministral 3 8B, Qwen3 8B, and DeepSeek R1 8B remain virtually flat with spreads under 0.5 points across the full range, while other models begin to show meaningful losses. Whatever the source of their high scores, the knowledge survives aggressive compression intact.
VRAM savings are real; time savings aren't. Moving from FP16 to Q4_K_M cuts memory requirements roughly in half, making deployments that were previously impossible on consumer hardware routine. Inference speed is a different story: most models run only marginally faster at lower precision, typically 1.2–1.5× over FP16 wall time, and some show no meaningful speedup at all. The practical implication is to choose a quantization level based on your VRAM budget, not inference speed requirements.
Llama 3.1 8B pays the steepest price among competitive models. A 3.8-point drop at Q4 is the largest loss in the top half of the field, and it has real consequences: Q4 Llama 3.1 8B lands below the FP16 score of Gemma 3n E4B, a model with a cleaner quantization story.
The Full Quantization Curve
The Llama family and Mistral 7B have seven precision levels each, giving a finer view of the accuracy-vs-compression curve. Llama 3.1 8B is the most instructive:
The curve is non-monotonic: Q5_K_S (72.6%) nearly matches FP16 (72.8%) while Q6_K drops to 70.8%. This isn't an error. K-quant variants use importance-aware quantization that can preserve critical weights better than uniform higher-bit schemes. The practical upshot: Q5_K_S at roughly 5.3GB delivers 99.8% of FP16 accuracy at 1.7× the inference speed.
Performance and Efficiency
All evaluations ran on a single NVIDIA RTX 4090. The full 16-model benchmark for FP16 completed in under 65 minutes. All runs used pre-loaded, warmed-up models; timing reflects steady-state inference, not cold-start load time.
Accuracy vs. Speed
Speed tracks size with few surprises. Models cluster tightly by parameter count in both tokens/sec and wall time. Qwen3 1.7B leads on wall time and is within striking distance of Llama 3.2 1B on tokens/sec. Llama 3.2 1B edges it on raw throughput, but at near-random accuracy that lead is academic.
The Gemma family is noticeably slower than same-size competitors. Gemma 3 1B is the smallest model tested, yet it's slower than Llama 3.2 3B. This is likely due to architecture differences: Gemma 3 uses a different attention mechanism that's compute-heavy relative to its parameter count.
Gemma 3n's hidden cost. The E2B/E4B "effective" parameter counts are misleading. The full 6B/8B weights still load into memory, so these models run like 8B models, not 2–4B ones (7+ min vs. 2–3 min for comparable models). The MatFormer selective-activation architecture is also tricky for general inference backends, and Gemma 3n's verbose outputs compound the wall time penalty further.
Question Difficulty
Across all 16 models at FP16, I can identify which questions are universally easy, universally hard, or divisive.
Question Difficulty Distribution
The Questions Every Model Got Wrong
27 questions were answered incorrectly by every single FP16 model: 0 correct out of 16. I manually reviewed them and confirmed they are legitimate hard questions with correct answers per the ISC2 Common Body of Knowledge (CBK). No dataset errors.
One example captures the flavor. Consider this Domain 1 question:
Which one of the following actions might be taken as part of a business continuity plan?
A. Restoring from backup tapes
B. Implementing RAID
C. Relocating to a cold site
D. Restarting business operations
Every model chose A, C, or D, all disaster recovery actions. The correct answer is B: implementing RAID is a proactive measure that prevents downtime, which is what BCP is about. Restoring, relocating, and restarting are all reactive: they happen after a disaster, making them DR, not BCP. BCP is about continuity (keep running), DR is about recovery (get back up). Models conflate the two because both involve backups, failover, and resilience. The CISSP exam specifically tests whether you understand the boundary.
Across the impossible questions, four failure patterns emerge:
Pattern 1: ISC2-Specific Terminology Traps
Models apply general IT knowledge where the CISSP uses its own vocabulary. Calling ISC2's "clipping" by its practitioner name, "thresholding," is one example; answering "Preventive" for a question about warning signs, where the CBK wants "Directive," is another. The knowledge isn't wrong; the label is.
Pattern 2: Multi-Select With a Convincing Wrong Option
On multi-select questions, models consistently include one distractor that sounds correct but is specifically excluded by the CBK. They add threat modeling as something that reduces threat vectors, not recognizing that the CBK treats threats as external and therefore irreducible. They also skip individual contributors as recipients of audit reports, assuming only management receives them. The model needs to learn not just what is true, but what the CISSP considers true.
Pattern 3: "Which Is NOT True" Requiring Exact CBK Knowledge
Questions that invert the usual pattern ask which statement is false, requiring knowledge precise enough to spot the one exception. Models assign classification authority to the system owner rather than the data owner, where the CBK draws a clear line between the two roles. They also count a PIN and a password as two factors, missing that both are Type 1 and the CBK counts factor types, not credentials. Spotting the false claim requires knowing the CBK precisely enough to identify the one exception to an otherwise true rule.
Pattern 4: Counterintuitive ISC2 Positions
The CISSP sometimes takes positions that contradict current industry practice or common intuition. Models assume modern Bluetooth has sufficient encryption for confidential data; the CBK says it does not. They also jump straight to remediation when a vulnerability is found, where the CBK requires validating the finding first. These are genuinely surprising answers. The exam tests the ISC2 body of knowledge, not industry consensus, and every model applies general knowledge instead.
All 27 questions are teaching moments, not bugs. They map precisely to the kind of knowledge fine-tuning can inject: ISC2-specific terminology, CBK-specific positions, and the precise distinctions the exam tests. These are the teachable weaknesses.
Matching Questions: Right Knowledge, Wrong Bindings
The 16 matching questions are the hardest question type in the benchmark. Average accuracy across all FP16 models was 29%, with four models scoring 0%. I initially suspected a bug in response formatting or evaluation, but the models responded with valid, merely incorrect, selections. They know the definitions but can't correctly assign five items to five descriptions simultaneously.
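Matching questions are scored all-or-nothing, like multi-select. A sketch of such an exact-match scorer, assuming a hypothetical `1-B, 2-A` response format (the harness's actual format may differ):

```python
def parse_matching(response: str) -> dict[str, str]:
    """Parse a response like '1-B, 2-A, 3-C' into {'1': 'B', '2': 'A', '3': 'C'}."""
    pairs = {}
    for tok in response.split(","):
        tok = tok.strip()
        if "-" in tok:
            item, letter = tok.split("-", 1)
            pairs[item.strip()] = letter.strip().upper()
    return pairs

def score_matching(response: str, answer_key: dict[str, str]) -> bool:
    """Exact match only: every item mapped to the right letter, no partial credit."""
    return parse_matching(response) == answer_key

key = {"1": "B", "2": "A", "3": "C"}
score_matching("1-B, 2-A, 3-C", key)   # correct assignment
score_matching("1-A, 2-B, 3-C", key)   # right letters, wrong bindings: scored wrong
```

Under exact-match scoring, a model that knows every definition but swaps two bindings scores the same as one that knows nothing, which is why this question type bottoms out the accuracy table.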
The Gimmes
47 questions (3.6%) were answered correctly by every FP16 model. These cover foundational concepts: what directory service underlies Active Directory, what GDPR requires for EU data, what trademark protection covers. Even Llama 3.2 1B, essentially random on most questions, gets these right.
Model Agreement
Model Agreement Matrix (FP16)
The agreement matrix above measures Jaccard similarity between wrong-answer sets: the fraction of questions that two models both got wrong out of all questions either got wrong. A few patterns reinforce findings from other analyses.
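The metric itself is simple to compute once each model's wrong-answer set is known. A sketch with made-up question IDs:

```python
def jaccard(wrong_a: set[int], wrong_b: set[int]) -> float:
    """Jaccard similarity between two models' wrong-answer question sets."""
    union = wrong_a | wrong_b
    if not union:
        return 0.0
    return len(wrong_a & wrong_b) / len(union)

# Toy example with hypothetical question IDs:
a = {101, 102, 103, 104}   # model A's wrong answers
b = {103, 104, 105}        # model B's wrong answers
jaccard(a, b)  # 2 shared / 5 in union = 0.4
```

Using wrong answers rather than all answers keeps the metric from being inflated by the easy questions everyone gets right.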
Gemma 3n E2B and E4B are nearly identical (0.595). The highest pairwise similarity in the matrix. They share the same MatFormer architecture and training data, just at different activation levels; they fail on almost exactly the same questions. A useful sanity check that the metric is capturing something structural.
Llama 3.2 3B and Phi-4 Mini are the most similar cross-family pair (0.562). Both are clean-baseline models in the 3–4B class (December 2023 and June 2024 cutoffs respectively), and they share 335 wrong answers. Those shared failures represent questions that genuinely strain the 3–4B capability tier, not contamination artifacts, but hard CISSP material that smaller models miss regardless of architecture. That cluster of 335 questions is a direct fine-tuning target.
DeepSeek R1 and Qwen3 8B also show elevated similarity (0.480). This is unsurprising, as DeepSeek R1 8B is a fine-tuned variant of Qwen3. Their shared errors reflect that relationship rather than question difficulty, making the pair less useful as a signal than the clean-baseline pairs.
Ministral 8B and Mistral 7B have the lowest same-family similarity (0.222). Despite sharing the Mistral brand, their error sets are nearly orthogonal. A "model family" is a genealogical label, not a capability claim.
Takeaways
These are small models, though calling an 8B model "small" invites debate. No one is deploying a production pipeline to take the CISSP, but a model that can reason across eight domains of professional certification material at 73% accuracy, zero-shot, is doing something real. The economics of frontier models rule them out for a large class of tasks: private deployments, high-volume inference, edge compute. The results here are evidence that the small-model tier is capable of serious work.
1. The clean ceiling is ~73–74%. Llama 3.1 8B (72.8%, Dec 2023 cutoff) and Gemma 3n E4B (74.3%, Jun 2024 cutoff) converge from different architectures and providers. This is likely the genuine capability limit for 4–8B models on CISSP material without test contamination. Fine-tuning improvements should be measured against this baseline.
2. Training data contamination is real and measurable. Models with undisclosed cutoffs released in late 2024–2025 score 8 points higher than clean models of the same size. This was an inadvertent contamination study, and it's worth being honest about.
3. The 3–4B class is surprisingly competitive. Gemma 3n E4B (74.3%), Gemma 3 4B (69.8%), and Qwen3 4B (76.0%) are all within striking distance of the 8B models. Architecture and training data quality matter far more than raw parameter count.
4. Qwen3 1.7B is the most impressive result per parameter. 69.0% from 1.7B parameters is remarkable, outperforming Phi-4 Mini (3.8B) and Llama 3.2 3B despite being half their size. Its training cutoff is undisclosed, though, so contamination may play a role.
5. Architecture affects inference speed more than parameter count does. Gemma 3 and Gemma 3n are noticeably slower than same-size models from other families; Gemma 3n E4B takes over twice as long as Llama 3.1 8B despite similar size. When inference speed matters, benchmark it directly; parameter count won't tell you.
6. Q8 is a safe default; Q4 is where models split. No model lost more than 0.6 points at Q8. At Q4, some models (Qwen3 8B, Ministral 8B, DeepSeek R1) stay virtually flat; others show meaningful losses. VRAM savings are real: Q4 roughly halves memory requirements, but inference speed gains are marginal. Choose quantization based on VRAM budget, not speed.
What's Next
Three goals drive the next phase:
- Build a worthwhile dataset. Source material is the ISC2 Common Body of Knowledge and broader cybersecurity references. Questions are generated from that material using open LLMs, keeping the dataset grounded in source text rather than frontier model outputs.
- Explore fine-tuning methodology. This is as much about learning the craft as the results: what data formats work, how much data is enough, where different model sizes respond differently to the same training signal.
- Produce a hardware-optimized variant. The end target is an NPU-deployable model: battery-constrained, quantized, running locally on consumer silicon. The model selection below is shaped by what runs well on that hardware.
CISSP accuracy is the measuring stick throughout. These models have clean, well-characterized baselines, so gains from fine-tuning will reflect genuine learning rather than contamination artifacts.
After weighing accuracy, speed, quantization resilience, and contamination risk, I'm narrowing to three models spanning three size classes and two architecture families:
- Llama 3.1 8B: clean accuracy leader at 72.8%, definitively pre-test cutoff (December 2023), proven OpenVINO NPU support.
- Gemma 3 4B: 69.8% with a borderline cutoff (August 2024), genuine 4B parameters, and a different architecture from Llama. At INT8 on the NPU that's roughly 4GB, leaving ample headroom for KV cache.
- Gemma 3 1B: 45.8% baseline, the stress test. If fine-tuning can push a 1B model from the mid-40s into the 60s, that's a stronger result than improving a 72.8% model to 80%. Same architecture as the 4B, so gains between the two isolate the effect of parameter count.
Three models, two architectures, three size classes, all with clean-enough baselines. Fine-tuning gains measured against these starting points will reflect genuine learning, not contamination artifacts. If you know of a clean cybersecurity benchmark to evaluate against, I'd like to hear about it.
The deeper question is whether small models can teach themselves. If the training questions are generated by the same small models that will be fine-tuned on them, the experiment becomes self-referential: can a model improve from its own outputs when those outputs are grounded in source material? That's the question the next post sets out to answer.
All models were evaluated via Ollama on Windows with a single NVIDIA RTX 4090. Evaluation harness and analysis scripts: johnhringiv/cissp-model-eval.