Anvil — Gemma 4 31B SFT Pipeline

v4 Targeted Retrain

Weakness-targeted SFT on base model · 696 examples, 500 iters

v4 Composite: TBD (eval pending)
v2 Composite: 3.93 (3-run, mlx-vlm, -5.3% vs v1)
v1 Composite: 4.15 (3-run, eval-server, +8.1% vs base)
Base Composite: 3.84 (3-run avg, eval-server)

Iteration History

v4 — Targeted Retrain CURRENT
2026-04-05 · 696 examples · 500 iters · val loss 1.552
Trained on base model (not stacked). 626 train / 70 valid. Weakness-targeted: +50 tool use, +44 calibration, +40 contradiction, +35 multi-step reasoning. Training complete, eval pending.
v2 — Behavioral Stacking
2026-04-04 · Composite: 3.93 · -5.3% vs v1
199 behavioral examples stacked on v1. Regressed across most categories. Stacking approach didn't work — training on already-fine-tuned weights degraded quality.
v1 — First SFT BEST
2026-04-04 · Composite: 4.15 · +8.1% vs base
Proof of concept. Zero knowledge regression on GSM8K/MMLU-Pro. Strongest gains in tool use (+0.60) and edge cases (+0.57). Weakest: tool_multi_step (1.00), calibration (2.00), contradiction (2.03).
Base — Gemma 4 31B IT
Composite: 3.84 · Vanilla bf16
Instruction-tuned baseline. Strong conversational (4.31) but weak on tool use (3.46) and edge cases (3.30).

v4 Training Data TARGETED

696 total examples (626 train / 70 valid). Trained on base model, not stacked. Weakness categories weighted by eval gap.

408 Legacy (v1/v2/v3) · 50 Tool Use · 44 Calibration · 40 Contradiction
Category | Examples | Eval Gap Targeted
tool_use_sophisticated | 50 | Score 1.0 tool_multi_step empty responses
calibration_epistemic | 44 | Score 2.0 calibration_insufficient failures
contradiction_handling | 40 | Score 2.0 edge_contradiction missed
multi_step_reasoning | 35 | Score 1.0 multi-step chain failures
conversation_quality | 30 | Score 3.26 brevity and empathy
voice_personality | 25 | Stream-of-consciousness, casual tone
edge_anti_hallucination | 25 | Hallucination prevention patterns
Legacy (v1/v2/v3) | 408 | Core PE domain + general quality
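The exact splitting logic behind the 626/70 train/valid split isn't documented; as a hedged sketch, the category counts from the table above (which sum to 657, slightly under the 696 headline total) could be split with a plain seeded random shuffle like this:

```python
import random

# Per-category counts copied from the v4 data table above.
CATEGORY_COUNTS = {
    "tool_use_sophisticated": 50,
    "calibration_epistemic": 44,
    "contradiction_handling": 40,
    "multi_step_reasoning": 35,
    "conversation_quality": 30,
    "voice_personality": 25,
    "edge_anti_hallucination": 25,
    "legacy": 408,
}

def split_train_valid(examples, valid_frac=0.1, seed=42):
    """Shuffle and split examples into train/valid sets.

    Generic sketch only -- not the actual pipeline code; the real split
    may be stratified by category or deterministic in another way.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    return shuffled[n_valid:], shuffled[:n_valid]

examples = [
    {"category": cat, "idx": i}
    for cat, n in CATEGORY_COUNTS.items()
    for i in range(n)
]
train, valid = split_train_valid(examples)
print(len(examples), len(train), len(valid))  # 657 591 66
```

A ~10% validation fraction reproduces roughly the same train/valid ratio as the reported 626/70 split.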

v4 Training Loss

SFT Loss — 500 steps, LoRA rank=16, lr=5e-6, 16 layers
Parameter | Value
Base model | gemma-4-31b-it-bf16 (58 GB)
LoRA rank / scale | 16 / 10.0
Target layers | Last 16 of 60 (all attn + MLP)
Trainable params | 16.3M / 30,697M (0.053%)
Training steps | 500
Batch size / lr | 1 / 5e-6
Max seq length | 2048
Grad checkpointing | Yes
Final train loss | 1.842
Final val loss | 1.552
Peak GPU memory | 64.1 GB (on 96 GB M3 Ultra)
Time per step | ~2.0 s avg
Total training time | ~17 min

Category Scores — v1 vs v2 vs Base

Phipps Eval v3 (3-run avg, Sonnet judge)
Category | v1 | v2 | Base | v1 vs Base
PE Domain | 4.23 | 4.00 | 4.04 | +0.19
Conversation | 4.27 | 4.24 | 4.31 | -0.04
Tool Use | 4.06 | 3.78 | 3.46 | +0.60
Voice | 4.46 | 4.25 | 4.35 | +0.11
Edge Cases | 3.87 | 3.57 | 3.30 | +0.57
Composite | 4.15 | 3.93 | 3.84 | +0.31

v4 Target Tests — Worst Performers

Tests scoring below 3.0 (v2 eval, 3-run avg)
Test | v2 Score | v1 Score | Issue | v4 Data Category
tool_multi_step | 1.00 | 1.00 | Empty response after tool calls | tool_use_sophisticated (50)
edge_contradiction | 1.99 | 2.03 | Missed contradictory info | contradiction_handling (40)
calibration_insufficient | 2.00 | — | No uncertainty expression | calibration_epistemic (44)
pe_hidden_flaw | 2.39 | — | Missed hidden deal flaws | Legacy PE domain

Knowledge Preservation (v1)

GSM8K-CoT (Math Reasoning)
MMLU-Pro (Knowledge)
Benchmark | SFT v1 | Base | Delta | Sig.
GSM8K-CoT (flexible, n=500) | 96.0% | 96.2% | -0.2pp | p=0.87
MMLU-Pro (avg, n=504) | 83.1% | 84.3% | -1.2pp | p=0.61
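The reported p-values are consistent with a pooled two-proportion z-test on the accuracies (whether that is the exact test used isn't stated); a minimal sketch:

```python
import math

def two_proportion_p(p1, p2, n1, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

# GSM8K-CoT: 96.0% (SFT v1) vs 96.2% (base), n=500 each
print(round(two_proportion_p(0.960, 0.962, 500, 500), 2))  # 0.87

# MMLU-Pro: 83.1% vs 84.3%, n=504 each
print(round(two_proportion_p(0.831, 0.843, 504, 504), 2))  # 0.61
```

Both computed values match the table, supporting the conclusion that neither regression is statistically significant.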

Lessons Learned

Stacking doesn't work. v2 trained on v1-fused weights regressed -5.3%. Training on already-fine-tuned weights degrades quality across all categories.
Train on base, every time. v4 returns to base model as starting point. All improvements should come from data quality, not weight stacking.
Target the gaps. v4 dataset is 3.5x larger than v2 (696 vs 199) with examples specifically written for the failing test categories.
Bigger dataset, lower lr. v4 uses lr=5e-6 (vs v2's 2e-5) and 500 iters (vs 400) to avoid overfitting with the larger dataset.

Pipeline Architecture

Training: mlx-lm-lora v1.1.8 on Mac Studio M3 Ultra (96GB). LoRA on all attention + MLP projections in last 16 layers.
Fusing: Auto-fused by training script. VLM-aware merge for serving on mlx-vlm with vision/audio tower preservation.
Serving: mlx-vlm with TurboQuant KV-4bit compression. OpenAI-compatible API.
Evaluation: Phipps Eval v3 — 33 tests, 5 categories, Sonnet judge via claude -p. 3-run averaged for statistical robustness.
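Because the serving layer exposes an OpenAI-compatible API, any standard chat-completions client can talk to it. A minimal sketch follows; the endpoint URL and model name are placeholders, not values from this document, and the network call itself is commented out since it requires a running server:

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host/port to wherever mlx-vlm serves.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "anvil-v4",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize the key risks in this deal memo."}
    ],
    "temperature": 0.2,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(request.get_method(), payload["model"])
```

The same request shape works against any OpenAI-compatible server, which is what makes the eval harness portable across mlx-vlm and the eval-server backends mentioned above.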