Anvil — Gemma 4 31B SFT Pipeline

v4 Targeted Retrain

Weakness-targeted SFT on base model · 696 examples, 500 iters

v4 Composite: TBD (eval pending)
v2 Composite: 3.93 (3-run, mlx-vlm, -5.3% vs v1)
v1 Composite: 4.15 (3-run, eval-server, +8.1% vs base)
Base Composite: 3.84 (3-run avg, eval-server)

Iteration History

v4 — Targeted Retrain CURRENT
2026-04-05 · 696 examples · 500 iters · val loss 1.552
Trained on base model (not stacked). 626 train / 70 valid. Weakness-targeted: +50 tool use, +44 calibration, +40 contradiction, +35 multi-step reasoning. Training complete, eval pending.
v2 — Behavioral Stacking
2026-04-04 · Composite: 3.93 · -5.3% vs v1
199 behavioral examples stacked on v1. Regressed across most categories. Stacking approach didn't work — training on already-fine-tuned weights degraded quality.
v1 — First SFT BEST
2026-04-04 · Composite: 4.15 · +8.1% vs base
Proof of concept. Zero knowledge regression on GSM8K/MMLU-Pro. Strongest gains in tool use (+0.60) and edge cases (+0.57). Weakest: tool_multi_step (1.00), calibration (2.00), contradiction (2.03).
Base — Gemma 4 31B IT
Composite: 3.84 · Vanilla bf16
Instruction-tuned baseline. Strong conversational (4.31) but weak on tool use (3.46) and edge cases (3.30).

v4 Training Data TARGETED

696 total examples (626 train / 70 valid). Trained on base model, not stacked. Weakness categories weighted by eval gap.

408 Legacy (v1/v2/v3) · 50 Tool Use · 44 Calibration · 40 Contradiction
Category | Examples | Eval Gap Targeted
tool_use_sophisticated | 50 | Score 1.0 tool_multi_step empty responses
calibration_epistemic | 44 | Score 2.0 calibration_insufficient failures
contradiction_handling | 40 | Score 2.0 edge_contradiction missed
multi_step_reasoning | 35 | Score 1.0 multi-step chain failures
conversation_quality | 30 | Score 3.26 brevity and empathy
voice_personality | 25 | Stream-of-consciousness, casual tone
edge_anti_hallucination | 25 | Hallucination prevention patterns
Legacy (v1/v2/v3) | 408 | Core PE domain + general quality
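The exact splitting logic behind the 626/70 train/valid split isn't documented; as a hedged sketch, the category counts from the table above (which sum to 657, slightly under the 696 headline total) could be split with a plain seeded random shuffle like this:

```python
import random

# Per-category counts copied from the v4 data table above.
CATEGORY_COUNTS = {
    "tool_use_sophisticated": 50,
    "calibration_epistemic": 44,
    "contradiction_handling": 40,
    "multi_step_reasoning": 35,
    "conversation_quality": 30,
    "voice_personality": 25,
    "edge_anti_hallucination": 25,
    "legacy": 408,
}

def split_train_valid(examples, valid_frac=0.1, seed=42):
    """Shuffle and split examples into train/valid sets.

    Generic sketch only -- not the actual pipeline code; the real split
    may be stratified by category or deterministic in another way.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    return shuffled[n_valid:], shuffled[:n_valid]

examples = [
    {"category": cat, "idx": i}
    for cat, n in CATEGORY_COUNTS.items()
    for i in range(n)
]
train, valid = split_train_valid(examples)
print(len(examples), len(train), len(valid))  # 657 591 66
```

A ~10% validation fraction reproduces roughly the same train/valid ratio as the reported 626/70 split.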

v4 Training Loss

SFT Loss — 500 steps, LoRA rank=16, lr=5e-6, 16 layers
Parameter | Value
Base model | gemma-4-31b-it-bf16 (58 GB)
LoRA rank / scale | 16 / 10.0
Target layers | Last 16 of 60 (all attn + MLP)
Trainable params | 16.3M / 30,697M (0.053%)
Training steps | 500
Batch size / lr | 1 / 5e-6
Max seq length | 2048
Grad checkpointing | Yes
Final train loss | 1.842
Final val loss | 1.552
Peak GPU memory | 64.1 GB (on 96 GB M3 Ultra)
Time per step | ~2.0 s avg
Total training time | ~17 min

Category Scores — v1 vs v2 vs Base

Phipps Eval v3 (3-run avg, Sonnet judge)
Category | v1 | v2 | Base | v1 vs Base
PE Domain | 4.23 | 4.00 | 4.04 | +0.19
Conversation | 4.27 | 4.24 | 4.31 | -0.04
Tool Use | 4.06 | 3.78 | 3.46 | +0.60
Voice | 4.46 | 4.25 | 4.35 | +0.11
Edge Cases | 3.87 | 3.57 | 3.30 | +0.57
Composite | 4.15 | 3.93 | 3.84 | +0.31

v4 Target Tests — Worst Performers

Tests scoring below 3.0 (v2 eval, 3-run avg)
Test | v2 Score | v1 Score | Issue | v4 Data Category
tool_multi_step | 1.00 | 1.00 | Empty response after tool calls | tool_use_sophisticated (50)
edge_contradiction | 1.99 | 2.03 | Missed contradictory info | contradiction_handling (40)
calibration_insufficient | 2.00 | — | No uncertainty expression | calibration_epistemic (44)
pe_hidden_flaw | 2.39 | — | Missed hidden deal flaws | Legacy PE domain

Knowledge Preservation (v1)

GSM8K-CoT (Math Reasoning)
MMLU-Pro (Knowledge)
Benchmark | SFT v1 | Base | Delta | Sig.
GSM8K-CoT (flexible, n=500) | 96.0% | 96.2% | -0.2pp | p=0.87
MMLU-Pro (avg, n=504) | 83.1% | 84.3% | -1.2pp | p=0.61
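The reported p-values are consistent with a pooled two-proportion z-test on the accuracies (whether that is the exact test used isn't stated); a minimal sketch:

```python
import math

def two_proportion_p(p1, p2, n1, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

# GSM8K-CoT: 96.0% (SFT v1) vs 96.2% (base), n=500 each
print(round(two_proportion_p(0.960, 0.962, 500, 500), 2))  # 0.87

# MMLU-Pro: 83.1% vs 84.3%, n=504 each
print(round(two_proportion_p(0.831, 0.843, 504, 504), 2))  # 0.61
```

Both computed values match the table, supporting the conclusion that neither regression is statistically significant.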

Lessons Learned

Stacking doesn't work. v2 trained on v1-fused weights regressed -5.3%. Training on already-fine-tuned weights degrades quality across all categories.
Train on base, every time. v4 returns to base model as starting point. All improvements should come from data quality, not weight stacking.
Target the gaps. v4 dataset is 3.5x larger than v2 (696 vs 199) with examples specifically written for the failing test categories.
Bigger dataset, lower lr. v4 uses lr=5e-6 (vs v2's 2e-5) and 500 iters (vs 400) to avoid overfitting with the larger dataset.

Pipeline Architecture

Training: mlx-lm-lora v1.1.8 on Mac Studio M3 Ultra (96GB). LoRA on all attention + MLP projections in last 16 layers.
Fusing: Auto-fused by training script. VLM-aware merge for serving on mlx-vlm with vision/audio tower preservation.
Serving: mlx-vlm with TurboQuant KV-4bit compression. OpenAI-compatible API.
Evaluation: Phipps Eval v3 — 33 tests, 5 categories, Sonnet judge via claude -p. 3-run averaged for statistical robustness.
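Because the serving layer exposes an OpenAI-compatible API, any standard chat-completions client can talk to it. A minimal sketch follows; the endpoint URL and model name are placeholders, not values from this document, and the network call itself is commented out since it requires a running server:

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host/port to wherever mlx-vlm serves.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "anvil-v4",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize the key risks in this deal memo."}
    ],
    "temperature": 0.2,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(request.get_method(), payload["model"])
```

The same request shape works against any OpenAI-compatible server, which is what makes the eval harness portable across mlx-vlm and the eval-server backends mentioned above.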