Anvil — Gemma 4 31B SFT Pipeline

v2 Training & Evaluation

Behavioral SFT stacked on v1 · mlx-vlm + TurboQuant serving

| Metric | Score | Notes |
|---|---|---|
| SFT v2 Composite | — | 3-run eval pending (running) |
| SFT v1 Composite | 4.15 | 3-run avg, eval-server |
| Base Composite | 3.84 | 3-run avg, eval-server |

v2 Training Data NEW

199 behavioral examples targeting v1's weakest areas. No PE domain data — purely behavioral improvements stacked on v1 SFT.

| Category | Examples | Patterns Trained |
|---|---|---|
| Edge Cases | 80 | Contradiction detection, calibration, anti-hallucination, structured output, format compliance |
| Tool Use | 69 | Search-before-answering, calculator discipline, multi-step chains, tool restraint |
| Conversational | 30 | Topic pivots, brevity, multi-turn coherence, empathy without sycophancy |
| Voice | 20 | Anti-sycophancy, opinionated responses, stream-of-consciousness delivery |
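The report doesn't show the actual record schema for the 199 examples, so the field names below ("messages", "role", "content", "category") are assumptions; this is only a plausible chat-style JSONL sketch of one edge-case (contradiction detection) example.

```python
import json

# Hypothetical behavioral SFT record in chat-messages JSONL form.
# Schema is assumed, not taken from the actual training set.
example = {
    "messages": [
        {"role": "user",
         "content": "Earlier you said the meeting is Tuesday, but the "
                    "invite says Wednesday. Which is it?"},
        {"role": "assistant",
         "content": "Those two sources contradict each other: I said "
                    "Tuesday, the invite says Wednesday. I can't resolve "
                    "that from memory alone; the invite is the "
                    "authoritative source, so go with Wednesday."},
    ],
    "category": "edge_cases",  # one of the four categories above
}
line = json.dumps(example)  # one record per line in a .jsonl file
```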

v2 Training Loss

SFT Loss — 400 steps, LoRA rank=16, lr=2e-5
| Parameter | Value |
|---|---|
| Base model | v1 SFT fused onto VLM (69 GB bf16) |
| LoRA rank / scale | 16 / 10.0 |
| Target layers | Last 12 of 60 (all attn + MLP) |
| Trainable params | 24.5M / 30,697M (0.08%) |
| Training steps | 400 |
| Batch size / lr | 1 / 2e-5 |
| Max seq length | 1024 |
| Grad checkpointing | Yes |
| Peak GPU memory | ~81 GB (on 96 GB M3 Ultra) |
| Time per step | ~1.5 s avg |
| Total training time | ~10 min |
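The 0.08% trainable figure follows directly from the table's parameter counts. A minimal check, with a per-adapter size formula included for reference (the model's actual projection dimensions aren't stated in this report, so the `lora_adapter_params` call below uses toy values):

```python
def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters for one LoRA pair: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

trainable = 24.5e6    # from the table
total = 30_697e6      # full model parameter count
pct = 100 * trainable / total
print(f"{pct:.2f}% trainable")  # -> 0.08% trainable
```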

v1 GATE Failures — What v2 Targets

| Test | v1 Score | Issue | v2 Fix Category |
|---|---|---|---|
| tool_multi_step | 1.00 | Empty response after tool calls | Tool Use — multi-step chains |
| tool_memory_recall | 2.00 | Memory search not triggered | Tool Use — search-before-answering |
| voice_casual | 2.00 | Bossy tone instead of casual | Voice — anti-sycophancy |
| edge_contradiction | 2.03 | Missed contradictory info | Edge Cases — contradiction detection |

Note: tool_multi_step and voice_casual also failed on eval-server (run 1 scored 0.00 and 3.10 respectively). These are high-variance tests with genuine model weaknesses.


v1 Reference — Phipps Eval v3

[Charts: Category Scores (3-run avg) · SFT v1 Delta vs Base]

mlx-vlm Serving Validation

- Base model: equivalent. mlx-vlm 3.85 vs eval-server 3.84 — within noise.
- SFT v1: within variance. mlx-vlm 1-run 4.00 vs eval-server 3-run 4.15; the 0.15 gap is driven by 3 high-variance tests that also fail on eval-server.
- VLM fusing required a manual LoRA merge (W + (A @ B).T * scale) — mlx_lm.fuse outputs text-only weights incompatible with mlx-vlm. All shards saved with metadata={'format': 'mlx'}.
- TurboQuant KV-4bit: no measurable quality impact; reduces KV cache memory ~4x.
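The ~4x KV-cache saving is just the bit-width ratio (bf16 at 16 bits per value vs 4-bit values). A sketch of the arithmetic; the layer count (60) comes from the training table, but the head count, head dimension, and sequence length below are hypothetical:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> int:
    # K and V tensors per layer (factor 2), ignoring the small overhead
    # of quantization scales/zero-points
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value // 8

bf16 = kv_cache_bytes(60, 8, 128, 8192, 16)  # dims hypothetical except 60 layers
q4 = kv_cache_bytes(60, 8, 128, 8192, 4)
print(bf16 // q4)  # -> 4
```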

Knowledge Preservation (v1)

[Charts: GSM8K-CoT (Math Reasoning) · MMLU-Pro (Knowledge)]
| Benchmark | SFT v1 | Base | Delta | Sig. |
|---|---|---|---|---|
| GSM8K-CoT (flexible, n=500) | 96.0% | 96.2% | -0.2pp | p=0.87 |
| MMLU-Pro (avg, n=504) | 83.1% | 84.3% | -1.2pp | p=0.61 |

Pipeline Architecture

- Training: mlx-lm-lora v1.1.8 on Mac Studio M3 Ultra (96 GB). LoRA on all attention + MLP projections in the last 12 layers.
- Fusing: custom VLM-aware merge script. Manual W + (A @ B).T * scale applied to the VLM base shards, preserving vision-tower and audio-tower weights.
- Serving: mlx-vlm with TurboQuant KV-4bit compression on port 8092. OpenAI-compatible API.
- Evaluation: Phipps Eval v3 — 33 tests, 5 categories, Sonnet judge via claude -p. 3-run averaged for statistical robustness.
- Stacking approach: the v2 LoRA is trained on the v1 fused model (not the base), preserving PE domain gains while adding behavioral improvements.
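The manual merge formula used in the fusing step can be sketched as below. NumPy stands in for mlx arrays here, and the shape conventions (W stored row-major as (d_out, d_in), A as (d_in, r), B as (r, d_out)) are assumptions inferred from the transpose in the formula, not confirmed by the report:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               scale: float) -> np.ndarray:
    """Fuse a LoRA adapter into a base weight: W' = W + (A @ B).T * scale.

    Assumed shapes: W (d_out, d_in), A (d_in, r), B (r, d_out), so that
    (A @ B).T matches W. With scale=0 the base weight is returned unchanged.
    """
    return W + (A @ B).T * scale

# Shape sanity check with toy dimensions
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
A = rng.standard_normal((32, 16))   # rank 16, as in training
B = rng.standard_normal((16, 64))
W_fused = merge_lora(W, A, B, scale=10.0)
assert W_fused.shape == W.shape
assert np.allclose(merge_lora(W, A, B, 0.0), W)
```

Applying this per target projection to the VLM base shards (rather than using mlx_lm.fuse) is what keeps the vision- and audio-tower weights intact in the output.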