Anvil v1 Benchmark Report

Gemma 4 31B — SFT vs Base

Phipps Eval (Sonnet-judged, 3-run average) + lm-eval public benchmarks

SFT 31B · Phipps Composite: 4.15 · Wins 4/5 categories · +0.25 vs Base
Base 31B · Phipps Composite: 3.90 · Wins Conversation only · −0.25 vs SFT
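
A minimal sketch of how a Sonnet-judged, 3-run-averaged composite like the one above can be computed. All names here (judge_score, transcripts_by_category) are illustrative, not the actual Phipps harness:

```python
# Illustrative sketch of the scoring scheme described above: each transcript
# is rated 1-5 by a judge model, three independent judge runs are averaged,
# and per-category means roll up into the composite. All names are
# hypothetical; the actual Phipps Eval harness is not part of this report.
from statistics import mean

def phipps_composite(transcripts_by_category, judge_score, runs=3):
    """Return (composite, per-category means) from run-averaged judge scores."""
    category_means = {
        category: mean(
            mean(judge_score(t) for _ in range(runs))  # 3-run average per transcript
            for t in transcripts
        )
        for category, transcripts in transcripts_by_category.items()
    }
    return mean(category_means.values()), category_means
```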

Phipps Eval — Category Breakdown

[Bar chart: Score by Category (1–5 scale), SFT 31B vs Base 31B]

Phipps Eval — Full Scores

Model      Composite  Domain  Conv   Tool   Voice  Edge
SFT 31B    4.15       4.23    4.27   4.06   4.46   3.87
Base 31B   3.90       4.02    4.43   3.60   4.25   3.40

SFT Gains vs Base

[Bar chart: SFT − Base per category]
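
The chart's deltas can be recomputed directly from the Full Scores table above; a quick check in Python:

```python
# Recompute the per-category SFT - Base deltas from the Full Scores table.
sft  = {"Domain": 4.23, "Conv": 4.27, "Tool": 4.06, "Voice": 4.46, "Edge": 3.87}
base = {"Domain": 4.02, "Conv": 4.43, "Tool": 3.60, "Voice": 4.25, "Edge": 3.40}

for cat in sft:
    print(f"{cat:6s} {sft[cat] - base[cat]:+.2f}")
# Domain +0.21, Conv -0.16, Tool +0.46, Voice +0.21, Edge +0.47
# Composite: 4.15 - 3.90 = +0.25 in favor of SFT.
```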

lm-eval Public Benchmarks

GSM8K-CoT (Math Reasoning) · MMLU-Pro (Knowledge)

Benchmark              SFT 31B        Base 31B
GSM8K-CoT (flexible)   96.0% (done)   running (51%)
GSM8K-CoT (strict)     93.0%          queued
MMLU-Pro (avg)         83.1%          queued
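
These numbers should be reproducible with EleutherAI's lm-evaluation-harness; a hedged sketch using its Python API follows. The checkpoint path is a placeholder, and the task names assume the harness's stock gsm8k_cot and mmlu_pro configs (gsm8k_cot reports both flexible-extract and strict-match filters):

```python
# Sketch of reproducing the lm-eval numbers above via the harness's Python
# API. The pretrained path is a placeholder; task names assume the stock
# gsm8k_cot and mmlu_pro task configs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                         # HuggingFace backend
    model_args="pretrained=/path/to/anvil-v1-sft-31b",  # placeholder checkpoint
    tasks=["gsm8k_cot", "mmlu_pro"],
    batch_size=8,
)
print(results["results"])  # flexible-extract / strict-match for GSM8K-CoT
```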

MMLU-Pro by Subject (SFT 31B)

[Bar chart: per-subject MMLU-Pro accuracy for SFT 31B]

Key Takeaways

SFT wins 4 of 5 Phipps categories with +0.25 composite gain over base.
Biggest gains: Tool Calling +0.46 and Edge Cases +0.47 — the areas we trained for.
Base beats SFT only on Conversation (4.43 vs 4.27) — SFT slightly over-optimizes for structured tasks.
GSM8K-CoT 96.0% (flexible) and MMLU-Pro 83.1% confirm SFT preserved reasoning capability.
Base lm-eval still running — full comparison pending. ETA mid-morning.
Next: v2 SFT targeting remaining Tool and Edge gaps.