Phipps Eval (Sonnet-judged, 3-run avg) + lm-eval public benchmarks
| Model | Composite | PE Domain | Conversation | Tool Use | Voice | Edge Cases |
|---|---|---|---|---|---|---|
| SFT 31B | 4.15 | 4.23 | 4.27 | 4.06 | 4.46 | 3.87 |
| Base 31B | 3.90 | 4.02 | 4.43 | 3.60 | 4.25 | 3.40 |

| Benchmark | SFT 31B | Base 31B |
|---|---|---|
| GSM8K-CoT (flexible) | 96.0% (done) | running (51%) |
| GSM8K-CoT (strict) | 93.0% | queued |
| MMLU-Pro (avg) | 83.1% | queued |
SFT gives a +0.25 composite gain over base. The largest improvements are Tool Use (+0.46) and Edge Cases (+0.47), the areas we trained for. GSM8K-CoT at 96% and MMLU-Pro at 83.1% confirm SFT preserved reasoning capability.
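As a sanity check, the per-dimension deltas quoted above can be recomputed directly from the Phipps Eval table. This is an illustrative sketch, not part of the eval harness; the dict names are hypothetical and the scores are copied from the table:

```python
# Hypothetical sketch: recompute per-dimension deltas (SFT minus base)
# from the Phipps Eval table. Scores copied from the 3-run averages above.
sft = {"Composite": 4.15, "PE Domain": 4.23, "Conversation": 4.27,
       "Tool Use": 4.06, "Voice": 4.46, "Edge Cases": 3.87}
base = {"Composite": 3.90, "PE Domain": 4.02, "Conversation": 4.43,
        "Tool Use": 3.60, "Voice": 4.25, "Edge Cases": 3.40}

# Round to two decimals to match the table's precision.
deltas = {k: round(sft[k] - base[k], 2) for k in sft}
print(deltas)
```

Running this shows Composite +0.25, Tool Use +0.46, and Edge Cases +0.47, and also surfaces the one regression (Conversation, -0.16) that the summary glosses over.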