Weakness-targeted SFT on base model · 696 examples, 500 iters
696 total examples (626 train / 70 valid). Trained directly on the base model rather than stacked on earlier fine-tunes. Weakness categories are weighted by eval gap.
| Category | Examples | Eval Gap Targeted |
|---|---|---|
| tool_use_sophisticated | 50 | Score 1.0 tool_multi_step empty responses |
| calibration_epistemic | 44 | Score 2.0 calibration_insufficient failures |
| contradiction_handling | 40 | Score 2.0 edge_contradiction missed |
| multi_step_reasoning | 35 | Score 1.0 multi-step chain failures |
| conversation_quality | 30 | Score 3.26 brevity and empathy |
| voice_personality | 25 | Stream-of-consciousness, casual tone |
| edge_anti_hallucination | 25 | Hallucination prevention patterns |
| Legacy (v1/v2/v3) | 408 | Core PE domain + general quality |
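The gap-weighted allocation above can be sketched as follows. This is a hypothetical illustration, not the actual pipeline: the proportional-to-gap rule, the `allocate` helper, and the 100-example budget are all assumptions.

```python
# Hypothetical sketch: split a budget of new examples across weakness
# categories in proportion to their eval gap (5.0 - observed score).
def allocate(budget, gaps):
    total = sum(gaps.values())
    return {k: round(budget * g / total) for k, g in gaps.items()}

gaps = {  # gap = 5.0 - score, scores taken from the v2 eval table
    "tool_use_sophisticated": 5.0 - 1.0,
    "calibration_epistemic": 5.0 - 2.0,
    "contradiction_handling": 5.0 - 2.0,
}
alloc = allocate(100, gaps)
# the worst category (score 1.0) gets the largest share of the budget
```

The real counts in the table were presumably also shaped by authoring cost per category, so a pure proportional rule is only an approximation.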

| Parameter | Value |
|---|---|
| Base model | gemma-4-31b-it-bf16 (58GB) |
| LoRA rank / scale | 16 / 10.0 |
| Target layers | Last 16 of 60 (all attn + MLP) |
| Trainable params | 16.3M / 30,697M (0.053%) |
| Training steps | 500 |
| Batch size / lr | 1 / 5e-6 |
| Max seq length | 2048 |
| Grad checkpoint | Yes |
| Final train loss | 1.842 |
| Final val loss | 1.552 |
| Peak GPU memory | 64.1 GB (on 96 GB M3 Ultra) |
| Time per step | ~2.0s avg |
| Total training time | ~17 min |
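A sketch of the training invocation matching the config table. Flag names follow the `mlx_lm.lora`-style CLI; the exact flags accepted by mlx-lm-lora v1.1.8, the model path, and the data directory are assumptions.

```shell
# Hypothetical invocation; model path and data dir are placeholders.
mlx_lm.lora \
  --model ./gemma-4-31b-it-bf16 \
  --train \
  --data ./weakness_sft_data \
  --iters 500 \
  --batch-size 1 \
  --learning-rate 5e-6 \
  --lora-layers 16 \
  --max-seq-length 2048 \
  --grad-checkpoint
```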

| Category | v1 | v2 | Base | v1 vs Base |
|---|---|---|---|---|
| PE Domain | 4.23 | 4.00 | 4.04 | +0.19 |
| Conversation | 4.27 | 4.24 | 4.31 | -0.04 |
| Tool Use | 4.06 | 3.78 | 3.46 | +0.60 |
| Voice | 4.46 | 4.25 | 4.35 | +0.11 |
| Edge Cases | 3.87 | 3.57 | 3.30 | +0.57 |
| Composite | 4.15 | 3.93 | 3.84 | +0.31 |

| Test | v2 Score | v1 Score | Issue | v4 Data Category |
|---|---|---|---|---|
| tool_multi_step | 1.00 | 1.00 | Empty response after tool calls | tool_use_sophisticated (50) |
| edge_contradiction | 1.99 | 2.03 | Missed contradictory info | contradiction_handling (40) |
| calibration_insufficient | 2.00 | — | No uncertainty expression | calibration_epistemic (44) |
| pe_hidden_flaw | 2.39 | — | Missed hidden deal flaws | Legacy PE domain |

| Benchmark | SFT v1 | Base | Delta | Sig. |
|---|---|---|---|---|
| GSM8K-CoT (flexible, n=500) | 96.0% | 96.2% | -0.2pp | p=0.87 |
| MMLU-Pro (avg, n=504) | 83.1% | 84.3% | -1.2pp | p=0.61 |
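The p-values are consistent with a pooled two-proportion z-test on the accuracy counts. A minimal sketch for the GSM8K row, assuming that test was used (the source does not state which significance test was run):

```python
import math

def two_prop_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns the two-sided p-value."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# GSM8K-CoT row: 96.0% vs 96.2% of n=500 -> 480 vs 481 correct
p_value = two_prop_ztest(480, 500, 481, 500)  # ~= 0.87, matching the table
```

With n=500 per arm, a 0.2pp gap is far below the test's resolution, so the benchmark result is a "no measurable regression" finding rather than a tie.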
Stacked v2 vs v1 composite: -5.3%. Training on already-fine-tuned weights degrades quality across all categories.
Hyperparameters: lr=5e-6 (vs v2's 2e-5) and 500 iters (vs 400) to avoid overfitting with the larger dataset.
Training: mlx-lm-lora v1.1.8 on a Mac Studio M3 Ultra (96 GB). LoRA on all attention + MLP projections in the last 16 layers.
Serving: mlx-vlm with TurboQuant KV-4bit compression, OpenAI-compatible API.
Evaluation: `claude -p`, averaged over 3 runs for statistical robustness.