Behavioral SFT stacked on v1 · mlx-vlm + TurboQuant serving
199 behavioral examples targeting v1's weakest areas. No PE domain data — purely behavioral improvements stacked on v1 SFT.

| Category | Examples | Patterns Trained |
|---|---|---|
| Edge Cases | 80 | Contradiction detection, calibration, anti-hallucination, structured output, format compliance |
| Tool Use | 69 | Search-before-answering, calculator discipline, multi-step chains, tool restraint |
| Conversational | 30 | Topic pivots, brevity, multi-turn coherence, empathy without sycophancy |
| Voice | 20 | Anti-sycophancy, opinionated responses, stream-of-consciousness delivery |
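For concreteness, one of these behavioral examples might be shaped like the following. The schema here is an assumption for illustration — the actual dataset format is not shown above — but chat-style message lists are the common SFT convention:

```python
# Hypothetical shape of one behavioral SFT example (schema assumed,
# not taken from the actual dataset). The assistant turn demonstrates
# the trained pattern — here, contradiction detection.
example = {
    "category": "edge_cases",
    "pattern": "contradiction detection",
    "messages": [
        {"role": "user",
         "content": "Earlier you said the meeting is Tuesday. "
                    "Book a room for the Wednesday meeting."},
        {"role": "assistant",
         "content": "Quick flag: you previously said the meeting is "
                    "Tuesday, but now you said Wednesday. Which day "
                    "should I book?"},
    ],
}
```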

| Parameter | Value |
|---|---|
| Base model | v1 SFT fused onto VLM (69GB bf16) |
| LoRA rank / scale | 16 / 10.0 |
| Target layers | Last 12 of 60 (all attn + MLP) |
| Trainable params | 24.5M / 30,697M (0.08%) |
| Training steps | 400 |
| Batch size / lr | 1 / 2e-5 |
| Max seq length | 1024 |
| Grad checkpoint | Yes |
| Peak GPU memory | ~81 GB (on 96 GB M3 Ultra) |
| Time per step | ~1.5s avg |
| Total training time | ~10 min |
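The small trainable fraction follows directly from the LoRA construction: a frozen projection of shape (d_out, d_in) gains only `rank * (d_in + d_out)` adapter parameters. A minimal sketch — the projection shapes below are illustrative placeholders, not the model's actual dimensions:

```python
def lora_param_count(shapes, rank=16):
    # Each LoRA pair (A: d_in x r, B: r x d_out) adds r * (d_in + d_out)
    # trainable parameters on top of the frozen base weight.
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Illustrative per-layer shapes (NOT the real model's): four attention
# projections plus up/down MLP projections, over the last 12 layers.
layer_shapes = [(4096, 4096)] * 4 + [(11008, 4096), (4096, 11008)]
total = lora_param_count(layer_shapes * 12)
```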

| Test | v1 Score | Issue | v2 Fix Category |
|---|---|---|---|
| tool_multi_step | 1.00 | Empty response after tool calls | Tool Use — multi-step chains |
| tool_memory_recall | 2.00 | Memory search not triggered | Tool Use — search-before-answering |
| voice_casual | 2.00 | Bossy tone instead of casual | Voice — anti-sycophancy |
| edge_contradiction | 2.03 | Missed contradictory info | Edge Cases — contradiction detection |
Note: tool_multi_step and voice_casual also failed on eval-server (run 1 scored 0.00 and 3.10 respectively). These are high-variance tests with genuine model weaknesses.
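Given that variance, single-run scores on these tests are not very informative; a mean with a spread over the 3 runs is more honest. A minimal sketch of that aggregation (the run scores below are illustrative, not actual measurements):

```python
from statistics import mean, stdev

def aggregate(runs):
    # runs: {test_name: [score_run1, score_run2, ...]}
    # Returns (mean, sample stdev) per test, rounded for reporting.
    return {name: (round(mean(s), 2), round(stdev(s), 2))
            for name, s in runs.items()}

# Illustrative numbers: a high-variance test can swing from a zero to a
# passing score between runs, so the spread matters as much as the mean.
scores = aggregate({"tool_multi_step": [0.00, 1.00, 2.00]})
```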
3.85 local vs eval-server 3.84 — within noise. 4.00 local vs eval-server 3-run average 4.15; the 0.15 gap is driven by 3 high-variance tests that also fail on eval-server.

Fused manually as `W + (A @ B).T * scale` — `mlx_lm.fuse` outputs text-only weights incompatible with mlx-vlm. All shards saved with `metadata={'format': 'mlx'}`.

| Benchmark | SFT v1 | Base | Delta | Sig. |
|---|---|---|---|---|
| GSM8K-CoT (flexible, n=500) | 96.0% | 96.2% | -0.2pp | p=0.87 |
| MMLU-Pro (avg, n=504) | 83.1% | 84.3% | -1.2pp | p=0.61 |
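The p-values in the table are consistent with a pooled two-proportion z-test on the accuracy counts — an assumption about how `Sig.` was computed, but one that reproduces both rows:

```python
from math import sqrt, erfc

def two_prop_p(p1, p2, n1, n2):
    # Pooled two-proportion z-test; returns the two-sided p-value.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))  # two-sided normal tail probability

# GSM8K-CoT row: 96.0% vs 96.2% at n=500 each.
p = two_prop_p(0.960, 0.962, 500, 500)  # ≈ 0.87, i.e. not significant
```

The same function on the MMLU-Pro row (83.1% vs 84.3%, n=504) gives ≈ 0.61, matching the table: neither delta is distinguishable from noise at these sample sizes.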
Training: mlx-lm-lora v1.1.8 on Mac Studio M3 Ultra (96GB). LoRA on all attention + MLP projections in the last 12 layers.

Fusion: `W + (A @ B).T * scale` applied to the VLM base shards, preserving vision tower + audio tower weights.

Serving: mlx-vlm with TurboQuant KV-4bit compression on port 8092. OpenAI-compatible API.

Eval: `claude -p`, 3-run averaged for statistical robustness.
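The manual fusion arithmetic can be sketched in plain NumPy. The real pipeline operates on the safetensors shards of the VLM base, and the shapes below are illustrative; only language-model projections carry adapters, which is why the vision and audio towers pass through untouched:

```python
import numpy as np

def fuse_lora(W, A, B, scale):
    # Merge a LoRA pair into the frozen base weight: W' = W + (A @ B).T * scale
    # Shapes: W (d_out, d_in), A (d_in, r), B (r, d_out).
    return W + (A @ B).T * scale

# Illustrative toy shapes; scale 10.0 taken from the run config above.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
A = rng.standard_normal((4, 2))
B = rng.standard_normal((2, 8))
W_fused = fuse_lora(W, A, B, scale=10.0)
```

Fusing into the base weights (rather than serving the adapter separately) keeps the serving path identical to the stock model: same shard layout, no adapter-aware inference code.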