An isolated sweep over model geometry (depth versus width) at a roughly uniform parameter budget (~5M; the actual counts range from 4.2M to 5.2M) to explore structural limits. We stress-tested deep, sequential stacks against shallow, wide ones over 50,000,000 unique tokens, training in bfloat16 on local accelerator hardware.
Conventional deep-network wisdom holds that depth unlocks abstract, multi-step reasoning circuits. In our experiments the opposite held for Small Language Models (SLMs) under 10M parameters: at a roughly fixed budget, the shallow, wide configuration outperformed the deeper ones on every metric reported below.
All models were pretrained with fixed, tuned hyperparameters. Pretraining loss and perplexity (PPL) track how well each model has acquired the language; lower is better for both.
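
As a quick illustration of how the two metrics relate: if the pretraining loss is the mean cross-entropy in nats per token (an assumption; the loss units are not stated here), token-level perplexity is simply its exponential. Word-level perplexity, as reported for Wikitext, divides the total negative log-likelihood by the word count instead of the token count, so the two numbers are not directly comparable. The function names below are illustrative only.

```python
import math

def token_perplexity(mean_nll_nats: float) -> float:
    """Perplexity from a mean per-token cross-entropy loss given in nats."""
    return math.exp(mean_nll_nats)

def word_perplexity(total_nll_nats: float, num_words: int) -> float:
    """Perplexity normalized by word count rather than token count."""
    return math.exp(total_nll_nats / num_words)

if __name__ == "__main__":
    # Pretrain step losses from the results table (assumed to be nats per token).
    for name, loss in [("deep", 4.539), ("balanced", 4.345), ("shallow", 4.188)]:
        print(f"{name}: token PPL ~ {token_perplexity(loss):.1f}")
```

Because perplexity is exponential in the loss, a small loss gap turns into a much larger perplexity gap.
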
| Structural Metric | Exp 1: Deep & Narrow | Exp 2: Balanced Baseline | Exp 3: Shallow & Wide (🏆 Win) |
|---|---|---|---|
| Layer Configuration | 12 Layers (Hidden: 128) | 6 Layers (Hidden: 192) | 3 Layers (Hidden: 256) |
| Active Model Parameters | 4,197,504 | 5,114,304 | 5,244,672 |
| Pretrain Step Loss (↓) | 4.539 | 4.345 | 4.188 |
| Pretrain Train Loss (↓) | 5.567 | 5.302 | 5.093 |
| ARC-Easy Zero-Shot (↑) | 29.17% | 29.63% | 29.97% |
| Wikitext Word PPL (↓) | 817.8844 | 585.5899 | 418.6314 |
| Compute Throughput (⚡) | 218.7 samples/sec | 282.9 samples/sec | 417.9 samples/sec |
| Overall Assessment | Severe structural bottleneck | Intermediate | Best of the three configurations |
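
The parameter counts in the table can be reproduced with a simple closed form. The breakdown below was inferred by fitting the three reported totals, not taken from the training code, so treat it as an assumption: each layer contributes `16*d^2 + 2*d` parameters (for example, bias-free attention at `4*d^2`, a gated MLP at `12*d^2`, and two norm vectors), plus a tied embedding over an 8,192-entry vocabulary and one final norm vector. The name `estimate_params` is hypothetical.

```python
def estimate_params(n_layers: int, d_model: int, vocab_size: int = 8192) -> int:
    """Estimate parameter count under an assumed bias-free, tied-embedding layout."""
    per_layer = 16 * d_model**2 + 2 * d_model  # assumed: attention + gated MLP + 2 norms
    embedding = vocab_size * d_model           # assumed: input/output embeddings are tied
    final_norm = d_model                       # assumed: one final norm weight vector
    return n_layers * per_layer + embedding + final_norm

if __name__ == "__main__":
    configs = [("deep", 12, 128), ("balanced", 6, 192), ("shallow", 3, 256)]
    for name, layers, hidden in configs:
        print(f"{name}: {estimate_params(layers, hidden):,}")
```

Under these assumptions the formula matches all three reported counts. It also shows that the budget is only approximately uniform: the deep configuration has about 20% fewer parameters than the shallow one.
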
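
Fewer sequential layers mean fewer points where one matrix multiplication must wait on the previous one, so each step runs as fewer, larger matmuls that parallelize better on local Tensor Cores. This is consistent with throughput rising from 218.7 to 417.9 samples/sec as depth drops from 12 layers to 3.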
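
To put the throughput and perplexity gaps on a common scale, the table values can be expressed as ratios against the deep baseline. This is a minimal sketch; the dictionary layout and the names `RESULTS` and `relative_to` are illustrative, and the numbers are copied from the table.

```python
# Values copied from the results table.
RESULTS = {
    "deep_narrow":  {"throughput": 218.7, "word_ppl": 817.8844},
    "balanced":     {"throughput": 282.9, "word_ppl": 585.5899},
    "shallow_wide": {"throughput": 417.9, "word_ppl": 418.6314},
}

def relative_to(results: dict, baseline: str) -> dict:
    """Return each run's speedup and word-PPL ratio relative to `baseline`."""
    base = results[baseline]
    return {
        name: {
            "speedup": run["throughput"] / base["throughput"],
            "ppl_ratio": run["word_ppl"] / base["word_ppl"],
        }
        for name, run in results.items()
    }

if __name__ == "__main__":
    for name, r in relative_to(RESULTS, "deep_narrow").items():
        print(f"{name}: {r['speedup']:.2f}x throughput, {r['ppl_ratio']:.2f}x word PPL")
```

Read this way, the shallow configuration runs at roughly 1.9x the throughput of the deep one at about half its word perplexity.
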