This sweep maps the trade-off between vocabulary size and the parameters available to the model's hidden layers. We held the token volume steady at 50,000,000 across 5 distinct models, each under 7.5M parameters, all built from the same optimized shallow-and-wide architecture template.
Standard LLMs favor large vocabularies (32k–128k) to keep token counts per sequence short. Our sweep documents a "parameter theft" effect once the total budget shrinks below 10M parameters: the vocabulary-dependent weights claim a growing share of the total parameter count.
Downstream metrics are evaluated zero-shot. Wikitext word perplexity (PPL) serves as the primary metric for comparing language-modeling quality across runs.
| Benchmark / Metric | Run 1: 1024 Vocab | Run 2: 2048 Vocab | Run 3: 4096 Vocab (🏆 Peak) | Run 4: 8192 Vocab | Run 5: 16384 Vocab |
|---|---|---|---|---|---|
| Total Active Parameters | 3,409,664 | 3,671,808 | 4,196,096 | 5,244,672 | 7,341,824 |
| Pretrain Train Loss (↓) | 3.614 | 4.172 | 4.598 | 5.063 | 5.409 |
| ARC-Easy Zero-Shot (↑) | 28.37% | 29.67% | 28.32% | 30.77% | 30.93% |
| Wikitext Byte PPL (↓) | 3.7336 | 3.6693 | 3.1566 | 3.0746 | 3.0052 |
| Wikitext Word PPL (↓) | 1146.6974 | 1044.8747 | 467.2369 | 405.9334 | 359.2878 |
| Pretrain Compute Speed (⚡) | 8.43 steps/sec | 8.03 steps/sec | 7.38 steps/sec | 6.95 steps/sec | 5.03 steps/sec |
| EMBEDDING STATUS | Context Fragmentation | Information Degradation | PARETO CEILING | Layer Starvation | Parameter Overflow |
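
The two perplexity rows can be cross-checked against each other. The sketch below assumes the usual definitions, where word PPL and byte PPL exponentiate the same total negative log-likelihood divided by the word count and the byte count respectively; the source does not state which evaluation code produced them. Under that assumption, `ln(word PPL) / ln(byte PPL)` equals the average bytes per word of the test set and should be the same for every run, whatever the tokenizer. The names `RUNS` and `bytes_per_word` are illustrative only.

```python
import math

# (byte PPL, word PPL) pairs copied from the table, keyed by vocabulary size.
RUNS = {
    1024: (3.7336, 1146.6974),
    2048: (3.6693, 1044.8747),
    4096: (3.1566, 467.2369),
    8192: (3.0746, 405.9334),
    16384: (3.0052, 359.2878),
}


def bytes_per_word(byte_ppl: float, word_ppl: float) -> float:
    """Implied bytes per word, assuming both PPLs share one total NLL."""
    return math.log(word_ppl) / math.log(byte_ppl)


ratios = {vocab: bytes_per_word(b, w) for vocab, (b, w) in RUNS.items()}
for vocab, ratio in ratios.items():
    print(f"{vocab:>6} vocab: {ratio:.4f} bytes/word implied")
```

Because both metrics are normalized by tokenizer-independent units, they can be compared across vocabulary sizes. The per-token pretraining loss in the same table cannot be compared this way, since each run's tokens cover a different amount of text.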
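As the vocabulary grows, the parameter volume inside the static lookup blocks grows linearly with it: the differences between runs in the table work out to 256 parameters per added token. Because each run doubles the vocabulary, that block doubles from run to run, and training throughput falls from 8.43 to 5.03 steps/sec.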
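
The parameter arithmetic can be reproduced from the table alone. This is a minimal sketch that fits a per-token cost and a vocabulary-independent remainder from the two smallest runs, then checks that the same two numbers explain all five totals. Whether the per-token cost corresponds to a single tied embedding matrix or to separate input and output matrices is not stated in the source, so the code only speaks of "vocabulary-dependent" parameters. `TOTAL_PARAMS` and `fit_linear` are illustrative names.

```python
# Total parameter counts copied from the table, keyed by vocabulary size.
TOTAL_PARAMS = {
    1024: 3_409_664,
    2048: 3_671_808,
    4096: 4_196_096,
    8192: 5_244_672,
    16384: 7_341_824,
}


def fit_linear(params: dict[int, int]) -> tuple[int, int]:
    """Return (parameters per vocab entry, vocabulary-independent parameters)."""
    (v0, p0), (v1, p1) = sorted(params.items())[:2]
    per_token = (p1 - p0) // (v1 - v0)
    fixed = p0 - per_token * v0
    return per_token, fixed


per_token, fixed = fit_linear(TOTAL_PARAMS)

for vocab, total in sorted(TOTAL_PARAMS.items()):
    vocab_params = per_token * vocab
    # Every run must match the linear model exactly for the fit to hold.
    assert fixed + vocab_params == total
    share = vocab_params / total
    print(f"{vocab:>6} vocab: {vocab_params:>9,} vocab-dependent params ({share:.1%} of total)")
```

The fit holds exactly for all five runs, with the vocabulary-dependent share rising from under 8% of the total at 1024 tokens to about 57% at 16384.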