SupraLabs | The Embedding Bottleneck: Optimal Vocab Scales

Experiment #5:
The Embedding Bottleneck — Vocab Size

This experiment maps the trade-off between vocabulary size and the share of parameters left for the transformer's hidden layers. We trained five sub-7.5M-parameter models on a fixed budget of 50,000,000 tokens, all built from the same shallow-and-wide architecture template.

Navigating the Pareto Frontier for Megabyte Architecture

Standard LLMs favor large vocabularies (32k–128k) to keep tokenized sequences short. Our sweep documents a parameter-budget trade-off that appears once models shrink below 10M parameters:

The Sub-Token Fragmentation Cliff: Shrinking the vocabulary to 1024 or 2048 fractures words into many small sub-word pieces. These fragments fill the 1024-token sequence length quickly, and Wikitext Word Perplexity climbs above 1000 (1146.7 and 1044.9, respectively).
The Embedding Parameter Theft: Expanding the vocabulary to 16,384 lets single tokens cover whole words and common expressions, reducing Word Perplexity to 359.3. However, total parameters rise to 7,341,824, about 115% more than the 1024-vocab run, and all of that growth sits in the embedding lookup layer. The transformer's hidden layers gain nothing, and their share of the parameter budget shrinks.
The 4096 Strategic Equilibrium: At a vocabulary of 4096, most of the fragmentation penalty is gone. Word Perplexity falls by more than half, from 1044.9 to 467.2, while the larger vocabularies add only smaller further gains (405.9 at 8192, 359.3 at 16,384). The model stays at about 4.2M parameters, which leaves more of the budget for the hidden layers.
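The three observations above can be checked with a little arithmetic on the reported numbers. The following is a minimal sketch, not part of the original experiment code: it copies the Word PPL and parameter totals from the results table and prints the percent change between consecutive runs. The helper names (`pct_change`, `step_changes`) are illustrative assumptions.

```python
# Values copied from the results table in this report.
vocab_sizes = [1024, 2048, 4096, 8192, 16384]
total_params = [3_409_664, 3_671_808, 4_196_096, 5_244_672, 7_341_824]
word_ppl = [1146.6974, 1044.8747, 467.2369, 405.9334, 359.2878]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def step_changes(values):
    """Percent change between each pair of consecutive runs."""
    return [pct_change(a, b) for a, b in zip(values, values[1:])]


ppl_steps = step_changes(word_ppl)
param_steps = step_changes(total_params)

# One line per vocabulary doubling: how much PPL was gained, at what parameter cost.
for i, (d_ppl, d_par) in enumerate(zip(ppl_steps, param_steps)):
    print(f"{vocab_sizes[i]:>5} -> {vocab_sizes[i + 1]:>5}: "
          f"word PPL {d_ppl:+6.1f}%, parameters {d_par:+6.1f}%")

# Growth of the largest run relative to the smallest one.
print(f"1024 -> 16384 parameters: {pct_change(total_params[0], total_params[-1]):+.1f}%")
```

On these numbers the 2048 to 4096 step is by far the largest single improvement (roughly a 55% drop in Word PPL), while each later doubling buys a smaller drop at a larger relative parameter cost.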

Unbiased Tokenizer Scaling Data

Downstream metrics are evaluated zero-shot. Word Perplexity (PPL) serves as the primary metric for comparing language-modeling quality across tokenizers.

Benchmark / Metric	Run 1: 1024 Vocab	Run 2: 2048 Vocab	Run 3: 4096 Vocab (🏆 Peak)	Run 4: 8192 Vocab	Run 5: 16384 Vocab
Total Active Parameters	3,409,664	3,671,808	4,196,096	5,244,672	7,341,824
Pretrain Train Loss (↓)	3.614	4.172	4.598	5.063	5.409
ARC-Easy Zero-Shot (↑)	28.37%	29.67%	28.32%	30.77%	30.93%
Wikitext Byte PPL (↓)	3.7336	3.6693	3.1566	3.0746	3.0052
Wikitext Word PPL (↓)	1146.6974	1044.8747	467.2369	405.9334	359.2878
Pretrain Compute Speed (⚡)	8.43 steps/sec	8.03 steps/sec	7.38 steps/sec	6.95 steps/sec	5.03 steps/sec
EMBEDDING STATUS	Context Fragmentation	Information Degradation	PARETO CEILING	Layer Starvation	Parameter Overflow
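A quick consistency check on the two Wikitext rows, offered as a sketch under a stated assumption: if both perplexities are derived from the same total log-likelihood, normalized per byte and per word respectively (the usual definition, though this report does not name its evaluation tooling), then ln(Word PPL) / ln(Byte PPL) equals the average number of bytes per word in the test set. That ratio depends only on the text, so it should be the same for every tokenizer. This is also why these two rows can be compared across vocabulary sizes, whereas a per-token quantity (which the Pretrain Train Loss row presumably is) cannot. The function name `implied_bytes_per_word` is an illustrative assumption.

```python
import math

# Values copied from the results table in this report.
vocab_sizes = [1024, 2048, 4096, 8192, 16384]
byte_ppl = [3.7336, 3.6693, 3.1566, 3.0746, 3.0052]
word_ppl = [1146.6974, 1044.8747, 467.2369, 405.9334, 359.2878]


def implied_bytes_per_word(word_perplexity, byte_perplexity):
    """ASSUMPTION: both metrics share one total log-likelihood, so
    word_ppl == byte_ppl ** (bytes / words) and the log ratio recovers bytes/words."""
    return math.log(word_perplexity) / math.log(byte_perplexity)


ratios = [implied_bytes_per_word(w, b) for w, b in zip(word_ppl, byte_ppl)]

for v, r in zip(vocab_sizes, ratios):
    print(f"vocab {v:>5}: implied bytes per word = {r:.3f}")

# A small spread means the two rows are the same measurement on different scales.
spread = max(ratios) - min(ratios)
print(f"spread across runs: {spread:.4f}")
```

For the values in the table the ratio comes out near 5.35 in every run, which suggests the Byte PPL and Word PPL rows tell the same story and differ only in normalization.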

Total Active Model Parameters vs. Tokenizer Throughput Steps

As vocabulary size scales up, the parameter count of the embedding lookup table grows in direct proportion, doubling with each doubling of the vocabulary, while training throughput falls from 8.43 to 5.03 steps/sec.
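The growth pattern can be read directly off the parameter totals. The sketch below is an illustration rather than the experiment's own code: it divides the parameter difference between consecutive runs by the vocabulary difference. The constant slope it finds is interpreted here as the width of a single (tied) embedding matrix; that interpretation is an inference from the numbers, not something the report states, and the variable names are hypothetical.

```python
# Values copied from the results table in this report.
vocab_sizes = [1024, 2048, 4096, 8192, 16384]
total_params = [3_409_664, 3_671_808, 4_196_096, 5_244_672, 7_341_824]
steps_per_sec = [8.43, 8.03, 7.38, 6.95, 5.03]

# Parameters added per extra vocabulary entry, for each pair of consecutive runs.
slopes = []
for i in range(len(vocab_sizes) - 1):
    d_vocab = vocab_sizes[i + 1] - vocab_sizes[i]
    d_params = total_params[i + 1] - total_params[i]
    slopes.append(d_params // d_vocab)

# ASSUMPTION: a constant slope is read as the width of one tied embedding matrix.
inferred_width = slopes[0]
non_embedding = total_params[0] - inferred_width * vocab_sizes[0]

embedding_share = []
for v, p, s in zip(vocab_sizes, total_params, steps_per_sec):
    emb = inferred_width * v
    share = 100.0 * emb / p
    embedding_share.append(share)
    slowdown = 100.0 * (1.0 - s / steps_per_sec[0])
    print(f"vocab {v:>5}: embedding {emb:>9,} params ({share:4.1f}% of total), "
          f"{slowdown:4.1f}% slower than the 1024 run")

print(f"slopes: {slopes}, implied non-embedding parameters: {non_embedding:,}")
```

Under this reading, every run has the same non-embedding parameter count, and the embedding table's share of the model rises from under 8% at 1024 to about 25% at 4096 and over 57% at 16,384.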

Linguistic Perplexity Collapse vs. Vocabulary Expansion
