Experiment #5:
The Embedding Bottleneck — Vocab Size

This experiment maps the trade-off between vocabulary size and the weights available to the model's hidden layers. We trained five models, each under 7.5M parameters, on a fixed budget of 50,000,000 tokens, all using the same shallow-and-wide architecture template.

// Tokenization_Gaps_&_Structural_Erosion

Navigating the Pareto Frontier for Megabyte Architecture

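Standard LLMs favor large vocabularies (32k–128k) to keep token sequences short. Our sweep shows why this backfires below 10M parameters: every vocabulary entry adds embedding weights, so a larger vocabulary claims a growing share of the parameter budget that would otherwise go to the hidden layers.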

// Vocabulary_Compression_Matrix

Unbiased Tokenizer Scaling Data

Downstream metrics are evaluated zero-shot. Word perplexity (PPL) is the primary metric for comparing language-modeling quality across runs.

| Benchmark / Metric | Run 1: 1024 Vocab | Run 2: 2048 Vocab | Run 3: 4096 Vocab (🏆 Peak) | Run 4: 8192 Vocab | Run 5: 16384 Vocab |
| --- | --- | --- | --- | --- | --- |
| Total Active Parameters | 3,409,664 | 3,671,808 | 4,196,096 | 5,244,672 | 7,341,824 |
| Pretrain Train Loss (↓) | 3.614 | 4.172 | 4.598 | 5.063 | 5.409 |
| ARC-Easy Zero-Shot (↑) | 28.37% | 29.67% | 28.32% | 30.77% | 30.93% |
| Wikitext Byte PPL (↓) | 3.7336 | 3.6693 | 3.1566 | 3.0746 | 3.0052 |
| Wikitext Word PPL (↓) | 1146.6974 | 1044.8747 | 467.2369 | 405.9334 | 359.2878 |
| Pretrain Compute Speed (⚡, steps/sec) | 8.43 | 8.03 | 7.38 | 6.95 | 5.03 |
| EMBEDDING STATUS | Context Fragmentation | Information Degradation | PARETO CEILING | Layer Starvation | Parameter Overflow |

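The Byte PPL and Word PPL rows can be cross-checked against each other. The sketch below assumes both are derived from the same total log-likelihood on the same text, normalized by byte count and word count respectively, which is how common evaluation harnesses define them; the source does not state this. Under that assumption, ln(word PPL) / ln(byte PPL) equals the average number of bytes per word, and it should be the same for every run. The function name `implied_bytes_per_word` is illustrative only.

```python
import math

# Values copied from the results table, keyed by vocabulary size.
BYTE_PPL = {1024: 3.7336, 2048: 3.6693, 4096: 3.1566, 8192: 3.0746, 16384: 3.0052}
WORD_PPL = {1024: 1146.6974, 2048: 1044.8747, 4096: 467.2369,
            8192: 405.9334, 16384: 359.2878}


def implied_bytes_per_word(byte_ppl: float, word_ppl: float) -> float:
    """Bytes per word implied by a (byte PPL, word PPL) pair.

    Assumes word_ppl = exp(-LL / n_words) and byte_ppl = exp(-LL / n_bytes)
    for the same total log-likelihood LL, so the ratio of logs is
    n_bytes / n_words.
    """
    return math.log(word_ppl) / math.log(byte_ppl)


ratios = {v: implied_bytes_per_word(BYTE_PPL[v], WORD_PPL[v]) for v in BYTE_PPL}

for vocab, ratio in ratios.items():
    print(f"vocab {vocab:>5}: {ratio:.3f} bytes per word")
```

All five runs give roughly the same ratio, which suggests the two perplexity rows carry the same information on different scales. Either row can therefore be used for ranking, and both are comparable across tokenizers in a way the per-token train loss is not.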
// Mapping_The_Tokenizer_Trade-offs

[Figure: Linguistic Perplexity Collapse vs. Vocabulary Expansion]

[Figure: Total Active Model Parameters vs. Tokenizer Throughput Steps]

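As the vocabulary grows, the parameter count of the static lookup blocks grows linearly with it: each added vocabulary entry costs 256 parameters in this sweep. Because every run doubles the vocabulary, the increment also doubles at each step, from 262,144 extra parameters between Runs 1 and 2 to 2,097,152 between Runs 4 and 5. Pretraining throughput falls alongside it, from 8.43 to 5.03 steps/sec.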
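The parameter totals in the results table can be split into a vocabulary-dependent part and a fixed part. The sketch below fits that split from the reported totals alone. Attributing the vocabulary-dependent part to the embedding and output lookup tables is an assumption: the source does not give the embedding width or say whether input and output weights are tied. The names `per_token_cost`, `slope`, and `base` are illustrative.

```python
# Vocabulary size -> total active parameters, copied from the results table.
PARAMS = {
    1024: 3_409_664,
    2048: 3_671_808,
    4096: 4_196_096,
    8192: 5_244_672,
    16384: 7_341_824,
}


def per_token_cost(params: dict[int, int]) -> list[float]:
    """Parameters added per extra vocabulary entry between consecutive runs."""
    vocabs = sorted(params)
    return [
        (params[hi] - params[lo]) / (hi - lo)
        for lo, hi in zip(vocabs, vocabs[1:])
    ]


slopes = per_token_cost(PARAMS)

# A constant slope means growth is linear in vocabulary size.
slope = slopes[0]
# Parameters that do not depend on vocabulary size (the intercept).
base = PARAMS[1024] - slope * 1024

# Share of each model taken by the vocabulary-dependent parameters
# (assumed here to be the embedding / output lookup tables).
shares = {v: slope * v / total for v, total in PARAMS.items()}

print(f"per-token cost: {slopes}")
print(f"vocabulary-independent parameters: {base:,.0f}")
for vocab, share in shares.items():
    print(f"vocab {vocab:>5}: {share:.1%} of parameters are vocabulary-dependent")
```

Under this reading, the vocabulary-independent part is identical in all five runs, and the vocabulary-dependent share rises from under a tenth of the model at 1024 entries to a quarter at 4096 and more than half at 16384.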

COMPUTE ALLOCATION: Static S&W Topology Grid
ISOLATED Pretrain BATCH: 50,000,000 Volume Steps
SOTA SELECTION MATRIX: 4096 Balanced Ceiling