A controlled test of data-repetition limits under a strict compute constraint. We held total token exposure fixed at 200,000,000 processed tokens per run and traded unique data volume directly against the number of epochs over that data.
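The budget arithmetic behind this setup can be sketched as follows. This is a minimal illustration, not the training code used for these runs: the names `TOTAL_TOKEN_BUDGET`, `unique_tokens_for`, and `EPOCH_SETTINGS` are hypothetical, and the epoch counts are the ones reported in the results table.

```python
# Hypothetical helper: split a fixed token budget into unique-pool size x epochs.
TOTAL_TOKEN_BUDGET = 200_000_000  # total tokens processed per run (from the text)


def unique_tokens_for(epochs: int, budget: int = TOTAL_TOKEN_BUDGET) -> int:
    """Return the unique-token pool size that exactly fills `budget` in `epochs` passes."""
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    if budget % epochs != 0:
        raise ValueError(f"{budget} tokens cannot be split evenly into {epochs} epochs")
    return budget // epochs


# Epoch counts taken from the results table.
EPOCH_SETTINGS = [1, 2, 4, 5, 8]
configs = {epochs: unique_tokens_for(epochs) for epochs in EPOCH_SETTINGS}

for epochs, unique in configs.items():
    print(f"{unique:>11,} unique tokens x {epochs} epoch(s) = {unique * epochs:,} total")
```

Every configuration multiplies back to the same 200,000,000-token total, so any difference between runs comes from the mix of unique data and repetition rather than from the amount of training.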
Chinchilla scaling laws prescribe growing model size and training tokens in roughly equal proportion as compute increases. Our isolated runs show that, in the sub-10M regime, pretraining loss and downstream capability diverge:
Every configuration is locked to exactly 200M total training tokens. Tracking validation loss alone is misleading, because a model that overfits repeated text can post a lower loss without gaining downstream capability.
| Benchmark / Metric | Run 1: 200M × 1 (🏆 Facts Win) | Run 2: 100M × 2 (🏆 PPL Win) | Run 3: 50M × 4 | Run 4: 40M × 5 | Run 5: 25M × 8 |
|---|---|---|---|---|---|
| Unique Tokens Pool | 200,000,000 | 100,000,000 | 50,000,000 | 40,000,000 | 25,000,000 |
| Training Epochs Block | 1 Epoch | 2 Epochs | 4 Epochs | 5 Epochs | 8 Epochs |
| Final Pretrain Loss (↓) | 3.789 | 3.771 | 3.785 | 3.771 | 3.719 |
| Final Pretrain Train Loss (↓) | 4.240 | 4.229 | 4.235 | 4.225 | 4.196 |
| ARC-Easy Zero-Shot (↑) | 33.42% | 31.57% | 31.82% | 31.69% | 30.93% |
| Wikitext Byte PPL (↓) | 1.4824 | 1.4750 | 1.4851 | 1.4918 | 1.5017 |
| Wikitext Word PPL (↓) | 243.3377 | 236.8014 | 245.8708 | 252.0078 | 261.4054 |
| Pretrain Assessment | Maximum Knowledge | Syntax Enhanced | Recycling Decay | Memorization Masking | Severe Overfitting |
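Crucial observation: recycling data (more epochs over fewer unique tokens) drives pretraining loss to its lowest value at 8 epochs (3.719), yet out-of-distribution Wikitext perplexity worsens steadily beyond 2 epochs, with word PPL rising from 236.80 to 261.41. ARC-Easy accuracy peaks on the single-epoch run (33.42%) and is lowest at 8 epochs (30.93%).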
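One way to see the disagreement between metrics is to pick the best run under each one. The sketch below hard-codes the values from the results table. The names `RUNS`, `METRICS`, and `best_run` are illustrative and not part of any evaluation harness.

```python
# Values transcribed from the results table; run labels are "unique tokens x epochs".
RUNS = ["200M x 1", "100M x 2", "50M x 4", "40M x 5", "25M x 8"]

# metric name -> (value per run, True if lower is better)
METRICS = {
    "Final Pretrain Loss": ([3.789, 3.771, 3.785, 3.771, 3.719], True),
    "Final Pretrain Train Loss": ([4.240, 4.229, 4.235, 4.225, 4.196], True),
    "ARC-Easy Zero-Shot (%)": ([33.42, 31.57, 31.82, 31.69, 30.93], False),
    "Wikitext Byte PPL": ([1.4824, 1.4750, 1.4851, 1.4918, 1.5017], True),
    "Wikitext Word PPL": ([243.3377, 236.8014, 245.8708, 252.0078, 261.4054], True),
}


def best_run(values, lower_is_better):
    """Return the label of the run with the best value for one metric."""
    pick = min if lower_is_better else max
    return RUNS[values.index(pick(values))]


winners = {name: best_run(vals, lower) for name, (vals, lower) in METRICS.items()}

for name, run in winners.items():
    print(f"{name:<28} best run: {run}")
```

On the table's numbers, both loss rows select the 8-epoch run, ARC-Easy selects the single-epoch run, and both Wikitext perplexities select the 2-epoch run. Selecting a configuration by pretraining loss alone would therefore choose the run that scores worst on every downstream metric reported here.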