Experiment #6: More Epochs vs. More Data for SLMs

A controlled test of information limits under a strict compute constraint. We held total token exposure constant at 200,000,000 processed tokens, trading unique data volume directly against repeated epochs over a smaller pool.

// Data_Entropy_&_Reasoning_Loss

Isolating Token Freshness in the Static Latent Block

Chinchilla-style scaling laws prescribe growing training tokens in proportion to model size. Our targeted isolation runs show that, in the sub-10M regime, the lowest pretraining loss and the best downstream capability come from different configurations:

// Information_Density_Matrix

Symmetric Token Matrix Results

Every configuration is locked to exactly 200M total processed tokens. Validation loss alone is deceptive here, because repeating data lets the model overfit to the recycled text.
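The fixed budget is simple arithmetic: unique tokens times epochs must equal 200M for every run. A minimal sketch of that check follows, using the run sizes from the results table. The names `RUNS`, `total_exposure`, and `check_budget` are hypothetical and not taken from the experiment's code.

```python
# Illustrative check of the fixed token budget.
# All names here are hypothetical; only the numbers come from the results table.
TOTAL_TOKEN_BUDGET = 200_000_000

# (unique tokens in the pool, number of epochs) for Runs 1-5
RUNS = {
    "Run 1": (200_000_000, 1),
    "Run 2": (100_000_000, 2),
    "Run 3": (50_000_000, 4),
    "Run 4": (40_000_000, 5),
    "Run 5": (25_000_000, 8),
}


def total_exposure(unique_tokens: int, epochs: int) -> int:
    """Tokens processed when a pool of unique tokens is repeated for `epochs` passes."""
    return unique_tokens * epochs


def check_budget(runs: dict, budget: int) -> dict:
    """Map each run name to True if it processes exactly `budget` tokens."""
    return {
        name: total_exposure(unique, epochs) == budget
        for name, (unique, epochs) in runs.items()
    }


if __name__ == "__main__":
    for name, (unique, epochs) in RUNS.items():
        exposure = total_exposure(unique, epochs)
        print(f"{name}: {unique:,} unique x {epochs} epochs = {exposure:,} tokens")
```

Because every run lands on the same total, any difference in the metrics can be attributed to data freshness rather than to the amount of compute spent.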

| Benchmark / Metric | Run 1: 200M × 1 (🏆 Facts Win) | Run 2: 100M × 2 (🏆 PPL Win) | Run 3: 50M × 4 | Run 4: 40M × 5 | Run 5: 25M × 8 |
|---|---|---|---|---|---|
| Unique Tokens Pool | 200,000,000 | 100,000,000 | 50,000,000 | 40,000,000 | 25,000,000 |
| Training Epochs Block | 1 Epoch | 2 Epochs | 4 Epochs | 5 Epochs | 8 Epochs |
| Final Pretrain Loss (↓) | 3.789 | 3.771 | 3.785 | 3.771 | 3.719 |
| Final Pretrain Train Loss (↓) | 4.240 | 4.229 | 4.235 | 4.225 | 4.196 |
| ARC-Easy Zero-Shot (↑) | 33.42% | 31.57% | 31.82% | 31.69% | 30.93% |
| Wikitext Byte PPL (↓) | 1.4824 | 1.4750 | 1.4851 | 1.4918 | 1.5017 |
| Wikitext Word PPL (↓) | 243.3377 | 236.8014 | 245.8708 | 252.0078 | 261.4054 |
| PRETRAIN ASSESSMENT | MAXIMUM KNOWLEDGE | SYNTAX ENHANCED | Recycling Decay | Memorization Masking | Severe Overfitting |

// Plotting_The_Entropy_Divergence

Factual Logic Degradation vs. Looping Cycles

The Overfitting Paradox: True Language PPL vs. Apparent Training Loss

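Crucial observation: recycling data (increasing epochs) pushes the pretraining loss to its lowest value at 8 epochs (3.719), yet out-of-distribution Wikitext perplexity improves only up to 2 epochs and degrades steadily beyond that (word PPL 236.80 at 2 epochs vs. 261.41 at 8). ARC-Easy accuracy is highest with a single pass over fresh data (33.42%) and lowest at 8 epochs (30.93%).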
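The divergence can be read straight off the reported numbers. The sketch below finds which epoch count wins each metric and how far a metric moves between two runs. The values are copied from the results table; the helper names `best_epochs` and `relative_change` are hypothetical and only illustrate the comparison.

```python
# Illustrative comparison of the reported metrics across runs.
# Numbers are copied from the results table; helper names are hypothetical.
EPOCHS = [1, 2, 4, 5, 8]
PRETRAIN_LOSS = [3.789, 3.771, 3.785, 3.771, 3.719]                   # lower is better
ARC_EASY = [33.42, 31.57, 31.82, 31.69, 30.93]                        # higher is better
WIKITEXT_WORD_PPL = [243.3377, 236.8014, 245.8708, 252.0078, 261.4054]  # lower is better


def best_epochs(values: list, lower_is_better: bool) -> int:
    """Return the epoch count of the run with the best value for a metric."""
    pick = min if lower_is_better else max
    return EPOCHS[values.index(pick(values))]


def relative_change(values: list, from_epochs: int, to_epochs: int) -> float:
    """Percent change of a metric between two runs, identified by epoch count."""
    start = values[EPOCHS.index(from_epochs)]
    end = values[EPOCHS.index(to_epochs)]
    return (end - start) / start * 100.0


if __name__ == "__main__":
    print("Best pretrain loss at epochs:", best_epochs(PRETRAIN_LOSS, True))
    print("Best ARC-Easy at epochs:", best_epochs(ARC_EASY, False))
    print("Best Wikitext word PPL at epochs:", best_epochs(WIKITEXT_WORD_PPL, True))
    print(f"Word PPL change, 2 -> 8 epochs: {relative_change(WIKITEXT_WORD_PPL, 2, 8):+.2f}%")
    print(f"Pretrain loss change, 1 -> 8 epochs: {relative_change(PRETRAIN_LOSS, 1, 8):+.2f}%")
```

Each metric picks a different winner, which is why loss alone is a poor guide to choosing between the runs.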

CONSTANT MATRIX SIZE: 200M Processing Steps
COMPUTE TOPOLOGY: Shallow & Wide SOTA Layout
DATASET ROUTING ENGINE: FineWeb-Edu Target Stream