SupraLabs | Satiating the Latent Space: Unique Tokens vs. Cycles

Experiment #6: More Epochs vs. More Data for SLMs

A controlled test of data repetition under a strict compute constraint. We held total token exposure constant at 200,000,000 processed tokens and traded unique data volume directly against repeated epochs over a smaller pool.
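The budget constraint can be written down as a quick sanity check. The sketch below is illustrative only: the function and variable names are hypothetical, and the (unique tokens, epochs) pairs are copied from the results table further down.

```python
# Fixed budget of processed tokens shared by every run (from the experiment setup).
TOTAL_TOKEN_BUDGET = 200_000_000

# (unique tokens, epochs) for each run, as listed in the results table.
RUNS = {
    "Run 1": (200_000_000, 1),
    "Run 2": (100_000_000, 2),
    "Run 3": (50_000_000, 4),
    "Run 4": (40_000_000, 5),
    "Run 5": (25_000_000, 8),
}


def tokens_processed(unique_tokens: int, epochs: int) -> int:
    """Total tokens the model sees when a pool is repeated for `epochs` passes."""
    return unique_tokens * epochs


def unique_pool_for(epochs: int, budget: int = TOTAL_TOKEN_BUDGET) -> int:
    """Unique-token pool size that exactly fills the budget at a given epoch count."""
    if budget % epochs != 0:
        raise ValueError(f"{epochs} epochs do not divide a budget of {budget} evenly")
    return budget // epochs


if __name__ == "__main__":
    for name, (unique_tokens, epochs) in RUNS.items():
        total = tokens_processed(unique_tokens, epochs)
        print(f"{name}: {unique_tokens:,} x {epochs} = {total:,}")
```

Only epoch counts that divide the budget evenly give an exact match, which is consistent with the choice of 1, 2, 4, 5 and 8 epochs.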

Isolating Token Freshness in the Static Latent Block

Chinchilla-style scaling laws prescribe growing the number of training tokens in proportion to model size. Our isolation runs reveal a divergence between training loss and downstream capability in sub-10M models:

The Logic Divergence Cliff (Run 1): Maximizing unique data exposure (200M unique tokens × 1 epoch) delivers the highest reasoning score, 33.42% zero-shot accuracy on ARC-Easy. Every repeated-data run lands between 30.93% and 31.82%, which suggests that fresh tokens matter most for this benchmark.
The Perplexity Sweet Spot (Run 2): A single repetition (100M unique tokens × 2 epochs) yields the lowest Wikitext perplexity of any run (236.80 word-level, against 243.34 for Run 1). One immediate repeat appears to help tiny architectures reinforce core syntactic patterns.
The Overfitting Illusion (Run 5): Compressing unique data down to 25M tokens and repeating it for 8 full epochs drops training loss to its minimum (4.196). This coincides with the lowest ARC-Easy accuracy (30.93%) and the highest Wikitext word perplexity (261.41), consistent with memorization rather than generalization.

Symmetric Token Matrix Results

Every configuration is locked to exactly 200M total processed tokens. Tracking pretraining loss alone is deceptive: the run with the lowest loss (Run 5) also has the worst ARC-Easy accuracy and the worst Wikitext perplexity.

Benchmark / Metric	Run 1: 200M × 1 (🏆 Facts Win)	Run 2: 100M × 2 (🏆 PPL Win)	Run 3: 50M × 4	Run 4: 40M × 5	Run 5: 25M × 8
Unique Tokens Pool	200,000,000	100,000,000	50,000,000	40,000,000	25,000,000
Training Epochs Block	1 Epoch	2 Epochs	4 Epochs	5 Epochs	8 Epochs
Final Pretrain Loss (↓)	3.789	3.771	3.785	3.771	3.719
Final Pretrain Train Loss (↓)	4.240	4.229	4.235	4.225	4.196
ARC-Easy Zero-Shot (↑)	33.42%	31.57%	31.82%	31.69%	30.93%
Wikitext Byte PPL (↓)	1.4824	1.4750	1.4851	1.4918	1.5017
Wikitext Word PPL (↓)	243.3377	236.8014	245.8708	252.0078	261.4054
Pretrain Assessment	Maximum Knowledge	Syntax Enhanced	Recycling Decay	Memorization Masking	Severe Overfitting
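To make the table easier to interrogate, the rows can be loaded into plain records and each metric asked for its winning run. This is a minimal sketch: the dictionary keys and the `best_run` helper are hypothetical names, and the numbers are transcribed from the table above.

```python
# One record per run, transcribed from the results table.
RESULTS = [
    {"run": "Run 1", "epochs": 1, "pretrain_loss": 3.789, "train_loss": 4.240,
     "arc_easy": 33.42, "byte_ppl": 1.4824, "word_ppl": 243.3377},
    {"run": "Run 2", "epochs": 2, "pretrain_loss": 3.771, "train_loss": 4.229,
     "arc_easy": 31.57, "byte_ppl": 1.4750, "word_ppl": 236.8014},
    {"run": "Run 3", "epochs": 4, "pretrain_loss": 3.785, "train_loss": 4.235,
     "arc_easy": 31.82, "byte_ppl": 1.4851, "word_ppl": 245.8708},
    {"run": "Run 4", "epochs": 5, "pretrain_loss": 3.771, "train_loss": 4.225,
     "arc_easy": 31.69, "byte_ppl": 1.4918, "word_ppl": 252.0078},
    {"run": "Run 5", "epochs": 8, "pretrain_loss": 3.719, "train_loss": 4.196,
     "arc_easy": 30.93, "byte_ppl": 1.5017, "word_ppl": 261.4054},
]

# Direction of improvement for each metric, matching the arrows in the table.
LOWER_IS_BETTER = {
    "pretrain_loss": True,
    "train_loss": True,
    "arc_easy": False,
    "byte_ppl": True,
    "word_ppl": True,
}


def best_run(metric: str) -> str:
    """Return the name of the run with the best value for `metric`."""
    pick = min if LOWER_IS_BETTER[metric] else max
    return pick(RESULTS, key=lambda row: row[metric])["run"]


if __name__ == "__main__":
    for metric in LOWER_IS_BETTER:
        print(f"{metric}: {best_run(metric)}")
```

Run this way, both loss rows pick Run 5, ARC-Easy picks Run 1, and both Wikitext perplexities pick Run 2, so no single run wins every metric.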

The Overfitting Paradox: True Language PPL vs. Apparent Training Loss

Crucial observation: recycling data (increasing epochs) pushes training loss to its lowest value (4.196 at 8 epochs), yet held-out Wikitext perplexity degrades steadily beyond 2 epochs, with word perplexity rising from 236.80 to 261.41.
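The size of the gap is easier to see as a relative change against the single-epoch run. The following sketch uses hypothetical helper names and only the training-loss and word-perplexity values reported in the results table.

```python
# (epochs, final pretrain train loss, Wikitext word PPL) copied from the results table.
ROWS = [
    (1, 4.240, 243.3377),
    (2, 4.229, 236.8014),
    (4, 4.235, 245.8708),
    (5, 4.225, 252.0078),
    (8, 4.196, 261.4054),
]


def pct_change(value: float, baseline: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


def relative_to_single_epoch(rows):
    """Express each run's loss and perplexity as a % change versus the 1-epoch run."""
    _, base_loss, base_ppl = rows[0]
    return [
        (epochs, pct_change(loss, base_loss), pct_change(ppl, base_ppl))
        for epochs, loss, ppl in rows
    ]


if __name__ == "__main__":
    for epochs, d_loss, d_ppl in relative_to_single_epoch(ROWS):
        print(f"{epochs} epoch(s): train loss {d_loss:+.2f}%, word PPL {d_ppl:+.2f}%")
```

By this arithmetic, the 8-epoch run ends with roughly 1% lower training loss than the single-epoch run but roughly 7% higher word perplexity, while the 2-epoch run is the only one that improves on both.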
