An empirical evaluation of structured knowledge injection schedules inside parameter-constrained spaces. We subjected a optimized 4,196,096 parameter architecture to a rigid budget of 200,000,000 total tokens to test multi-distribution blending against raw uniform pooling.
When training extremely compact language models, data sequencing determines memory stabilization boundaries. Our comparative routing sweep demonstrates that standard dataset pooling bottlenecks the structural capability of micro-networks:
All training pipelines utilize an identical 4096 vocabulary architecture. Downstream tracking processed zero-shot. Deceptive optimization targets (apparent pretrain loss) are directly falsified by true language modeling benchmarks.
| Benchmark / Metric | Run 1: Shifting Weights (🏆 Win) | Run 2: Hard Sequential Blocks | Run 3: Uniform Baseline Mix |
|---|---|---|---|
| Routing Architecture | Dynamic Weight Stepper | Static Linear Switches | Star Blend Pooling |
| Final Pretrain Step Loss (↓) | 2.770 | 2.458 | 2.690 |
| Final Pretrain Train Loss (↓) | 3.151 | 2.817 | 3.114 |
| ARC-Easy Zero-Shot (↑) | 29.71% | 30.22% | 30.18% |
| ARC-Easy Acc Norm (↑) | 30.26% | 30.64% | 30.47% |
| Wikitext Byte PPL (↓) | 3.1392 | 3.3410 | 3.2618 |
| Wikitext Word PPL (↓) | 453.6715 | 632.9729 | 556.7454 |
| PRETRAIN COHERENCE | SOTA BALANCE ARCHIEVED | Severe Memory Corruption | Information Dilution |
Critical discovery: Rigid data blocking pushes the localized training loss down artificially, but causes severe out-of-distribution language collapse on core evaluation matrices.