Experiment #7:
Information Routing — Curriculum Shifting

An empirical evaluation of structured knowledge injection schedules for parameter-constrained models. We trained an optimized 4,196,096-parameter architecture on a fixed budget of 200,000,000 total tokens to compare scheduled multi-distribution blending against raw uniform pooling.

// Syntactic_Preservation_&_Memory_Retainment

Curing Catastrophic Forgetting via Shifting Weights

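When training extremely compact language models, the order in which data sources are presented strongly affects how much earlier material the model retains. Our comparative routing sweep across three schedules (shifting weights, hard sequential blocks, and a uniform mix) indicates that the choice of schedule changes the held-out perplexity of micro-networks substantially, and that standard dataset pooling leaves performance on the table.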
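The three routing strategies compared here can be sketched as functions that map a training step to sampling weights over data sources. This is a minimal illustration under stated assumptions, not the experiment's actual scheduler: the source names, the triangular interpolation, and the `floor` parameter are all hypothetical, since the report does not specify its datasets or the exact shape of the weight schedule.

```python
import random
from typing import Dict, List

# Hypothetical source names; the report does not list its datasets.
SOURCES: List[str] = ["source_a", "source_b", "source_c"]


def uniform_mix(step: int, total_steps: int) -> Dict[str, float]:
    """Uniform pooling: every source has the same weight at every step."""
    return {name: 1.0 / len(SOURCES) for name in SOURCES}


def hard_sequential(step: int, total_steps: int) -> Dict[str, float]:
    """Hard blocks: each source gets one contiguous slice of training."""
    block = min(step * len(SOURCES) // total_steps, len(SOURCES) - 1)
    return {name: 1.0 if i == block else 0.0 for i, name in enumerate(SOURCES)}


def shifting_weights(step: int, total_steps: int, floor: float = 0.1) -> Dict[str, float]:
    """Shifting weights: emphasis slides from the first source to the last.

    The `floor` (an assumed value) keeps every source at a nonzero weight,
    so earlier material keeps being replayed instead of being dropped.
    """
    position = step / max(total_steps - 1, 1) * (len(SOURCES) - 1)
    raw = [max(0.0, 1.0 - abs(position - i)) + floor for i in range(len(SOURCES))]
    total = sum(raw)
    return {name: r / total for name, r in zip(SOURCES, raw)}


def pick_source(weights: Dict[str, float], rng: random.Random) -> str:
    """Sample the source that supplies the next batch."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

The difference that matters for forgetting is visible in the weights themselves: `hard_sequential` assigns zero weight to every source except the current block, while `shifting_weights` never lets a source fall to zero.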

// Knowledge_Injection_Routing_Matrix

Empirical Scheduling Analytics

All three training runs use the same model architecture and the same 4,096-token vocabulary. Downstream benchmarks were evaluated zero-shot. The apparent pretraining loss proves to be a deceptive optimization target: the run with the lowest pretraining loss does not achieve the best language modeling perplexity.

| Benchmark / Metric | Run 1: Shifting Weights (🏆 Win) | Run 2: Hard Sequential Blocks | Run 3: Uniform Baseline Mix |
|---|---|---|---|
| Routing Architecture | Dynamic Weight Stepper | Static Linear Switches | Star Blend Pooling |
| Final Pretrain Step Loss (↓) | 2.770 | 2.458 | 2.690 |
| Final Pretrain Train Loss (↓) | 3.151 | 2.817 | 3.114 |
| ARC-Easy Zero-Shot (↑) | 29.71% | 30.22% | 30.18% |
| ARC-Easy Acc Norm (↑) | 30.26% | 30.64% | 30.47% |
| Wikitext Byte PPL (↓) | 3.1392 | 3.3410 | 3.2618 |
| Wikitext Word PPL (↓) | 453.6715 | 632.9729 | 556.7454 |
| Pretrain Coherence | SOTA Balance Achieved | Severe Memory Corruption | Information Dilution |

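The byte-level and word-level Wikitext perplexities in the table can be cross-checked against each other. Assuming both are computed from the same total negative log-likelihood, normalized by bytes in one case and by words in the other (as common evaluation harnesses do), the ratio ln(word PPL) / ln(byte PPL) equals the average number of bytes per word in the test set and should be the same for every run. The sketch below applies that check to the reported numbers; the dictionary keys are illustrative labels only.

```python
import math

# (byte_ppl, word_ppl) pairs copied from the results table.
WIKITEXT = {
    "run1_shifting_weights": (3.1392, 453.6715),
    "run2_hard_sequential": (3.3410, 632.9729),
    "run3_uniform_mix": (3.2618, 556.7454),
}


def implied_bytes_per_word(byte_ppl: float, word_ppl: float) -> float:
    """word_ppl = byte_ppl ** (bytes / words), so the exponent is a log ratio."""
    return math.log(word_ppl) / math.log(byte_ppl)


RATIOS = {run: implied_bytes_per_word(*pair) for run, pair in WIKITEXT.items()}

for run, ratio in RATIOS.items():
    print(f"{run}: {ratio:.3f} bytes per word")
```

All three runs give roughly 5.35 bytes per word, so the two perplexity rows are internally consistent and rank the runs identically.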
// Visualizing_Routing_Mechanics

Linguistic Perplexity Splintering Across Routing Approaches

The Apparent Loss Paradox: Pretrain Loss vs. True Structural Perplexity

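Critical discovery: Rigid data blocking pushes the final training loss down artificially but degrades held-out language modeling. Run 2 (hard sequential blocks) reaches the lowest final pretrain step loss (2.458) yet has the worst Wikitext word perplexity (632.97), while Run 1 (shifting weights) has the highest step loss (2.770) and the best word perplexity (453.67). ARC-Easy accuracy, by contrast, varies by only about half a percentage point across the three runs.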
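The inversion between pretraining loss and held-out perplexity can be made explicit by ranking the runs on both metrics. The snippet below uses only the figures reported in the table; the helper names are illustrative.

```python
# Final pretrain step loss and Wikitext word perplexity from the results table.
RUNS = {
    "Run 1: Shifting Weights": {"step_loss": 2.770, "word_ppl": 453.6715},
    "Run 2: Hard Sequential Blocks": {"step_loss": 2.458, "word_ppl": 632.9729},
    "Run 3: Uniform Baseline Mix": {"step_loss": 2.690, "word_ppl": 556.7454},
}
BASELINE = "Run 3: Uniform Baseline Mix"


def rank_by(metric: str) -> list:
    """Run names ordered from best (lowest) to worst (highest) on `metric`."""
    return sorted(RUNS, key=lambda run: RUNS[run][metric])


def change_vs_baseline(run: str, metric: str) -> float:
    """Relative change of `metric` for `run` compared with the uniform baseline."""
    return RUNS[run][metric] / RUNS[BASELINE][metric] - 1.0


print("by step loss:", rank_by("step_loss"))
print("by word PPL: ", rank_by("word_ppl"))
for run in RUNS:
    if run != BASELINE:
        print(
            f"{run}: step loss {change_vs_baseline(run, 'step_loss'):+.1%}, "
            f"word PPL {change_vs_baseline(run, 'word_ppl'):+.1%}"
        )
```

The two rankings are exact reverses of each other. Relative to the uniform baseline, hard sequential blocks lower the step loss by about 8.6% while raising word perplexity by about 13.7%; shifting weights raise the step loss by about 3.0% while lowering word perplexity by about 18.5%.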

TOTAL COMPUTE EXPOSURE: 200M Tokens
MODEL GEOMETRY PROFILE: 3-Layer Shallow Wide
SCHEDULING RESOLUTION: 4096 Vocab