Experiment #4:
Hidden Topology — Depth vs. Width

An isolated geometric sweep that allocates a roughly uniform parameter budget (~5M; actual counts range from 4.2M to 5.2M) across three depth/width layouts to explore structural limits. We stress-tested deep, sequential layer stacks against shallow, wide ones over 50,000,000 unique tokens on local bfloat16 hardware accelerators.
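To illustrate how depth and width trade off under a fixed budget, here is a minimal sketch that estimates block parameters with the common 12·L·d² rule of thumb. This rests on assumptions the source does not confirm: a standard transformer block with about 4·d² attention weights and 8·d² feed-forward weights, with embeddings, biases, and norms ignored. It is not expected to reproduce the reported parameter counts, and the function name `approx_block_params` is hypothetical.

```python
# Rough non-embedding parameter estimate for a stack of transformer blocks.
# Assumption: ~4*d^2 attention weights + ~8*d^2 feed-forward weights per layer
# (FFN ratio 4), i.e. ~12*d^2 per layer. Biases, norms, and embeddings are
# ignored. The source does not state the actual block design.

def approx_block_params(num_layers: int, hidden: int) -> int:
    """Approximate weight count for num_layers blocks of width hidden."""
    return 12 * num_layers * hidden * hidden

# (layers, hidden) pairs taken from the sweep's layer configurations.
configs = {
    "deep_narrow": (12, 128),
    "balanced": (6, 192),
    "shallow_wide": (3, 256),
}

estimates = {
    name: approx_block_params(layers, hidden)
    for name, (layers, hidden) in configs.items()
}

for name, count in estimates.items():
    print(f"{name:>12}: ~{count:,} block parameters")
```

Under this approximation the 12-layer and 3-layer stacks hold the same number of block weights, because halving depth twice while doubling width keeps L·d² constant. The gap to the reported totals presumably comes from embeddings and other terms that this sketch does not model.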

// Structural_Discoveries_&_Compute_Autobahns

Shallow & Wide Monopolizes the Megabyte Scale

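Conventional deep-network wisdom holds that depth unlocks abstract, multi-step reasoning. For Small Language Models (SLMs) under 10M parameters, our sweep points the other way: the shallowest, widest configuration led on every reported metric, with the lowest pretraining loss, the lowest Wikitext perplexity, the highest ARC-Easy score, and the highest throughput of the three layouts tested.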

// Topology_Evaluation_Data

Controlled Architectural Sweeps

All models were pretrained with tuned hyperparameters held constant across the sweep. Lower perplexity (PPL) and lower pretraining loss indicate better language modeling.

| Structural Metric | Exp 1: Deep & Narrow | Exp 2: Balanced Baseline | Exp 3: Shallow & Wide (🏆 Win) |
|---|---|---|---|
| Layer Configuration | 12 Layers (Hidden: 128) | 6 Layers (Hidden: 192) | 3 Layers (Hidden: 256) |
| Active Model Parameters | 4,197,504 | 5,114,304 | 5,244,672 |
| Pretrain Step Loss (↓) | 4.539 | 4.345 | 4.188 |
| Pretrain Train Loss (↓) | 5.567 | 5.302 | 5.093 |
| ARC-Easy Zero-Shot (↑) | 29.17% | 29.63% | 29.97% |
| Wikitext Word PPL (↓) | 817.8844 | 585.5899 | 418.6314 |
| Compute Throughput (⚡) | 218.7 samples/sec | 282.9 samples/sec | 417.9 samples/sec |
| Efficiency Matrix | Severe Structural Bottleneck | Intermediate Compression | SOTA Topology Sweet Spot |
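To put the quality gaps in proportion, the following sketch computes relative changes between the deep-and-narrow and shallow-and-wide runs. The numbers are copied from the results above. The dictionary keys and the helper `relative_change` are illustrative names, not part of the original experiment code.

```python
# Values copied from the reported results; keys are illustrative labels.
results = {
    "deep_narrow":  {"step_loss": 4.539, "arc_easy_pct": 29.17, "wikitext_ppl": 817.8844},
    "balanced":     {"step_loss": 4.345, "arc_easy_pct": 29.63, "wikitext_ppl": 585.5899},
    "shallow_wide": {"step_loss": 4.188, "arc_easy_pct": 29.97, "wikitext_ppl": 418.6314},
}

def relative_change(new: float, old: float) -> float:
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old

deep = results["deep_narrow"]
wide = results["shallow_wide"]

ppl_change = relative_change(wide["wikitext_ppl"], deep["wikitext_ppl"])
loss_change = relative_change(wide["step_loss"], deep["step_loss"])
# ARC-Easy is reported in percent, so the difference is in percentage points.
arc_gain_points = wide["arc_easy_pct"] - deep["arc_easy_pct"]

print(f"Wikitext word PPL change: {ppl_change:+.1%}")
print(f"Pretrain step loss change: {loss_change:+.1%}")
print(f"ARC-Easy gain: {arc_gain_points:+.2f} points")
```

Wikitext word perplexity falls by roughly 49% from the deepest to the shallowest layout, while ARC-Easy moves by only 0.8 percentage points, so the perplexity and loss figures carry most of the evidence in this sweep.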
// Visualizing_Structural_Performance

Downstream Perplexity Collapse vs. Topology

The Hardware Advantage: Active Processing Speeds

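Fewer sequential layers mean fewer dependent matrix operations per forward pass, which leaves more of the work available to run in parallel on local Tensor Cores. Measured throughput rises from 218.7 samples/sec at 12 layers to 417.9 samples/sec at 3 layers.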
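The throughput advantage can be expressed as a speedup relative to the deepest model. This is a small sketch using the reported samples/sec figures; the variable names are illustrative.

```python
# Reported throughput in samples/sec, keyed by number of layers.
throughput = {12: 218.7, 6: 282.9, 3: 417.9}

# Use the 12-layer (deep and narrow) run as the baseline.
baseline = throughput[12]
speedup = {layers: rate / baseline for layers, rate in throughput.items()}

for layers in sorted(speedup, reverse=True):
    print(f"{layers:>2} layers: {speedup[layers]:.2f}x the 12-layer throughput")
```

On these figures the 3-layer model processes about 1.9 times as many samples per second as the 12-layer model, despite carrying more parameters.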

Compute Tensor Engine: local bfloat16 hardware run
Constant Search Engine: Optuna, peak LR 0.001178
Optimal Layout Profile: 3-layer wide configuration