SupraLabs | Hidden Topology: Depth vs. Width Scaling for SLMs

Experiment #4:
Hidden Topology — Depth vs. Width

An isolated sweep over model geometry, holding the parameter budget roughly uniform (~5M) to probe structural limits. We stress-tested deep, sequential layer stacks against shallow, wide ones over 50,000,000 unique tokens on a local hardware accelerator in bfloat16.

Shallow & Wide Monopolizes the Megabyte Scale

The conventional assumption is that depth unlocks abstract, multi-step reasoning circuits. In our sweep, the opposite held for Small Language Models (SLMs) under 10M parameters:

The Bottleneck of Extreme Depth: Forcing a 4.2M-parameter model into a 12-layer structure shrinks the hidden state (Hidden: 128) down to an informational choke point. The layers are too narrow to represent many textual variations at once.
Massive Parallel Compute Highways: Cutting the architecture to 3 layers while widening each layer (Hidden: 256) gives every layer more representational capacity. The wider weight matrices capture token dependencies in fewer sequential steps.
Hardware Acceleration Exploitation: With fewer sequentially dependent layers, GPU execution pipelines spend less time waiting on layer-to-layer dependencies. Model throughput rises by 91% over the 12-layer configuration (218.7 to 417.9 samples/sec), delivering the best performance of the three configurations at the lowest energy footprint.
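The 91% figure can be checked against the throughput row of the results table. The snippet below is a small sanity check, not part of the original experiment code; the dictionary keys and the `relative_gain` helper are names chosen here for illustration.

```python
# Throughput figures (samples/sec) as reported in the results table.
throughput = {
    "deep_narrow": 218.7,   # 12 layers, hidden 128
    "balanced": 282.9,      # 6 layers, hidden 192
    "shallow_wide": 417.9,  # 3 layers, hidden 256
}


def relative_gain(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return (new / old - 1.0) * 100.0


gain_vs_deep = relative_gain(throughput["shallow_wide"], throughput["deep_narrow"])
gain_vs_balanced = relative_gain(throughput["shallow_wide"], throughput["balanced"])

print(f"vs. deep & narrow: +{gain_vs_deep:.1f}%")
print(f"vs. balanced:      +{gain_vs_balanced:.1f}%")
```

The gain is measured against the deep and narrow run; against the balanced baseline the same calculation gives a smaller increase.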

Controlled Architectural Sweeps

All three configurations were pretrained with the same tuned hyperparameters, held constant across runs. Perplexity (PPL) and pretraining loss track how well each model has acquired the language; lower is better for both.
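For readers relating the two metrics: perplexity is the exponential of the mean negative log-likelihood, and a word-level perplexity normalizes by word count rather than token count, which is why it can be far larger than the exponential of a per-token loss. The sketch below assumes the loss is a mean cross-entropy in nats; the report does not state which evaluation tool was used, and the numbers in the example are hypothetical, not taken from the experiment.

```python
import math


def perplexity(total_nll: float, num_units: int) -> float:
    """exp of the mean negative log-likelihood (in nats) per unit."""
    return math.exp(total_nll / num_units)


# Hypothetical numbers for illustration only; not taken from the experiment.
total_nll = 1200.0  # summed NLL over a passage, in nats
num_tokens = 300    # tokenizer tokens in the passage
num_words = 200     # whitespace-separated words in the same passage

token_ppl = perplexity(total_nll, num_tokens)  # normalized per token
word_ppl = perplexity(total_nll, num_words)    # normalized per word

print(f"token-level PPL: {token_ppl:.1f}")
print(f"word-level PPL:  {word_ppl:.1f}")
```

Because a passage has fewer words than tokens, the same total loss yields a higher word-level perplexity, so the Wikitext word PPL values are not directly comparable to the exponential of the pretraining loss.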

Structural Metric	Exp 1: Deep & Narrow	Exp 2: Balanced Baseline	Exp 3: Shallow & Wide (🏆 Win)
Layer Configuration	12 Layers (Hidden: 128)	6 Layers (Hidden: 192)	3 Layers (Hidden: 256)
Active Model Parameters	4,197,504	5,114,304	5,244,672
Pretrain Step Loss (↓)	4.539	4.345	4.188
Pretrain Train Loss (↓)	5.567	5.302	5.093
ARC-Easy Zero-Shot (↑)	29.17%	29.63%	29.97%
Wikitext Word PPL (↓)	817.8844	585.5899	418.6314
Compute Throughput (⚡)	218.7 samples/sec	282.9 samples/sec	417.9 samples/sec
EFFICIENCY MATRIX	Severe Structural Bottleneck	Intermediate Compression	SOTA TOPOLOGY SWEETSPOT
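One way to see why these three shapes sit in a similar budget is the common rule of thumb that a standard transformer block holds about 12·d² weights (4·d² for attention projections, 8·d² for an MLP with 4x expansion). This is an assumption: the report does not specify the block design, and the estimate ignores embeddings, biases, and normalization layers, so it will not reproduce the parameter counts in the table. The function name and dictionary keys are illustrative.

```python
def approx_block_params(num_layers: int, hidden: int) -> int:
    """Rule-of-thumb weight count for the transformer blocks only.

    Assumes 4*d^2 attention weights plus 8*d^2 MLP weights per layer
    (4x MLP expansion). Embeddings, biases, and norms are ignored.
    """
    return 12 * num_layers * hidden * hidden


# (layers, hidden) pairs taken from the "Layer Configuration" row.
configs = {
    "deep_narrow": (12, 128),
    "balanced": (6, 192),
    "shallow_wide": (3, 256),
}

estimates = {name: approx_block_params(*shape) for name, shape in configs.items()}

for name, count in estimates.items():
    print(f"{name:>13}: ~{count:,} block weights")
```

Under this assumption, cutting depth by 4x while doubling width leaves the block weight count unchanged, since the count scales with layers times hidden squared. The differences in the reported totals would then come from components the estimate leaves out.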

The Hardware Advantage: Active Processing Speeds

Fewer sequential layers mean fewer synchronization points between matrix operations, which allows greater execution parallelism on the local GPU's Tensor Cores.
