From Switch Transformers to DeepSeek-V2: how conditional computation reshapes large-scale AI
Over the past five years, model scaling has followed a predictable recipe: double the parameters and wait for better results. But dense scaling quickly hits physical and financial limits.
A 70-billion-parameter dense model needs roughly 140 GB of memory just to hold FP16 weights, and training it with optimizer state pushes the footprint past a terabyte. Inference cost grows linearly with parameter count, even though much of the network isn't needed for a given input.
Observation: most inputs only need a small subset of model parameters to be handled well. Scoring the sentiment of a movie review doesn't need the same neurons that decode protein structures. That insight gave rise to Mixture-of-Experts (MoE) architectures.
Instead of pushing every token through every feed-forward block, MoE replaces those blocks with a set of experts (independent sub-modules) and a router that dynamically decides which experts to activate for each token.
Let's denote the experts as E₁, …, E_N and the router's gating weights as g₁(x), …, g_N(x). The layer's output for a token x is

y(x) = Σᵢ gᵢ(x) · Eᵢ(x),

where gᵢ(x) ≈ 0 for most experts (sparse gating).
If only k experts are activated per token:

y(x) = Σ_{i ∈ TopK(g(x), k)} gᵢ(x) · Eᵢ(x),

so per-token compute scales with k rather than with the total expert count N.
Hence, a model can hold billions of parameters but run like a much smaller one.
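To make that concrete, here is a back-of-the-envelope sketch in Python; the layer sizes and expert counts below are illustrative assumptions, not figures from any specific model:

```python
# Illustrative, made-up configuration for a single MoE layer.
d_model, d_ff = 4096, 16384      # hidden size and expert FFN width (assumed)
num_experts, k = 64, 2           # experts stored vs. experts used per token

params_per_expert = 2 * d_model * d_ff          # two weight matrices per FFN expert
total_expert_params = num_experts * params_per_expert
active_expert_params = k * params_per_expert

print(f"stored expert params : {total_expert_params / 1e9:.2f} B")
print(f"active per token     : {active_expert_params / 1e9:.2f} B "
      f"({100 * k / num_experts:.1f}% of the layer)")
```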
Component | Function |
---|---|
Router / Gating Network | Chooses top-k experts for each token |
Experts | Independent FFNs or modules, often replicated per layer |
Load Balancer | Ensures tokens are distributed fairly (avoids "hot" experts) |
Sparse Dispatch / Combine | Efficiently route inputs to selected experts and aggregate outputs |
The Switch Transformer is the simplest and most elegant variant: one expert per token (k = 1).
The router picks the top-1 expert via softmax gating:

i*(x) = argmaxᵢ softmax(x·W_g)ᵢ,   y(x) = g_{i*}(x) · E_{i*}(x).
During training, a load-balancing loss encourages uniform utilization:

L_aux = c · N · Σᵢ fᵢ · Pᵢ,

where c is a small coefficient, fᵢ is the fraction of tokens routed to expert i, and Pᵢ is the mean router probability assigned to it.
The result: a 1.6-trillion-parameter model trained with roughly the per-token compute of a 10B-parameter dense transformer.
Advantages: fast, simple, and scales to very large parameter counts with near-constant per-token compute.
Drawback: an information bottleneck if the wrong expert is picked.
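As a rough sketch of what top-1 gating looks like in code (a simplification that ignores capacity limits, jitter noise, and the auxiliary loss; `w_gate` is an assumed router weight matrix):

```python
import torch
import torch.nn.functional as F

def switch_route(x: torch.Tensor, w_gate: torch.Tensor):
    """Top-1 (Switch-style) routing: each token goes to exactly one expert,
    scaled by that expert's gate probability.

    x:      [tokens, d_model] token representations
    w_gate: [d_model, num_experts] router weights
    """
    probs = F.softmax(x @ w_gate, dim=-1)    # [tokens, num_experts]
    gate, expert_idx = probs.max(dim=-1)     # top-1 probability and expert id per token
    return gate, expert_idx                  # the chosen expert's output is scaled by `gate`
```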
GLaM extends Switch with top-2 experts per token and a weighted combination of their outputs.
It uses expert parallelism (each GPU hosts a different subset of experts).
Total parameters: 1.2T; active per token: 97B, i.e. roughly 12× fewer active parameters than the full model.
Training adds a load-balancing loss of the same form as the Switch auxiliary loss, keeping traffic to the two selected experts even.
- Groups of experts specialize by task or modality (code, math, vision).
- A top-level router decides the expert group; a sub-router picks within it.
- Small experts merge their gradients periodically to share knowledge.
Reported 10× throughput and comparable quality to GPT-4-class models.
Goal: See how routing & load balancing work in practice.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model=512, num_experts=4, k=2):
        super().__init__()
        self.num_experts, self.k = num_experts, k
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 2048),
                nn.ReLU(),
                nn.Linear(2048, d_model),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: [batch, seq, d_model]
        logits = self.gate(x)                                   # [B, S, E]
        scores = F.softmax(logits, dim=-1)                      # router probabilities
        weights, indices = torch.topk(scores, self.k, dim=-1)   # both [B, S, k]
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

        out = torch.zeros_like(x)
        # dispatch: run each expert only on the tokens routed to it
        for e, expert in enumerate(self.experts):
            hit = indices == e                                  # [B, S, k] bool
            token_mask = hit.any(dim=-1)                        # [B, S] tokens using expert e
            if not token_mask.any():
                continue
            gate_w = (weights * hit).sum(dim=-1)                # [B, S] gate weight for expert e
            expert_out = expert(x[token_mask])                  # [n_tokens, d_model]
            out[token_mask] += gate_w[token_mask].unsqueeze(-1) * expert_out
        return out
```
This toy version activates k experts per token. In practice, frameworks like DeepSpeed-MoE, Fairseq-MoE, and Megatron-LM implement optimized all-to-all communication for dispatching tokens across GPUs.
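A quick sanity check of the toy module (shapes chosen arbitrarily):

```python
moe = SimpleMoE(d_model=512, num_experts=4, k=2)
x = torch.randn(2, 16, 512)      # [batch, seq, d_model]
y = moe(x)
print(y.shape)                   # torch.Size([2, 16, 512])
```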
Without constraints, some experts dominate; others never train. Common balancing techniques:
Method | Idea | Equation / Mechanism |
---|---|---|
Auxiliary Loss | Penalize uneven traffic | L_aux = c · N · Σᵢ fᵢ · Pᵢ |
Noise Jitter | Adds randomness to gate logits | g(x) = softmax(x·W_g + ε) |
Token Drop | Skip overflow tokens to cap load | Ensures deterministic batch size |
Capacity Factor (α) | Max tokens per expert | capacity = α · tokens / num_experts |
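For reference, a minimal sketch of the auxiliary loss and capacity rule from the table, assuming top-1 dispatch indices are already available; the function names are ours, and production implementations add jitter noise, capacity masking, and distributed bookkeeping:

```python
import torch
import torch.nn.functional as F

def aux_load_balance_loss(router_logits, expert_idx, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i (the coefficient c is applied by the caller).

    router_logits: [tokens, num_experts] raw gate scores
    expert_idx:    [tokens] expert each token was dispatched to (top-1)
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually routed to expert i
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    # P_i: mean router probability mass placed on expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum tokens per expert; overflow tokens are dropped or passed through."""
    return int(capacity_factor * num_tokens / num_experts)
```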
Routing decides which sub-model to fire per query type, keeping inference cost constant even as the system's knowledge base expands.
Separate experts for credit, market, operational, and climate risk. A shared embedding and router allow dynamic switching across domains.
Experts specialized in molecule synthesis, toxicology, trial protocols, and post-market surveillance. The router learns to send prompts to the appropriate scientific expert automatically.
In DataXpert, a hierarchical MoE router orchestrates a Data Science Expert (for numerical analysis), a Programming Expert (for code synthesis), and a Business Expert (for KPI explanation), forming an Agentic MoE system that mimics real consulting workflows.
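The snippet below is a conceptual sketch of that two-level routing pattern (a group router followed by a sub-router within the chosen group), not DataXpert's actual implementation; all names and sizes are hypothetical:

```python
import torch
import torch.nn as nn

class HierarchicalRouter(nn.Module):
    """Two-level routing: pick an expert group first, then an expert inside it."""
    def __init__(self, d_model=512, num_groups=3, experts_per_group=4):
        super().__init__()
        self.group_gate = nn.Linear(d_model, num_groups)
        self.sub_gates = nn.ModuleList(
            [nn.Linear(d_model, experts_per_group) for _ in range(num_groups)]
        )

    def forward(self, x):
        # x: [tokens, d_model]
        group = self.group_gate(x).argmax(dim=-1)             # [tokens] chosen group id
        expert = torch.zeros_like(group)
        for g, gate in enumerate(self.sub_gates):
            mask = group == g
            if mask.any():
                expert[mask] = gate(x[mask]).argmax(dim=-1)   # expert id within group g
        return group, expert
```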
Model | Total Params | Active Params | Speed-up vs Dense | Paper |
---|---|---|---|---|
Switch Transformer | 1.6T | 10B | 4× | Google, 2021 |
GLaM | 1.2T | 97B | 12× | Google, 2022 |
DeepSeek-V2 | 236B | 21B | 10× | DeepSeek, 2024 |
These results show that conditional computation beats brute force, unlocking trillion-parameter capacity at sub-100-billion-parameter active cost.
```
Input ──▶ Shared Encoder
               │
           Router NN
               │ Top-k
   ┌───────────┴────────────┐
   │ Expert-1  Expert-2  …  │  ← each trained on sub-domain data
   └───────────┬────────────┘
               │
        Aggregate + FFN
               │
        Output / Logits
```
At runtime, the router picks a few experts per token, often dispatched across GPUs via all-to-all communication primitives.
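As a sketch of what that dispatch can look like with torch.distributed (assuming an already-initialized process group, tokens pre-sorted by destination rank, and one expert shard per rank; capacity limits and the return trip are omitted):

```python
import torch
import torch.distributed as dist

def dispatch_tokens(send_buf: torch.Tensor, send_counts: torch.Tensor) -> torch.Tensor:
    """send_buf:    [num_tokens, d_model], rows sorted by destination rank
    send_counts: [world_size] int64, number of tokens destined for each rank
    Returns the token embeddings this rank receives for its local expert."""
    # 1. Exchange per-rank token counts so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # 2. Exchange the token embeddings themselves with variable split sizes.
    recv_buf = send_buf.new_empty(int(recv_counts.sum()), send_buf.shape[1])
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv_buf
```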
Dimension | Dense | Mixture-of-Experts |
---|---|---|
Parameters active per token | All N | k ≪ N |
Compute efficiency | Low | High |
Training stability | Stable | Requires careful load balancing |
Memory footprint | Scales with N | Still scales with N (all experts are stored), though per-token compute scales with k |
Inference cost | Grows linearly with total parameters | Grows with active parameters only |
Interpretability | Uniform | Experts offer explainable modularity |
Emerging research trends:
- Coarse task selector → fine-grained expert (DeepSeek-V2)
- Text, image, and code unified in one MoE
- Dynamic expert spawning for new domains
- Multiple autonomous LLM agents specialized by role, precisely the paradigm Finarb is building for its multi-agent data-analytics systems
Metric | Dense Model | MoE Model |
---|---|---|
Training compute | 100% baseline | 25–30% of baseline |
Inference latency | Grows linearly with parameters | Roughly constant (only k experts active) |
Energy cost | High | Reduced |
Scalability | Limited by GPU RAM | Horizontally scalable across experts |
Domain adaptation | Full retrain | Add expert module only |
MoE fundamentally shifts the economics of AI, enabling enterprises to own large, modular AI systems that scale capacity without scaling cost.
MoE formalizes conditional computation (selectively using parts of a massive network), analogous to how human brains recruit specialized cortical regions per task.
Mathematically, the per-token compute of an MoE model is roughly

Compute_MoE ≈ p · Compute_dense(N),   where p = k/N

is the fraction of expert parameters active per token. Thus, you can increase N arbitrarily while keeping compute fixed by reducing p, the essence of scaling "horizontally" instead of "vertically."
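A small worked example of that relationship (the expert size and counts are illustrative assumptions): keeping k fixed while growing N leaves per-token compute unchanged.

```python
params_per_expert = 0.13e9   # e.g. a 4096 -> 16384 -> 4096 FFN (assumed size)
k = 2                        # experts activated per token

for N in (8, 64, 512):
    p = k / N
    total = N * params_per_expert
    active = k * params_per_expert      # independent of N
    print(f"N={N:3d}  p={p:.4f}  total={total/1e9:6.1f}B  active={active/1e9:.2f}B")
```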
Mixture-of-Experts architectures mark a paradigm shift:
For enterprises, that means AI systems that grow without growing costs: experts that specialize by function, department, or domain, much like a real organization.
At Finarb, this principle already powers our internal multi-agent products: KPIxpert with specialized expert modules for KPI optimization, and DataXpert with MoE-style orchestration of domain-specific data experts. Together, they exemplify how applied innovation meets scalable intelligence.