Starmind Pico — LM on a microcontroller

Is it possible to train a Language Model to run on the RP2040? Yes. Dumb but fast.

Background

I had a TinyPico (RP2040) lying around the office for the last 8 months. During a recent hackathon, I decided to whip it out for a weekend project and thought to myself: I bet I can make a Language Model run on this.

Hackathon

What I set out to do:

  • Find a Language Model architecture that runs well on RP2040
  • Train a model
  • Run the model

Architecture Analysis

I analyzed five factors of the Language Model architecture and how each affects inference speed and quality:

1. Dimension Size — 40-50% speed loss per doubling
2. Layer Depth — 25-40% speed loss per additional layer
3. Attention Heads — 20-25% speed loss per doubling
4. FFN Ratio — 15-20% speed loss per doubling
5. Vocabulary Size — 8-12% speed loss per doubling

176 architectural variants tested — as far as I know, the most comprehensive microcontroller transformer study to date.
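These per-factor penalties compound multiplicatively, so a rough throughput estimate can be sketched from the midpoints of the ranges above. This is a back-of-envelope model, not a fitted one; the 32 tok/s baseline comes from the fastest measured configuration in the results below:

```python
# Back-of-envelope throughput model from the measured per-factor penalties.
# Uses midpoints of the reported ranges; baseline is the fastest config.
def estimated_tok_per_s(base=32.0, dim_doublings=0, extra_layers=0,
                        head_doublings=0, ffn_doublings=0, vocab_doublings=0):
    penalties = [
        (0.450, dim_doublings),    # dimension size: ~40-50% per doubling
        (0.325, extra_layers),     # layer depth: ~25-40% per extra layer
        (0.225, head_doublings),   # attention heads: ~20-25% per doubling
        (0.175, ffn_doublings),    # FFN ratio: ~15-20% per doubling
        (0.100, vocab_doublings),  # vocab size: ~8-12% per doubling
    ]
    speed = base
    for loss, count in penalties:
        speed *= (1.0 - loss) ** count
    return speed
```

Doubling the dimension twice, adding a layer, and quadrupling the vocabulary already drops the estimate into the low single digits, which lines up with the "balanced" results below.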

Key Findings

Speed Results

  • Fastest model: 32.0 tokens/second (1D architecture)
  • Most balanced: 2-3 tokens/second (8K parameters)
  • Memory limit: 256 vocabulary tokens maximum due to SRAM fragmentation
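For context on that vocabulary ceiling: the RP2040 has 264 KB of SRAM, and the embedding table scales linearly with vocabulary size. A quick budget sketch (float32 weights; the specific sizes are illustrative, not from the project):

```python
# Rough SRAM budget for the token embedding table on RP2040 (264 KB SRAM).
SRAM_BYTES = 264 * 1024

def embedding_bytes(vocab_size, dim, bytes_per_weight=4):
    """Size of a vocab_size x dim embedding table in bytes."""
    return vocab_size * dim * bytes_per_weight

for vocab in (64, 256, 1024):
    kb = embedding_bytes(vocab, dim=8) / 1024
    print(f"vocab={vocab:5d}: {kb:6.1f} KB ({kb * 1024 / SRAM_BYTES:.1%} of SRAM)")
```

Note that even a 1024-token table would fit in raw bytes; the 256 limit is about fragmentation — the allocator failing to find a single contiguous block once the heap is chopped up.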

Microcontroller-Specific Insights

  • Quantization paradox: Saves memory but slows down inference due to de-quantization overhead
  • RP2040 bottleneck: Computation speed, not memory bandwidth
  • KV caching: 3-7x speed improvement for multi-token generation
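The KV-cache win comes from not recomputing past keys and values at every generation step. A minimal single-head sketch in NumPy (illustrative only; the on-device version would be plain C arrays):

```python
import numpy as np

# Minimal single-head attention with a KV cache (illustrative sketch).
# Each generation step appends one key/value row instead of re-projecting
# K and V for the whole prefix, which is where the 3-7x speedup comes from.
class KVCache:
    def __init__(self, dim):
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))

    def step(self, k, v, q):
        # Append this step's key/value, then attend over the full cache.
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])
        scores = self.K @ q / np.sqrt(len(q))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.V
```

Per step this does O(t) work against the cache rather than O(t) fresh matrix projections, so the saving grows with sequence length.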
Recommended Configurations

# Maximum Speed (1K params)
optimal_speed = {
    'vocab_size': '32-64',
    'dim': '1-2',
    'hidden_dim': '64-128',
    'n_layers': 1,
    'n_heads': '6-8',
    'expected_speed': '20-32 tok/s'
}

# Balanced Production (8K params)
balanced_production = {
    'vocab_size': '256-512',
    'dim': '6-8',
    'hidden_dim': '192-256',
    'n_layers': '2-3',
    'n_heads': 8,
    'expected_speed': '2-5 tok/s'
}
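As a sanity check on the "1K" and "8K" labels, here is a rough parameter count under standard transformer formulas (a sketch assuming tied input/output embeddings, no biases, and a plain two-matrix FFN; the actual model code may differ):

```python
# Rough transformer parameter count (sketch; assumes tied embeddings,
# no biases, and a two-matrix FFN -- the actual architecture may differ).
def param_count(vocab_size, dim, hidden_dim, n_layers):
    embedding = vocab_size * dim        # tied input/output embedding table
    attention = 4 * dim * dim           # Wq, Wk, Wv, Wo per layer
    ffn = 2 * dim * hidden_dim          # up- and down-projection per layer
    return embedding + n_layers * (attention + ffn)

print(param_count(256, 8, 192, 2))  # → 8704, close to the "8K params" label
```

The balanced config lands near 8.7K parameters, and the smallest settings land in the high hundreds, consistent with the labels above.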

Training

I trained the models on an H100 rented through Prime Intellect. Each model takes about 2 minutes to train; the full set of initial training runs took 12 hours (about $20 total).

Conclusion

Due to SRAM fragmentation on the RP2040, the maximum practical vocabulary size is 256. After spending $20+ on GPU time for pre-training, the models generate a few coherent words, which is promising. People on Reddit liked it.
