TinyGSM

TinyGSM — Training tiny models on grade school math

Dumb model trained on simple math = smarter model?

Background

In 2023, the TinyGSM paper from Microsoft showed that training a Small Language Model on Grade School Math helps it out-perform much larger models on the GSM8K math benchmark.

Original TinyGSM results

This was achieved by generating synthetic data from GPT-3.5 and training fine-tuned small models on that data. The fine-tuned model uses code to solve math problems.

Simple math problem example

My Approach: Quality & Diversity

Since 2023, there are wayyyy more Large Language Models and they are magnitudes better than GPT-3.5.

After reading a BabyAGI Paper about how teacher model diversity improves the student fine-tuned models, I decided to replicate the TinyGSM experiment with multiple high-quality teacher models:

  • GPT 4.1, GPT 4.1 mini & GPT 4.1 nano (Azure AI Foundry)
  • o4-mini (Azure AI Foundry)
  • Llama3.3 70B (AWS Bedrock)
  • Mixtral 8x7B (AWS Bedrock)
  • Deepseek R1 (Cloudrift)
  • Llama3.1 8B (Local GPU)

How to Pay for All This?

Coins

The original TinyGSM dataset contains 1.8B tokens (12 million question-answer pairs) and costs $3,600 to generate. Unlike Microsoft, I'm a startup founder. So I tried to do this as cheaply as possible:

  • Azure AI Foundry: $1,000 in Microsoft Startup credits
  • AWS Bedrock: $200 for mentoring at an AWS Hackathon
  • Prime Intellect: $2,000 from Inflection Grant
  • CloudRift: $800 in GPU credits

Overall $1,100 in cloud credits were spent.

Synthetic Data Generation

Synthetic data generation pipeline

After running the generator for 7 days straight on my Mac Mini, I created 12 TinyGSM sub-datasets in three types:

  • Instructed: LLM given explicit examples of how code should be generated
  • No-example: No example given, code is free-style
  • Reasoning: Reasoning LLM answers with their reasoning chains

12 sub-datasets

Data Quality Findings

  • GPT4.1 & GPT4.1 mini: Almost 100% correct responses
  • o4-mini & Deepseek R1: Amazing reasoning capabilities
  • Llama3.3 70B & Llama3.1 8B: Frequently correct but overly verbose
  • Mixtral 8x7B: Quality is horrible, non-working code & missing details
  • GPT 4.1 nano: Also bad quality

When given no examples, model responses are more verbose and less correct. When given a chance to reason, models perform much better.

Model Fine-tuning

Training time

Fine-tuning a Gemma3-270M model — small enough to run on Raspberry Pi 2 W and insanely fast on Orange Pi 5. Using Unsloth for fine-tuning.

Resources

Get Lab Notes

Experiments, builds & behind-the-scenes from the workshop. No spam.