TinyStories Regional

A framework for developing Small Language Models for Indian regional languages, serving both as a practical alternative to LLMs and as a foundation for comparative analysis of tokenization strategies, machine translation performance, and linguistic complexity.

Link to Our Paper

arXiv: https://arxiv.org/abs/2504.07989

Must read

  • ✨ All Models & Datasets can be found after this README section! ✨
  • Refer to our GitHub to find results, extensive guides, resources and code for our TinyStories-Regional framework, which extends
    the TinyStories approach (Eldan & Li, 2023) to three major Indian languages: Hindi, Marathi, and Bangla.
  • Our framework enables training and inference of Small Language Models (SLMs) ranging from 5M to 150M parameters, which are subsequently employed as proxies for a variety of comparative analyses

A special thanks to

  • TensorDock for providing compute! Check them out for easy-to-deploy and affordable GPU/CPU VMs 💚
  • Microsoft for inspiring us with their original TinyStories paper 💙
  • Sarvam, SUTRA, and Andrej Karpathy for their open-source efforts ❤️

Key Findings

  • The Sarvam-1 tokenizer outperforms the SUTRA-mlt256-v2 tokenizer across key linguistic metrics (context, completeness, creativity, fluency, and grammar) in inference-time story generation.
  • Synthetic data outperforms machine-translated data, highlighting the limitations of current translation tools compared to SOTA LLM-generated content.
  • Language models find Marathi most challenging, followed by Bengali, with Hindi being the easiest for generating high-quality inferences. This holds across both SOTA LLMs (GPT-4o, Gemini-1.5) and our SLMs.

⚙️ Usage

Model weights and inference support on HuggingFace will be available soon!

Datasets can be downloaded (below) as .json files. Please check the format of each entry using the dataset viewer tab :)
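Once downloaded, a dataset file can be loaded with the standard library alone. A minimal sketch (the filename below is a placeholder; substitute the actual .json file you downloaded from this page, and check the dataset viewer tab for each entry's exact fields):

```python
import json

def load_stories(path):
    """Load a TinyStories-Regional dataset file (assumed to be a JSON list of entries)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical filename -- replace with the file you actually downloaded.
# stories = load_stories("tinystories_hindi_train.json")
# print(len(stories), stories[0])
```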


💰 Costs

🛠️ Overview

  • Pipeline
    • Prompt Generation (free)
    • Data Generation (free using gpt4free)
    • Training an SLM (<20 USD using TensorDock 💚)
    • Inference (CPU inference supported, ~free)
  • Total cost to generate your custom Regional-SLM: ~15 USD :)
  • First Time Setup Effort:
    • Assuming intermediate competency with DL and LLMs
    • 2-6 hours; Time is money, after all :)

⚙️ Hardware Details

| Model Size | Training Time (1× H100 80GB) | Cost ($2.0/hr) |
|------------|------------------------------|----------------|
| 5M         | ~6 hr                        | ~12 USD        |
| 54M        | ~8 hr                        | ~16 USD        |
| 157M       | ~16 hr                       | ~32 USD        |
  • Using 2×H100 doubles both VRAM and hourly cost but halves the training time, keeping the total cost unchanged.
  • Because training is VRAM-bound rather than FLOP- or architecture-bound, the RTX A6000 offers the best cost efficiency per GB of VRAM.
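The cost arithmetic behind the table and the multi-GPU note can be sketched as follows (the rate and GPU-hours come from the table above; this is an estimate, not a quote):

```python
HOURLY_RATE_H100 = 2.0  # USD per hour for 1x H100 80GB, as in the table above

def training_cost(gpu_hours, n_gpus=1, rate=HOURLY_RATE_H100):
    # With n GPUs, the hourly bill scales by n while wall-clock time drops ~1/n,
    # so the total cost stays roughly constant for this VRAM-bound workload.
    wall_clock_hours = gpu_hours / n_gpus
    return wall_clock_hours * (rate * n_gpus)

print(training_cost(16))     # 157M model on 1x H100: ~32 USD
print(training_cost(16, 2))  # same total on 2x H100, half the wall-clock time
```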

📝 Citation

If you use Vizuara's TinyStories Regional in your research, please cite us using the following BibTeX entry:

@misc{patil2025regionaltinystoriesusing,
      title={Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance}, 
      author={Nirvan Patil and Malhar Abhay Inamdar and Agnivo Gosai and Guruprasad Pathak and Anish Joshi and Aryan Sagavekar and Anish Joshirao and Raj Dandekar and Rajat Dandekar and Sreedath Panat},
      year={2025},
      eprint={2504.07989},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.07989}, 
}