A framework for development of Small Language Models for Indian regional languages, serving both as a practical alternative to LLMs and as a foundation for comparative analysis of tokenization strategies, machine translation performance, and linguistic complexity
📄 **[Link to our paper](https://arxiv.org/abs/2504.07989)** (must read!)
✨ All Models & Datasets can be found after this README section! ✨
- Refer to our GitHub to find results, extensive guides, resources, and code for our TinyStories-Regional framework, which extends the TinyStories approach (Eldan & Li, 2023) to three major Indian languages: Hindi, Marathi, and Bangla.
- Our framework enables training and inference of Small Language Models (SLMs) ranging from 5M to 150M parameters, which are subsequently employed as proxies for a variety of comparative analyses.
A special thanks to
- TensorDock for providing compute! Check them out for easy-to-deploy and affordable GPU/CPU VMs 💚
- Microsoft for inspiring us with their original TinyStories paper 💙
- Sarvam, SUTRA, and Andrej Karpathy for their open-source efforts ❤️
**Key findings:**

- **The Sarvam-1 tokenizer outperforms SUTRA-mlt256-v2** across key linguistic metrics (context, completeness, creativity, fluency, and grammar) in inference-time story generation.
- **Synthetic data outperforms machine-translated data**, highlighting the limitations of current translation tools compared to SOTA LLM-generated content.
- **Marathi is the most challenging language**, followed by Bengali, with Hindi the easiest for generating high-quality inferences. This holds across both SOTA LLMs (4o, gemini-1.5) and our SLMs.

**Pipeline costs:**

- **Pipeline**: free
- **Data generation**: free (using gpt4free)
- **CPU inference**: supported, ~free
- **Total cost** to generate your custom Regional-SLM: ~15 USD :)
- **First-time setup effort**: 2-6 hours; time is money, after all :)

| Model Size | Training Time (1x H100 80GB) | Cost ($2.0/hr) |
|---|---|---|
| 5M | ~6 hr | ~12 USD |
| 54M | ~8 hr | ~16 USD |
| 157M | ~16 hr | ~32 USD |
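The costs above are simple GPU-hour arithmetic (hours × hourly rate). A minimal sketch for estimating your own runs — the helper name and rate constant are ours, not part of the framework:

```python
# Estimate single-GPU training cost from billed GPU-hours.
# Assumes a flat hourly rate ($2.0/hr for a 1x H100 80GB, as in
# the table above); adjust for your own provider.

HOURLY_RATE_USD = 2.0

def training_cost_usd(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost of a training run billed per GPU-hour."""
    return hours * rate

# Approximate training times from the table: model size -> GPU-hours
TRAINING_HOURS = {"5M": 6, "54M": 8, "157M": 16}

for size, hours in TRAINING_HOURS.items():
    print(f"{size}: ~{training_cost_usd(hours):.0f} USD")
# -> 5M: ~12 USD, 54M: ~16 USD, 157M: ~32 USD
```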
If you use Vizuara's TinyStories Regional in your research, please cite us using the following BibTeX entry:
@misc{patil2025regionaltinystoriesusing,
title={Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance},
author={Nirvan Patil and Malhar Abhay Inamdar and Agnivo Gosai and Guruprasad Pathak and Anish Joshi and Aryan Sagavekar and Anish Joshirao and Raj Dandekar and Rajat Dandekar and Sreedath Panat},
year={2025},
eprint={2504.07989},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.07989},
}