Regional-TinyStories (IJCNLP-AACL '25)

Regional-TinyStories: A Small Language Model Framework for Evaluating Language Learning, Tokenizers, and Datasets

Link to Our Paper: [arXiv](https://arxiv.org/abs/2504.07989)

Must read

  • ✨ All Models & Datasets can be found after this README section! ✨
  • Refer to our GitHub to find results, extensive guides, resources, and code for our TinyStories-Regional framework, which extends
    the TinyStories approach (Eldan & Li, 2023) to three major Indian languages: Hindi, Marathi, and Bangla.

A special thanks to

  • TensorDock for providing compute! Check them out for easy-to-deploy and affordable GPU/CPU VMs 💚
  • Microsoft for inspiring us with their original TinyStories paper 💙
  • Sarvam, SUTRA, and Andrej Karpathy for their open-source efforts ❤️

Abstract

Small, resource-efficient language models are pivotal for extending high-quality text generation to low-resource and regional languages, the true frontier of linguistic equity in AI. Yet research largely prioritises massive English-centric systems, leaving regional-centric (low-resource) language modelling underexplored, particularly how tokenizer design, dataset diversity, and linguistic structure shape the effectiveness of Small Language Models (SLMs) under realistic computational and data constraints. We present *Regional-TinyStories*, a lightweight framework that treats SLMs as cost-effective stand-ins for LLMs to enable rapid, variable-wise analysis. Extending TinyStories to Hindi, Marathi, and Bangla, we release datasets of 2M synthetic and translated stories per language and train over 20 SLMs spanning 5–157M parameters. Using this framework, we (i) uncover contrasts between form-oriented (grammar, fluency) and content-oriented (context, completeness, creativity) metrics; (ii) chart language-specific learning dynamics; (iii) rank tokenizers, showing the Indic-specific Sarvam-1 outperforming SUTRA and the generic Tiktoken (GPT-2) across all metrics; and (iv) demonstrate that dataset semantic quality (translation vs. synthetic) strongly governs downstream generation. Validation through an LLM-as-Judge ensemble (GPT-4o, LLaMA-3.3-70B) and a 100+ participant human study confirms these trends while exposing systematic score inflation in automated evaluations. *Regional-TinyStories* offers a reproducible path to benchmark tokenizers, datasets, and SLM designs for scalable, context-faithful generation in low-resource languages.


โš™๏ธ Usage


๐Ÿ’ฐ Costs

๐Ÿ› ๏ธ Overview

โš™๏ธ Hardware Details

| Model Size | Training Time (1x H100 80GB) | Cost ($2.0/hr) |
|-----------:|------------------------------|----------------|
| 5M         | ~6 hr                        | ~12 USD        |
| 54M        | ~8 hr                        | ~16 USD        |
| 157M       | ~16 hr                       | ~32 USD        |
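The costs above follow directly from GPU-hours multiplied by the hourly rate. A minimal sketch of that arithmetic (the sizes and approximate times are the table values; the $2.0/hr rate is the quoted TensorDock price for a 1x H100 80GB VM):

```python
# Estimate single-GPU training cost as hours * hourly rate,
# using the H100 figures from the table above.
HOURLY_RATE_USD = 2.0  # 1x H100 80GB at $2.0/hr

def training_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return the estimated cost in USD for a single training run."""
    return hours * rate

# Table rows: model size -> approximate training time in hours.
runs = {"5M": 6, "54M": 8, "157M": 16}
for size, hours in runs.items():
    print(f"{size}: ~{training_cost(hours):.0f} USD")  # 5M: ~12 USD, etc.
```

Scaling to other providers or GPUs is just a matter of swapping in the relevant hourly rate.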

๐Ÿ“ Citation

If you use Vizuara's Regional-TinyStories in your research, please cite us using the following BibTeX entry:

@misc{patil2025regionaltinystoriesusing,
      title={Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance}, 
      author={Nirvan Patil and Malhar Abhay Inamdar and Agnivo Gosai and Guruprasad Pathak and Anish Joshi and Aryan Sagavekar and Anish Joshirao and Raj Dandekar and Rajat Dandekar and Sreedath Panat},
      year={2025},
      eprint={2504.07989},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.07989}, 
}