Regional-TinyStories (IJCNLP-AACL '25)

Regional-TinyStories: A Small Language Model Framework for Evaluating Language Learning, Tokenizers, and Datasets

Link to Our Paper: [arXiv](https://arxiv.org/abs/2504.07989)

Must read

  • ✨ All Models & Datasets can be found after this README section! ✨
  • Refer to our GitHub to find results, extensive guides, resources, and code for our TinyStories-Regional framework, which extends
    the TinyStories approach (Eldan & Li, 2023) to three major Indian languages: Hindi, Marathi, and Bangla.

A special thanks to

  • TensorDock for providing compute! Check them out for easy-to-deploy and affordable GPU/CPU VMs 💚
  • Microsoft for inspiring us with their original TinyStories paper 💙
  • Sarvam, SUTRA, and Andrej Karpathy for their open-source efforts ❤️

Abstract

Small, resource-efficient language models are pivotal for extending high-quality text generation to low-resource and regional languages, the true frontier of linguistic equity in AI. Yet research largely prioritises massive English-centric systems, leaving regional-centric (low-resource) language modelling underexplored, particularly how tokenizer design, dataset diversity, and linguistic structure shape the effectiveness of Small Language Models (SLMs) under realistic computational and data constraints. We present *Regional-TinyStories*, a lightweight framework that treats SLMs as cost-effective stand-ins for LLMs to enable rapid, variable-wise analysis. Extending TinyStories to Hindi, Marathi, and Bangla, we release datasets of 2M synthetic and translated stories per language and train over 20 SLMs spanning 5–157M parameters. Using this framework, we (i) uncover contrasts between form-oriented (grammar, fluency) and content-oriented (context, completeness, creativity) metrics; (ii) chart language-specific learning dynamics; (iii) rank tokenizers, showing the Indic-specific Sarvam-1 outperforming SUTRA and the generic Tiktoken (GPT-2) across all metrics; and (iv) demonstrate that dataset semantic quality (translation vs. synthetic) strongly governs downstream generation. Validation through an LLM-as-Judge ensemble (GPT-4o, LLaMA-3.3-70B) and a 100+ participant human study confirms these trends while exposing systematic score inflation in automated evaluations. *Regional-TinyStories* offers a reproducible path to benchmark tokenizers, datasets, and SLM designs for scalable, context-faithful generation in low-resource languages.


โš™๏ธ Usage


๐Ÿ’ฐ Costs

๐Ÿ› ๏ธ Overview

โš™๏ธ Hardware Details

| Model Size | Training Time (1x H100 80GB) | Cost ($2.0/hr) |
|-----------:|------------------------------|----------------|
| 5M         | ~6 hr                        | ~12 USD        |
| 54M        | ~8 hr                        | ~16 USD        |
| 157M       | ~16 hr                       | ~32 USD        |
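The costs above follow directly from GPU-hours multiplied by the hourly rate. A minimal sketch of that arithmetic (the sizes and approximate times are the table values; the $2.0/hr rate is the quoted TensorDock price for a 1x H100 80GB VM):

```python
# Estimate single-GPU training cost as hours * hourly rate,
# using the H100 figures from the table above.
HOURLY_RATE_USD = 2.0  # 1x H100 80GB at $2.0/hr

def training_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return the estimated cost in USD for a single training run."""
    return hours * rate

# Table rows: model size -> approximate training time in hours.
runs = {"5M": 6, "54M": 8, "157M": 16}
for size, hours in runs.items():
    print(f"{size}: ~{training_cost(hours):.0f} USD")  # 5M: ~12 USD, etc.
```

Scaling to other providers or GPUs is just a matter of swapping in the relevant hourly rate.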

๐Ÿ“ Citation

If you use Vizuara's Regional-TinyStories in your research, please cite us using the following BibTeX entry:

@misc{patil2025regionaltinystoriesusing,
      title={Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance}, 
      author={Nirvan Patil and Malhar Abhay Inamdar and Agnivo Gosai and Guruprasad Pathak and Anish Joshi and Aryan Sagavekar and Anish Joshirao and Raj Dandekar and Rajat Dandekar and Sreedath Panat},
      year={2025},
      eprint={2504.07989},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.07989}, 
}