Most natural language processing (NLP) research today revolves around massive pre-trained multilingual models like mBERT or GPT-3. The idea is simple: fine-tune a large model on your language of interest. But what happens when you strip away these enormous pre-trained models and build something from the ground up?
In this post, we share our journey of building Msingi1, a Swahili-first language model trained entirely from scratch. With no reliance on multilingual corpora, no pretraining on English data, and no gigantic parameter count, Msingi1 represents a small but significant step towards understanding how language models can be created for low-resource languages.
Our primary goal was simple:
Can a tiny GPT-style model, trained on just Swahili data, learn to generate meaningful text?
After training Msingi1 for 4 hours on a T4 GPU, here’s what we learned.
The current landscape of NLP is dominated by fine-tuning pre-trained models like mBERT or GPT-2. These models are pre-trained on vast multilingual corpora that include English, Chinese, French, and hundreds of other languages. While this approach works for many languages, especially high-resource ones, it comes with several problems for low-resource languages: the target language makes up only a tiny fraction of the training data, the tokenizer is tuned to other languages, and the model inherits assumptions baked in during English-heavy pretraining.
By training a model from scratch, we let the system learn directly from raw Swahili text: no interference from English, no pre-baked assumptions. This approach also lets us explore how much a small model can learn without the crutch of multilingual pretraining.
We opted for a GPT-style transformer with 6 layers and a relatively small number of parameters. This choice ensured that we could train the model with limited computational resources while still experimenting with a model architecture that reflects those used by much larger models like GPT-3.
| Component | Value |
|---|---|
| Layers | 6 |
| Hidden size | 384 |
| Attention heads | 6 |
| Parameters | ~85M |
| Sequence length | 512 tokens |
| Architecture | Decoder-only |
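To make the table concrete, here is a minimal sketch of how a comparable decoder-only model could be instantiated with the Hugging Face `transformers` library. It mirrors the hyperparameters above but is not our exact training code.

```python
# Sketch: a decoder-only transformer matching the table above,
# using Hugging Face's GPT-2 implementation as a stand-in.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,  # matches the Unigram tokenizer described below
    n_positions=512,    # sequence length
    n_embd=384,         # hidden size
    n_layer=6,          # transformer layers
    n_head=6,           # attention heads
)
model = GPT2LMHeadModel(config)
```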
At ~85 million parameters, this model is small by modern standards — significantly smaller than GPT-3 or even GPT-2. Yet it still contains enough complexity to model the intricate syntax and semantics of Swahili, a Bantu language with complex noun-class systems and verb morphology.
This decision to go small was intentional. We didn’t want to take shortcuts or rely on enormous compute power. We wanted to explore what a compact model can achieve and examine where its boundaries lie.
The dataset we used for Msingi1 is diverse, spanning several genres of Swahili text, and totals approximately 45 million tokens.
This diverse corpus ensured that the model would be exposed to a wide range of Swahili, from formal language in news reports to more colloquial expressions in online discussions.
For tokenization, we decided to use a Unigram tokenizer with a vocabulary size of 32,000. We chose Unigram because it strikes a balance between efficiency and handling rare words — important for Swahili’s rich morphology.
SentencePiece's Unigram implementation allowed us to manage the nuances of Swahili morphology, which often produces long, heavily inflected words. Unlike English, Swahili attaches prefixes and suffixes that significantly change the meaning of the root word, so choosing the right tokenization method was critical.
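As an illustration, training a 32,000-token Unigram vocabulary with SentencePiece looks roughly like this; the corpus path and model prefix are placeholders rather than our actual file names.

```python
# Sketch: train a Unigram tokenizer on raw Swahili text with SentencePiece.
# "swahili_corpus.txt" and "msingi1_sp" are placeholder names.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="swahili_corpus.txt",  # one sentence or document per line
    model_prefix="msingi1_sp",   # writes msingi1_sp.model and msingi1_sp.vocab
    vocab_size=32_000,
    model_type="unigram",
    character_coverage=1.0,      # Swahili uses the Latin script; keep all characters
)

sp = spm.SentencePieceProcessor(model_file="msingi1_sp.model")
print(sp.encode("amempikia", out_type=str))  # a heavily affixed verb, split into subwords
```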
Training Msingi1 was a real test of patience. We used a T4 GPU with a batch size of 8, training for 4 hours. Given the size of the model and the hardware constraints, this was an aggressive but necessary choice to understand how quickly the model could converge on meaningful patterns.
Despite the limited batch size, the model showed stable training dynamics. Loss decreased consistently without any signs of divergence or overfitting. This is a clear indication that even smaller models, when trained from scratch on the right data, can learn meaningful representations.
The relatively small batch size was manageable on the T4, but there were certainly limitations. Longer training durations and larger batches could likely improve the model’s final performance, but this small-scale experiment was designed to test the core viability of training such a model.
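For context, a causal language modeling loop under these constraints might look like the sketch below. It reuses the model from the architecture sketch, and the random `token_blocks` tensor is a stand-in for the real corpus tokenized and packed into 512-token blocks.

```python
# Sketch: minimal causal-LM training loop with batch size 8, as in our T4 run.
# `model` is the GPT2LMHeadModel from the architecture sketch above.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder data: replace with the real corpus packed into 512-token blocks.
token_blocks = torch.randint(0, 32_000, (1024, 512))
loader = DataLoader(TensorDataset(token_blocks), batch_size=8, shuffle=True)

optimizer = AdamW(model.parameters(), lr=3e-4)
model.train()
for (batch,) in loader:
    batch = batch.to(device)
    # For causal LM, labels are the inputs; the model shifts them internally.
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```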
We experimented with text generation to evaluate how well Msingi1 handles Swahili. The most interesting results came from the sampling parameters: we varied the temperature, top-k, and repetition penalty to see how the model's outputs would change.
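Continuing from the sketches above, a generation call with these knobs might look like the following; the parameter values are illustrative, not the exact settings behind each sample.

```python
# Sketch: sampling with temperature, top-k, and a repetition penalty.
prompt = "Habari ya leo ni"
input_ids = torch.tensor([sp.encode(prompt)], device=device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,         # lower values make sampling more conservative
    top_k=50,                # sample only from the 50 most likely tokens
    repetition_penalty=1.2,  # discourage loops like "sheria sheria"
)
print(sp.decode(output_ids[0].tolist()))
```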
Prompt:
Habari ya leo ni ("Today's news is")
Generated Output:
Habari ya leo ni mbili sheria sheria sana eneo tena jeshi bila...
🧠 Observation: The model generates recognizably Swahili syntax, although it frequently repeats words like "sheria" (law). This kind of repetition is common in small models trained for a short time, and it is exactly the behavior the repetition penalty is meant to discourage.
Prompt:
Rais wa Jamhuri ya Muungano wa ("The President of the United Republic of")
Generated Output:
Rais wa Jamhuri ya Muungano wa Tanzania ametangaza kuwa uchaguzi mkuu utafanyika mwezi ujao. ("The President of the United Republic of Tanzania has announced that the general election will take place next month.")
✅ Observation: The model generates more coherent output when the prompt is formal and contextually rich. The sentence is both grammatically correct and contextually relevant, demonstrating that the model has learned how to handle formal Swahili.
These experiments showed that Msingi1 is capable of generating meaningful Swahili text, though it still needs improvements in fluency and repetition control.
From our journey with Msingi1, we uncovered several key lessons that will inform our next steps:
Training a model from scratch for Swahili demonstrates that low-resource languages do not need multilingual models or massive datasets to obtain functional, meaningful language models.
This experiment highlights the potential for building language-specific models for African languages and other low-resource languages. It's not about creating the biggest model possible — it's about building a foundation that reflects the structure and nuances of the language itself.
We plan to:
- Train for longer, on more data and with larger batches, pushing past the limits of this 4-hour run
- Improve fluency and repetition control in generation
- Share updates as the project evolves
Language models are not just about scaling up. They’re about understanding, interpreting, and reflecting the languages we work with. Msingi1 is a small but meaningful step toward creating equitable NLP tools that respect language and culture, from the ground up.
Follow @MsingiAI for future updates as we continue to refine and expand this exciting project.