
Training an NLP Model for Chichewa Language Understanding

In the realm of natural language processing (NLP), most models and resources are built for widely spoken languages like English or Spanish. Chichewa, by contrast, is a Bantu language with over 10 million speakers in Southern Africa, especially Malawi, yet it has comparatively few language resources. This blog post explores my journey in training an NLP model to understand Chichewa, the challenges I faced, and the insights gained from working with an under-resourced language.

Project Goals

The primary goal of this project was to build a model capable of converting Chichewa audio to text. I aimed for high accuracy, measured primarily by Word Error Rate (WER) and Character Error Rate (CER).

Dataset

To train the model, I created a custom dataset consisting of Chichewa audio clips paired with transcriptions. Each audio file was in .wav format and linked with a corresponding text file containing the transcription. I then converted this data into a CSV format with fields for wav_filename, wav_filesize, and transcript, making it compatible with the model’s training pipeline.
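As a rough sketch of that conversion step, the snippet below walks a folder of .wav clips, reads the matching .txt transcripts, and writes the three-column CSV. The directory layout and file names here are illustrative assumptions, not the exact paths used in the project.

```python
# Minimal sketch: build the training CSV from a folder of .wav clips with
# matching .txt transcripts. Paths and file layout are assumptions.
import csv
import os
from pathlib import Path

clips_dir = Path("data/clips")                 # assumed location of the audio clips
rows = []
for wav_path in sorted(clips_dir.glob("*.wav")):
    txt_path = wav_path.with_suffix(".txt")    # matching transcription file
    if not txt_path.exists():
        continue
    transcript = txt_path.read_text(encoding="utf-8").strip().lower()
    rows.append({
        "wav_filename": str(wav_path),
        "wav_filesize": os.path.getsize(wav_path),
        "transcript": transcript,
    })

with open("data/train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["wav_filename", "wav_filesize", "transcript"])
    writer.writeheader()
    writer.writerows(rows)
```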

Training Process

Using tools like TensorBoard, I monitored the model's performance over time. The step-loss curve (shown in the image below) decreases steadily over thousands of steps, indicating that the model is progressively learning to align Chichewa audio features with their textual counterparts.

Figure: TensorBoard visualization of the training step loss
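For context, these curves can be viewed directly from a Colab notebook cell; the log directory path below is an assumption about where the training pipeline writes its event files.

```python
# Colab/Jupyter cell: load the TensorBoard extension and point it at the
# directory containing the training summaries (path is illustrative).
%load_ext tensorboard
%tensorboard --logdir /content/chichewa-stt/summaries
```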

Key metrics:

  • WER (Word Error Rate): the fraction of words that are substituted, deleted, or inserted relative to the reference transcript. A low WER indicates that the model transcribes Chichewa audio with a high degree of accuracy at the word level.
  • CER (Character Error Rate): the same edit-distance measure computed over individual characters rather than whole words, validating accuracy at a more granular level (see the sketch after this list).
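As a reference point (not the project's exact evaluation code), both metrics can be computed from the Levenshtein edit distance between the reference and the hypothesis:

```python
# Reference implementation of WER and CER via Levenshtein edit distance.
# This mirrors the standard definitions, not the exact code used during training.
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution (or match)
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

ref = "koma ma potholes ali kwathu ku naperi"
hyp = "koma ma potholes ali kwathu ku naperi"
print(wer(ref, hyp), cer(ref, hyp))   # 0.0 0.0 for a perfect transcription
```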

Challenges

  1. Limited Resources: Unlike English, where large datasets are readily available, Chichewa lacks extensive language resources, so I had to curate and preprocess my own dataset.
  2. Pronunciation Variability: Chichewa pronunciation varies regionally, which affects the model’s accuracy when it is exposed to unfamiliar accents.
  3. Computational Requirements: Training an NLP model, especially one using deep learning, requires significant computational power. I used Google Colab with a T4 GPU, which helped manage training costs (a quick check that the GPU is attached is shown below).
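As a small, assumed example of that check in a Colab cell (using TensorFlow here, though torch.cuda.is_available() would serve equally well):

```python
# Confirm that Colab has attached a GPU (e.g. a T4) before starting training.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs available:", gpus if gpus else "none - check the runtime type")
```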

Results

The final model showed promising results, with a WER of 0.0625 and a CER of 0.017876 on the test set. The sample transcriptions below illustrate its ability to understand and transcribe Chichewa audio accurately:

  • Sample 1: “koma ma potholes ali kwathu ku naperi uku umachita kuda nkhawa ukamabwerera kunyumba tu koma”
  • Sample 2: “mudzapha munthu za ziii”

Figure: model inference output
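The training CSV schema (wav_filename, wav_filesize, transcript) matches the Mozilla DeepSpeech / Coqui STT convention, so, assuming a toolkit of that kind, transcribing a new clip with the exported model might look roughly like this; the model and audio file names are placeholders.

```python
# Hedged sketch: transcribing a 16 kHz mono 16-bit .wav file with a
# DeepSpeech-style exported model. File names below are placeholders.
import wave
import numpy as np
from deepspeech import Model

ds = Model("chichewa.pbmm")  # exported acoustic model (placeholder name)

with wave.open("sample.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # prints the predicted Chichewa transcript
```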

Conclusion

This project highlights the potential of NLP for African languages and the impact of developing language technology for underrepresented communities. Training an NLP model to understand Chichewa was challenging but rewarding, underscoring the importance of diverse language models. Going forward, I plan to refine the model further and explore its application in various Chichewa language tasks, from voice-activated assistance to real-time transcription.