SinLlama: Sri Lanka launches largest Sinhala LLM with 10 million sentences

August 27, 2025 at 7:20 PM

Research students at the Department of Computer Science and Engineering, University of Moratuwa, have developed the country’s first large-scale large language model (LLM) dedicated exclusively to Sinhala, a breakthrough in advancing local language computing.

This project was jointly supervised by Dr Surangika Ranathunga (Massey University, New Zealand), Dr Nisansa de Silva (University of Moratuwa) and Dr Rishemjit Kaur (Central Scientific Instruments Organisation, India).

The model, named “SinLlama,” was built by continually pre-training Llama-3-8B on nearly 10 million Sinhala sentences. According to the research team, SinLlama is the largest Sinhala LLM to date and has already outperformed Llama-3-8B on Sinhala text classification benchmarks.

Both the model and the dataset have been made freely available for researchers and innovators.

The researchers said the release of SinLlama and its accompanying 10-million-sentence dataset is expected to support wider research and innovation, ensuring that local languages thrive in the emerging AI era.

The 10-million-sentence dataset is publicly available: lnkd.in/gi43HaXg

More information: lnkd.in/gc3A4jUt