SinLlama: Sri Lanka launches largest Sinhala LLM with 10 million sentences

August 27, 2025 at 7:20 PM

Researchers at the University of Moratuwa have developed the country’s first large-scale Sinhala-only large language model (LLM), a breakthrough in advancing local language computing.

The model, named “SinLlama,” was built by continually pre-training Llama-3-8B on nearly 10 million Sinhala sentences. According to the research team, SinLlama is the largest Sinhala LLM to date and has already outperformed the base Llama-3-8B model on Sinhala text classification benchmarks.

Both the model and the dataset have been made freely available for researchers and innovators.

The project was led by the university’s Department of Computer Science and Engineering.

Officials said the release of SinLlama and its accompanying 10-million-sentence dataset is expected to support wider research and innovation and help local languages thrive in the emerging AI era.

The 10-million-sentence dataset is publicly available: lnkd.in/gi43HaXg

More information: lnkd.in/gc3A4jUt