Large language models (LLMs) can be viewed as the building blocks of GenAI, which harnesses deep learning techniques and large data sets to understand and generate content across speech, image and text formats. Apple's Siri, Amazon's Alexa and OpenAI's ChatGPT are all incarnations of GenAI.
Language is therefore central to how GenAI systems are built and used: prompting a model involves designing a set of instructions or cues the system can understand, so that it produces output that is contextually correct and relevant.
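As a loose illustration of this point, instruct-tuned models are typically queried through a fixed prompt template that wraps the user's instruction in markers the model was trained to expect. The template below is a hypothetical sketch, not SEA-LION's actual format:

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in a simple instruct-style template.

    The "### USER:" / "### RESPONSE:" markers here are illustrative;
    each instruct-tuned model defines its own template, which must be
    matched exactly at inference time for good results.
    """
    return f"### USER:\n{instruction}\n\n### RESPONSE:\n"

# Example: a Malay instruction asking for a translation into English.
prompt = build_prompt("Terjemahkan 'selamat pagi' ke dalam bahasa Inggeris.")
print(prompt)
```

Getting this framing right matters because a model tuned on one template often produces noticeably worse output when queried with another.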
According to AI Singapore (AISG), as GenAI systems are trained predominantly on information available from the internet, existing LLMs can exhibit strong bias towards certain languages, cultural values, societal viewpoints, and even the use of pronouns. This is because the training data is disproportionately influenced by Western, educated, industrialised, rich and democratic (WEIRD) societies.
That is why SEA-LION was developed. SEA-LION stands for Southeast Asian languages in one network, and refers to the family of LLMs that is specifically pre-trained and instruct-tuned for the Southeast Asian region.
Spearheaded by AISG, SEA-LION is built on the MPT architecture with a vocabulary of 256,000 tokens – considerably larger than that of most English-centric LLMs – which helps it represent Southeast Asian scripts more efficiently. It comes in two variants: one with 3 billion parameters and another with 7 billion.
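A rough illustration of why a large vocabulary matters for the region's scripts: in UTF-8, Thai characters take three bytes each, so a tokenizer with a small, byte-oriented vocabulary spends several tokens per character, while a larger vocabulary such as SEA-LION's 256K entries can assign whole syllables or words a single token. The snippet below only demonstrates the byte-count asymmetry, not SEA-LION's actual tokenizer:

```python
# Compare how many UTF-8 bytes a Thai phrase needs versus English.
# A byte-level tokenizer would need at least one token per byte.
thai = "สวัสดีครับ"   # "hello" (polite form)
english = "hello"

print(len(thai), "Thai characters ->", len(thai.encode("utf-8")), "UTF-8 bytes")
print(len(english), "English characters ->", len(english.encode("utf-8")), "UTF-8 bytes")
```

Longer token sequences mean slower inference and a shorter effective context window for the same text, which is why vocabulary design is a key lever for multilingual models.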
Southeast Asia is a diverse region with more than a thousand native languages spoken and written, and these languages are largely underrepresented in pre-training data used for current LLMs. The SEA-LION model will focus on the more commonly used languages in the region for now, such as Bahasa Indonesia, Malay, Thai and Vietnamese, and will be extended to include other regional languages including Burmese and Lao.
SEA-LION also addresses cultural nuances and expressions unique to the region. For example, Thais are accustomed to using “5555” to indicate laughter, while the use of “wkwkwk” to do so is common among Indonesians.
AISG is working with Amazon Web Services (AWS), Google Research and SEACrowd to train and deploy SEA-LION, with the aim of creating an open and dynamic LLM ecosystem.
The compact size of SEA-LION makes training and deployment more efficient and cost-effective compared with larger LLMs that can have hundreds of times more parameters. Smaller LLMs are purportedly easier, faster and cheaper to fine-tune and implement. Smaller file size, lower latency and shorter start-up time also enable smaller LLMs to run at the edge and on mobile devices.
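The cost advantage can be sketched with back-of-envelope arithmetic: the memory needed just to hold a model's weights is the parameter count times the bytes per parameter (activations and the KV cache add more on top, so these are lower bounds):

```python
def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory (GiB) to store the weights alone, ignoring activations."""
    return n_params * bytes_per_param / 1024**3

# SEA-LION's two sizes at 16-bit precision and at 4-bit (quantised).
for n in (3e9, 7e9):
    for bytes_per in (2, 0.5):  # fp16 = 2 bytes, 4-bit = 0.5 bytes
        print(f"{n / 1e9:.0f}B params @ {bytes_per * 8:.0f}-bit: "
              f"{param_memory_gb(n, bytes_per):.1f} GiB")
```

By this estimate, the 3-billion-parameter variant quantised to 4 bits fits comfortably within the memory of a modern phone, whereas a model hundreds of times larger cannot leave the data centre.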
Dr Leslie Teo, senior director of AI products at AISG, said SEA-LION represents a relatively inexpensive and efficient option for developers and enterprises to incorporate AI into their workflows, especially those that are cost-sensitive and throughput-constrained in Southeast Asia. The close collaborations with industry partners will help advance SEA-LION’s capabilities and accelerate its adoption by various organisations.
Elsie Tan, country manager of worldwide public sector in Singapore at AWS, said language and culture-specific LLMs like SEA-LION will enable smoother cross-cultural communication and understanding, preserve cultural nuances, and help governments and businesses better serve citizens and customers in Southeast Asia.
Organisations that are currently testing and deploying SEA-LION include NCS and Tokopedia, which are using it to enhance business operations and foster transformation capabilities.