Introduction
In December 2024, the Portuguese government made waves by announcing a €5.5 million investment in AMÁLIA: a large-scale Language Model (LLM) for European Portuguese. This project could redefine the future of natural language processing (NLP) for this language. So, what is AMÁLIA, and why is it such a crucial project?
What is AMÁLIA?
AMÁLIA is a language model specifically designed to treat European Portuguese as a first-class citizen in the LLM space. The project emerged from a collaboration between several top-tier Portuguese universities and research labs, including NOVA, IST, IT, and FCT.
Contrary to expectations, AMÁLIA was not trained from scratch. It is a continuation of the pre-training phase of the EuroLLM model, with slight modifications in context length and RoPE scaling.
Focus on European Portuguese
The core of the AMÁLIA project lies in increasing the share of European Portuguese data at every training stage. During pre-training, Arquivo.pt data was used. During the supervised fine-tuning (SFT), Portuguese data was synthetically generated, and during preference training, some data from the SFT phase was subsampled.
Measuring Effectiveness
To assess the model's quality, the team created four new benchmarks specific to European Portuguese, the most notable being ALBA. These benchmarks are crucial for understanding whether the model can meet the expectations and specific needs of Lusophone users.
Open Source: Limited Transparency
While the goal is to have an open-source model, AMÁLIA is not yet fully open. To date, the model weights, data, training logs, and new benchmarks are not entirely publicly accessible. This stands in stark contrast to Olmo, a benchmark for openness.
Potential Impact and Challenges
The success of AMÁLIA could have significant implications for research and technological development in European Portuguese, offering more precise and tailored NLP solutions. However, the project must overcome challenges in openness and collaboration to maximize its impact.
Conclusion
AMÁLIA represents a significant advance for European Portuguese in the world of LLMs. Navigating between technical challenges and aspirations for openness, this project could redefine the standards of natural language processing for minority languages. Let's discuss your project in 15 minutes.