Large Language Models in 2025 and Beyond

By Jon Atle Gulla, Professor, Director of NorwAI

NorwAI's Norwegian large language models were launched in May 2024 and have since been downloaded and tested by organizations across the country. The feedback has been largely positive and encouraging, but several users have commented that the models cannot, in general, replace the large international models. Replacing them was, of course, never the intention when we started training Norwegian models. Some have also questioned which training data we used and how we balanced the various sources. This debate is both important and difficult, and we carry it forward in our plans for 2025.

The MIMIR project, a collaborative project between the National Library, the LTG group at the University of Oslo, and NorwAI, investigated the importance of high-quality Norwegian textual data for the training of Norwegian language models. The resulting models compared well with the international ones, but perhaps the most interesting and promising aspect was that the collaboration between the three parties worked so well.

A national center

Subsequently, the three partners have submitted an application to establish a national center for language modeling, and the National Library has gradually taken on a large and important role in managing Norwegian text for training large language models. The major players in language models in Norway are now coming together to ensure that we have access to good Norwegian language models in which copyright issues are properly handled and which can be run locally and used in critical Norwegian applications. We hope that this center will be funded, and we expect that many of the Norwegian generic models will be trained there going forward.

Our new focus

Jon Atle Gulla speaking on stage — NorwAI Director Jon Atle Gulla points out four exciting sectors in which NorLLM will develop domain-adapted language models.

For NorwAI, this means that we can focus more on adapting language models and using the models in new innovative solutions. Specifically, we are now planning the following development paths for 2025:

A new generic language model of moderate size. We have seen that some of the training data in the published models were not as free from copyrighted content as we had hoped. We therefore want to train and publish a new model with a clean dataset and even more quality content from media houses. We have not yet decided which architecture or size the model will be based on, but we recognize that our partners and external parties find it useful if the model is small enough to run locally.
At the same time, we are planning several domain-adapted language models. This requires close collaboration with domain experts, both in collecting training data and in evaluating the models in interesting applications. Along the way, we will define a methodology that can later be used in domains and applications other than those we have identified so far:
- Health: We have for quite some time been in dialogue with various actors in the health sector to train language models for specific health applications. Experiments in the sector have shown that international language models do not sufficiently understand the professional terminology used. Our work is well underway, and we plan to have the first fine-tuned model ready by summer for further adaptation and testing.
- Construction: In the construction industry, legislation and standards are important and vary somewhat from country to country. There is now a large project on project management in the construction industry, and new projects are planned to fine-tune construction-specific models that can help the industry comply with regulations and adhere to relevant standards in contracts and project documentation.
- Finance: NorwAI has long considered creating its own finance language models. New resources will now be hired to follow up on this, and we will begin the work of mapping needs and planning the training itself in 2025.
- Ocean and Marine Sector: This sector is very important for Norwegian society, and we are involved in several center applications that include training and fine-tuning language models for managing ocean data and supporting maritime activities. It is still somewhat unclear what can be initiated in 2025, but the goal is to establish good cooperation with the sector and lay out a plan for how multimodal models can be trained and utilized in a domain where the data are extremely complex, fragmented, and difficult to collect.

Flow chart — The four steps for domain-adapted LLMs

As the figure above shows, continual pre-training and fine-tuning with instruction data are central to the work ahead. Alignment will only be addressed to a limited extent by NorwAI, as this requires resources that are currently not available to the center.

An API test period

Several smaller companies and public agencies have requested easier access to the models than they get through Hugging Face. They are asking for APIs and encouraging us to host the models for those who, for various reasons, cannot run them locally. This is something we have deliberately avoided and considered outside NorwAI's scope, even though it makes it difficult for small and resource-poor organizations to use the models. We are now starting discussions with our partners to see if we can find a solution that makes the models available via an API during a pilot period. NorwAI has neither the mandate nor the resources to permanently offer language models via APIs to Norwegian society, but it is interesting to explore both the technical possibilities and the market needs, so that permanent hosting services may be developed in collaboration with other actors at a later stage.

An audience listening to a talk from a stage — Wrapping up the NorwAI Innovate conference, Professor Jon Atle Gulla also outlined the next steps.

2025-04-04