The OpenGPT-X research project has released a large language model with seven billion parameters that supports all 24 official languages of the European Union. The model, named "Teuken-7B," is open source and available for download on the Hugging Face platform. Unlike most AI language models, which are primarily trained on English, Teuken-7B was developed from the ground up with a significantly higher proportion of non-English European languages. Approximately half of the training data comes from these languages. The developers emphasize that the model functions reliably in all 24 languages.
The development of Teuken-7B took place within the framework of the OpenGPT-X project, a consortium funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes IAIS and IIS. The project aims to create a European alternative to the dominant language models, most of which come from the United States. A central aspect is the preservation of data sovereignty: the open-source license allows companies and organizations to adapt the model and operate it locally, without having to transfer sensitive data to external providers.
A particular innovation of OpenGPT-X is the multilingual tokenizer. A tokenizer splits text into smaller units called tokens. The fewer tokens a text is split into, the more efficiently the language model can process it. The tokenizer developed by OpenGPT-X is specifically optimized for the 24 EU languages and enables significantly more efficient and cost-effective training than conventional tokenizers, which are primarily geared towards English. This is particularly relevant for European languages with complex word structures, such as German or Finnish.
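The effect can be illustrated with a toy example. The sketch below is not the actual OpenGPT-X tokenizer; it uses two small hand-made vocabularies and greedy longest-match tokenization to show why a vocabulary tuned to European languages splits the same German word into far fewer tokens than an English-centric one.

```python
# Toy illustration (not the real OpenGPT-X tokenizer): greedy
# longest-match tokenization over two hand-made vocabularies.

def tokenize(text: str, vocab: set) -> list:
    """Greedily match the longest vocabulary entry at each position;
    fall back to single characters for anything unknown."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

word = "unabhängigkeit"  # German for "independence"

# An English-centric vocabulary knows few German subwords ...
english_vocab = {"un", "ab", "ig", "keit"}
# ... while a multilingual one includes common German morphemes.
multilingual_vocab = {"unabhängig", "abhängig", "un", "keit"}

print(tokenize(word, english_vocab))       # many small pieces
print(tokenize(word, multilingual_vocab))  # just two pieces
```

Since training and inference cost scale with the number of tokens, a vocabulary that halves the token count for German or Finnish text roughly halves the compute needed to process it.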
The training of Teuken-7B took place on the JUWELS supercomputer at the Jülich Research Centre, one of the most powerful computers in Germany. The knowledge gained in the project is also flowing into the development of the European exascale supercomputer JUPITER, which is being built at the Jülich Research Centre and is expected to offer significantly higher computing power for the development of complex AI models starting next year.
Teuken-7B is available in two versions: one for purely research purposes and one under the Apache-2.0 license, which also permits commercial use and integration into proprietary applications. The performance of both versions is comparable. However, some of the datasets used for "instruction tuning" were excluded from the commercial version because their license terms do not permit commercial use. Instruction tuning serves to train the model to correctly interpret user instructions, which is important for chat applications, for example.
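To make the instruction-tuning step concrete, the sketch below shows how an instruction/response pair might be formatted into a single training string. The role markers and field names are hypothetical, not the actual template used for Teuken-7B; the point is only that the model is trained on paired instructions and answers rather than raw continuation text.

```python
# Hypothetical sketch: turning one instruction-tuning record into a
# training string. The "User:"/"Assistant:" template is illustrative,
# not the actual Teuken-7B chat format.

def format_example(instruction: str, response: str) -> str:
    """Join an instruction/response pair with role markers so the model
    learns to answer instructions instead of merely continuing text."""
    return f"User: {instruction}\nAssistant: {response}"

record = {
    "instruction": "Fasse den folgenden Text in einem Satz zusammen.",
    "response": "Teuken-7B ist ein offenes Sprachmodell für alle 24 EU-Sprachen.",
}

print(format_example(record["instruction"], record["response"]))
```

Because each dataset of such pairs carries its own license, any pair whose license forbids commercial use has to be dropped when tuning the Apache-2.0 version, which is why the two versions were tuned on slightly different data.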
To evaluate the performance of Teuken-7B, the OpenGPT-X team developed the "European LLM Leaderboard," a ranking that compares language models in almost all official EU languages based on various benchmarks. For this purpose, established benchmarks were translated into 20 EU languages. The leaderboard enables a comprehensive assessment of the multilingual capabilities of language models and thus goes beyond the usual tests, which are mostly limited to English.
The OpenGPT-X project runs until March 2025. During this time, further optimizations and evaluations of the model are planned. The consortium encourages developers and researchers to use, adapt, and further develop the model to strengthen the European AI landscape. Interested parties can exchange ideas and contact the development team via a dedicated Discord server.