November 25, 2024

TÜLU 3: Open-Source Post-Training Advances for Language Models


TÜLU 3: A New Standard in Post-Training Open Language Models

Post-training, the art of eliciting powerful behaviors from a pre-trained language model, has gone through many stages of development since the release of ChatGPT. Early post-training work followed a standard recipe established by models like InstructGPT: instruction tuning followed by preference tuning. Post-training remains a complex process, however. Teaching a model specialized skills, such as programming, can degrade other abilities, like writing poetry or following instructions. Finding the right mix of data and hyperparameters that lets a model acquire new knowledge and skills without losing its general abilities is a challenge.

Large model developers like OpenAI, Anthropic, Meta, and Google have increased the complexity of post-training approaches, relying on multiple training rounds, human and synthetic data, and multiple training algorithms and objectives. As a result, these models often have specialized knowledge and general capabilities. However, neither their training data nor their training recipes are transparent to users.

Until now, open-source post-training has lagged behind that of closed models. TÜLU 3, a family of open, state-of-the-art post-trained models released along with all data, data mixtures, recipes, code, infrastructure, and the evaluation framework, aims to change this. TÜLU 3 pushes the boundaries of post-training research and closes the performance gap between open and closed fine-tuning recipes. Closing this gap required new datasets and new training procedures, including methods for training directly on verifiable problems with reinforcement learning and for using a model's own generations to create powerful preference data.

TÜLU 3: More Than Just an Artifact

TÜLU 3 is not just a set of model artifacts but a comprehensive collection of data and tools designed to push the boundaries of open post-training. It is a modern, completely open-source post-training stack, with all the code and details needed to replicate the results:

Comprehensive guide to evaluation, decontamination, and recipe design
Scaled, new synthetic instruction datasets
Scaling preference data with on-policy generations
Reinforcement learning with verifiable rewards
A new method that uses RL without a reward model to improve specific skills

By openly sharing the data, recipes, and results, the community is empowered to explore new and innovative post-training approaches.
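The "on-policy generations" approach to preference data listed above can be sketched as follows: sample several completions from the model being trained, score them with a judge, and pair the best against the worst as chosen/rejected. The `generate` and `judge_score` stubs below are stand-ins for a real sampler and judge, not TÜLU 3's implementation.

```python
# Sketch of on-policy preference data: completions come from the policy
# itself, get scored, and the best/worst pair forms a preference example.
# `generate` and `judge_score` are illustrative stubs, not real components.
import random

def generate(prompt: str, n: int) -> list[str]:
    # Stand-in for sampling n completions from the current policy.
    return [f"{prompt} -> completion {i}" for i in range(n)]

def judge_score(prompt: str, completion: str) -> float:
    # Stand-in for an LLM judge or preference classifier.
    return random.random()

def build_preference_pair(prompt: str, n_samples: int = 4) -> dict:
    completions = generate(prompt, n_samples)
    ranked = sorted(completions, key=lambda c: judge_score(prompt, c))
    # Highest-scored completion becomes "chosen", lowest "rejected";
    # such pairs then feed a preference objective like DPO.
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

pair = build_preference_pair("Explain gravity.")
```

Sampling from the model being trained (rather than from unrelated models) keeps the preference data on-policy, so the preference objective corrects the model's own failure modes instead of someone else's.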

TÜLU 3: Post-Training for Everyone

With the TÜLU 3 models and recipes, anyone can now train an open model for their use case to the quality of leading closed models. Developers can adapt open models to their own data without losing core general capabilities by using TÜLU 3 data and recipes. TÜLU 3 releases several decontaminated datasets for training specific skills, such as knowledge retrieval, instruction following, logical reasoning, math, programming, and multilingual interaction. TÜLU 3's data can be combined with any skill-specific data, and the recipes help balance the datasets. The computational effort required is low. A family of model sizes has been released along with all checkpoints, so you can choose the desired model size and training stage and either use it immediately or continue training with your own data or the available mixtures.

Comparing language model evaluations is notoriously difficult: many small details affect the results, and these often cannot be reproduced by other developers. TÜLU 3 provides an evaluation framework that lets developers pin down all of these settings and easily reproduce every evaluation performed for TÜLU 3.

The Future of Post-Training

The clean separation of pre-training and post-training is blurring and being redrawn by entirely new model styles such as OpenAI's o1. Models like DeepSeek-R1-Lite mimic o1, and this style of using smarter ideas to train language models will continue, driven by a growing understanding of how much post-training and objectives like RL contribute to final model performance.

Post-training is becoming increasingly important for leading labs. Meta's post-training team consists of about 200 employees, and soon the broader AI ecosystem, including policymakers, will recognize its importance as well.
