Long-context language models (LCLMs) open up new possibilities in AI-powered text processing. From summarizing extensive documents to learning new tasks on the fly, they promise to fundamentally change the way we interact with language. Recent advancements allow for context windows of up to millions of tokens, in contrast to the previous limitations of 2,000 to 8,000 tokens in models like ChatGPT or Llama-2/3.
However, evaluating these powerful models presents a challenge. Existing benchmarks are often unsuitable for LCLMs with their significantly longer context windows. Perplexity and synthetic tasks such as "Needle-in-a-Haystack" (NIAH) have become established metrics, but they often do not correlate with real-world performance. Because model developers evaluate on different datasets, models are also hard to compare directly, and existing benchmarks often deliver contradictory results that reveal little about the strengths and weaknesses of individual models.
HELMET (Holistically Evaluating Long-context Language Models) addresses these challenges and offers a comprehensive benchmark for evaluating LCLMs. The focus is on diversity, controllability, and reliability of the evaluation. HELMET's analysis of 59 current LCLMs highlights the need for evaluation across various use cases to understand the models' capabilities. The results also show that even state-of-the-art LCLMs still reach their limits with complex tasks.
The common practice of evaluating LCLMs with perplexity or synthetic tasks reaches its limits. Studies show that perplexity correlates only weakly with downstream performance, and even simple synthetic tasks such as NIAH do not adequately reflect real-world behavior. More complex synthetic tasks correlate better, but still do not cover the full spectrum of LCLM capabilities.
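To make the NIAH setup concrete, here is a minimal sketch of how such a synthetic test is typically constructed: a single "needle" fact is buried at a chosen depth inside filler text, and the model is scored by exact string matching. The filler sentence, needle, and prompt wording below are illustrative assumptions, not HELMET's actual implementation.

```python
# Minimal sketch of a Needle-in-a-Haystack (NIAH) test case.
# Filler text, needle, and question are illustrative placeholders.

def build_niah_prompt(context_tokens: int, needle_position: float) -> str:
    """Embed a 'needle' fact at a relative position inside filler text."""
    filler_sentence = "The quick brown fox jumps over the lazy dog. "
    needle = "The secret passcode for the vault is 48151623. "

    # Roughly approximate the target context length with repeated filler
    # (assuming ~10 tokens per filler sentence).
    n_repeats = context_tokens // 10
    haystack = [filler_sentence] * n_repeats

    # Insert the needle at the requested relative depth (0.0 = start, 1.0 = end).
    insert_at = int(len(haystack) * needle_position)
    haystack.insert(insert_at, needle)

    question = "What is the secret passcode for the vault? Answer with the number only."
    return "".join(haystack) + "\n\n" + question


def score_niah(model_answer: str) -> bool:
    """NIAH is scored by simple string matching on the needle's payload."""
    return "48151623" in model_answer


prompt = build_niah_prompt(context_tokens=8000, needle_position=0.5)
```

Because the task reduces to copying a single fact out of irrelevant filler, strong scores here say little about reasoning over genuinely long, information-dense inputs, which is exactly the criticism raised above.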
Existing benchmarks built on realistic applications, such as ZeroScrolls, LongBench, and InfiniteBench, also have limitations: they often cover too few task types, are limited to specific domains, use input lengths too short to challenge current LCLMs, and rely on unreliable metrics such as ROUGE, which correlate poorly with human judgments.
HELMET was developed to close these gaps and enable a comprehensive evaluation of LCLMs. The benchmark includes diverse tasks such as retrieval-augmented generation, generation with citations, and summarization. The selected datasets feature naturally long contexts that reflect real-world applications, and scoring relies on model-based evaluation validated by human studies.
HELMET allows for control of text length and complexity. The input length can be varied by the number of retrieved passages, the number of demonstrations, or the length of the input document. This allows for a targeted investigation of model performance at different context lengths.
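As a rough illustration of this kind of length control, the sketch below builds a RAG-style prompt whose total length is scaled by the number of retrieved passages. The passage contents and prompt template are made-up placeholders, not HELMET's actual data or templates.

```python
# Sketch of controlling input length via the number of retrieved passages.
# Passage contents and the prompt template are illustrative assumptions.

def build_rag_prompt(question: str, passages: list[str], n_passages: int) -> str:
    """Build a RAG-style prompt whose length is controlled by n_passages."""
    selected = passages[:n_passages]  # more passages -> longer context
    context = "\n\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(selected)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Example: sweep the context length by varying the passage count.
dummy_passages = [f"Background fact number {i}." for i in range(1000)]
for k in (10, 100, 500, 1000):
    prompt = build_rag_prompt("Who proposed HELMET?", dummy_passages, n_passages=k)
    print(k, len(prompt))  # prompts grow as k grows
```

The same idea applies to the other knobs mentioned above: adding more in-context demonstrations or feeding a longer source document stretches the input in a controlled way, so performance can be compared at matched context lengths.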
Instead of n-gram-based metrics like ROUGE, HELMET uses model-based evaluations that differentiate more clearly between models and across input lengths. Human studies confirm that these metrics agree closely with human judgments.
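The following is a minimal sketch of what such a model-based (LLM-as-judge) evaluation can look like. The judge prompt, the 1-5 scale, and the `call_llm` stand-in are assumptions for illustration only; HELMET's actual judge prompts and scoring rubrics differ.

```python
# Sketch of a model-based (LLM-as-judge) evaluation replacing ROUGE.
# Judge prompt, scale, and the call_llm stand-in are illustrative assumptions.
import re


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a judge model (e.g., via an API client)."""
    raise NotImplementedError("Plug in your LLM client here.")


def judge_summary(source: str, summary: str) -> int:
    """Ask a judge model to rate a summary's faithfulness and coverage from 1 to 5."""
    judge_prompt = (
        "You are grading a summary of a long document.\n"
        f"Document:\n{source}\n\nSummary:\n{summary}\n\n"
        "Rate the summary's faithfulness and coverage on a scale of 1-5. "
        "Reply with the number only."
    )
    reply = call_llm(judge_prompt)
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1
```

Unlike n-gram overlap, a judge model can credit a faithful summary that uses different wording than the reference, which is why such metrics separate models more cleanly at long input lengths.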
HELMET supports both base models and fine-tuned models. In-context learning examples improve the performance of base models and make the evaluation more realistic.
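To illustrate why demonstrations matter for base models, here is a small sketch of prepending input/output examples before the test query so that a non-instruction-tuned model can infer the expected format. The demonstration template and example data are illustrative assumptions, not HELMET's exact prompts.

```python
# Sketch of prepending in-context demonstrations for base (non-instruction-tuned)
# models. The template and example data are illustrative assumptions.

def build_prompt_with_demos(demos: list[tuple[str, str]], query: str) -> str:
    """Prepend input/output demonstrations before the actual test query."""
    demo_block = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{demo_block}\n\nInput: {query}\nOutput:"


demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I wanted my two hours back.", "negative"),
]
prompt = build_prompt_with_demos(demos, "A stunning, heartfelt performance.")
```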
The HELMET experiments cover 59 LCLMs, including leading proprietary and open-source models with different architectures and position-extrapolation techniques. The results show that diversified evaluation is essential to comprehensively assess the capabilities of LCLMs: model performance varies greatly by task type, and scores in one category do not reliably predict scores in another.
The performance of the models decreases with increasing text length and task complexity. Even the most advanced models show a significant drop in performance with complex tasks. Open-source models lag behind proprietary models, especially in demanding tasks. There is no clear winner across all categories, which underscores the need for evaluation across different dimensions.
HELMET offers a comprehensive evaluation platform for LCLMs and facilitates comparison with existing models. The results of the 59 evaluated models serve as a reference point for future developments. For fast iteration during model development, the recall and RAG tasks are recommended, as they offer a good balance between evaluation speed and correlation with the other, more realistic tasks.
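A hypothetical development loop along these lines is sketched below: run only the cheap recall and RAG subsets while iterating, and the full suite for final reporting. The task names and the `evaluate_task` stub are placeholders, not HELMET's actual API or configuration names.

```python
# Hypothetical development-loop sketch: evaluate the fast recall/RAG subset
# during iteration, the full task list for final results. Task names and the
# evaluate_task() stub are placeholders, not HELMET's actual API.
from statistics import mean

FAST_TASKS = ("synthetic_recall", "rag")                               # cheap, well-correlated proxies
FULL_TASKS = FAST_TASKS + ("rerank", "cite", "longqa", "summ", "icl")  # illustrative names


def evaluate_task(model, task: str) -> float:
    """Placeholder: return the model's score on one task category."""
    raise NotImplementedError("Plug in the actual benchmark runner here.")


def quick_eval(model, full: bool = False) -> float:
    """Run the fast subset during development; run everything for final reporting."""
    tasks = FULL_TASKS if full else FAST_TASKS
    return mean(evaluate_task(model, t) for t in tasks)
```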