Large language models (LLMs) have made remarkable progress in machine reasoning in recent years, especially on tasks that require logical deduction. However, much of this research has focused on English-language models and datasets, and transferring these reasoning capabilities to other languages remains a significant challenge.
A promising method for improving multilingual reasoning is scaling thinking tokens, a form of test-time scaling. The idea is to boost the performance of English-centric LLMs in other languages by increasing the number of tokens the model spends on its reasoning process before producing an answer. Put simply, the model is given more "time" to think through complex problems, whatever language they are posed in.
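To make the idea concrete, here is a minimal sketch of one way to scale thinking tokens, in the spirit of s1-style budget forcing: if the model tries to stop reasoning before its token budget is spent, the end-of-thinking marker is stripped and a continuation cue is appended. The model name, chat template, and `</think>` marker are assumptions for illustration, not details drawn from the studies discussed here.

```python
# Minimal sketch of budget forcing: keep the model reasoning until a thinking-token
# budget is spent by replacing a premature end-of-thinking marker with "Wait".
# MODEL_NAME and END_OF_THINKING are placeholders, not values from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "simplescaling/s1-32B"   # placeholder: any reasoning-tuned causal LM
END_OF_THINKING = "</think>"          # assumed marker separating reasoning from the answer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto", device_map="auto")

def generate_with_thinking_budget(question: str, thinking_budget: int) -> str:
    """Answer `question`, forcing the model to spend roughly `thinking_budget` thinking tokens."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    spent = 0
    while spent < thinking_budget:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=thinking_budget - spent)
        new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
        if new_tokens == 0:            # safety guard against empty generations
            break
        spent += new_tokens
        new_text = tokenizer.decode(out[0, -new_tokens:], skip_special_tokens=True)
        if END_OF_THINKING in new_text:
            # The model tried to stop thinking early: cut the marker off and
            # nudge it to keep reasoning.
            new_text = new_text.split(END_OF_THINKING)[0] + "\nWait,"
        prompt += new_text
    # Budget spent: let the model close its reasoning and write the final answer.
    inputs = tokenizer(prompt + f"\n{END_OF_THINKING}\n", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Varying `thinking_budget` is then all it takes to study how multilingual accuracy changes with the amount of reasoning the model is allowed to do.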
Studies show that scaling thinking tokens, for example in the s1 family of models, can significantly improve multilingual mathematical reasoning. With a larger thinking-token budget, these models process multilingual inputs more effectively and reach more accurate conclusions.
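A sketch of how such a claim can be probed: an accuracy-versus-budget sweep over a multilingual math benchmark, reusing `generate_with_thinking_budget` from the sketch above. The dataset id, field names, language selection, and the "last number wins" answer extraction are assumptions made for illustration.

```python
# Hypothetical accuracy-vs-budget sweep on a multilingual math benchmark (MGSM-style).
import re
from datasets import load_dataset

BUDGETS = [512, 1024, 2048, 4096]
LANGUAGES = ["en", "de", "sw", "te"]   # mix of higher- and lower-resource languages

def extract_number(text: str):
    """Take the last number in the model output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None

for lang in LANGUAGES:
    data = load_dataset("juletxara/mgsm", lang, split="test[:50]")  # small slice for a quick probe
    for budget in BUDGETS:
        correct = sum(
            extract_number(generate_with_thinking_budget(ex["question"], budget))
            == float(ex["answer_number"])
            for ex in data
        )
        print(f"{lang} | {budget:>5} thinking tokens | accuracy {correct / len(data):.2%}")
```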
An interesting aspect of this research is the analysis of language mixing patterns during the reasoning process. It has been observed that LLMs tend to switch between different languages during reasoning, suggesting that they may be drawing on cross-lingual knowledge to solve the task at hand. Scaling thinking tokens influences these language mixing patterns and can lead to more effective use of cross-lingual knowledge.
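One rough way to quantify such language mixing is to split a reasoning trace into sentences and tag each with a detected language. The sketch below uses the `langdetect` package and a naive sentence splitter; this is a simplifying illustration, not the analysis pipeline used in the studies.

```python
# Rough heuristic for measuring language mixing in a chain-of-thought trace.
import re
from collections import Counter
from langdetect import detect                                  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

def language_mix(reasoning_trace: str) -> Counter:
    """Count how many reasoning sentences are written in each detected language."""
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", reasoning_trace):
        if len(sentence.strip()) < 10:                          # skip fragments and bare equations
            continue
        try:
            counts[detect(sentence)] += 1
        except LangDetectException:
            continue
    return counts

# Example: a made-up trace that starts in German and switches to English mid-reasoning.
trace = "Zuerst berechnen wir die Summe der beiden Zahlen. The sum is 42, so the answer is 42."
print(language_mix(trace))                                      # e.g. Counter({'de': 1, 'en': 1})
```

Running this over traces generated at different thinking budgets makes it possible to see whether larger budgets shift the balance between the prompt language and English.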
Furthermore, the choice of thinking frequency, i.e., the number of thinking tokens per reasoning step, plays an important role. A higher thinking frequency can yield more accurate processing of information, while a lower one reduces computational cost. The optimal thinking frequency depends on the complexity of the task and on the characteristics of the model at hand.
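One illustrative reading of "thinking frequency" is thinking tokens per reasoning step. The sketch below computes it for a trace, assuming steps are separated by blank lines and reusing the tokenizer from the first sketch; both the step definition and the metric are simplifications, not the paper's formal definition.

```python
# Illustrative measurement of thinking frequency = thinking tokens per reasoning step.
def thinking_frequency(reasoning_trace: str) -> float:
    steps = [s for s in reasoning_trace.split("\n\n") if s.strip()]   # blank lines delimit steps
    total_tokens = len(tokenizer.encode(reasoning_trace))             # tokenizer from the first sketch
    return total_tokens / max(len(steps), 1)

# Example: a two-step trace with ~30 tokens would report a frequency of ~15 tokens per step.
```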
Another important area of research is cross-domain generalization. Can an LLM trained on mathematical reasoning transfer its skills to other areas such as social sciences or cultural issues? Initial results suggest that scaling thinking tokens can also lead to improved performance here. By increasing thinking tokens, LLMs may be able to better grasp abstract concepts and apply them to different domains.
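A purely illustrative way to probe such transfer is to pose questions from different domains at a small and a large thinking budget and compare the outputs, again reusing `generate_with_thinking_budget` from the first sketch. The example domains and questions below are made up for illustration.

```python
# Hypothetical cross-domain probe at two thinking budgets.
domain_probes = {
    "math":           "A train covers 120 km in 1.5 hours. What is its average speed?",
    "social science": "Why might voter turnout increase when elections are held on weekends?",
    "culture":        "What role does the tea ceremony play in Japanese culture?",
}

for domain, question in domain_probes.items():
    for budget in (512, 4096):
        answer = generate_with_thinking_budget(question, thinking_budget=budget)
        print(f"[{domain} | budget {budget}] {answer[:200]}")
```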
Overall, scaling thinking tokens is a promising approach to improving multilingual reasoning in LLMs. Future research should focus on optimizing scaling strategies, investigating language mixing patterns, and improving cross-domain generalization. This work could contribute to the development of more powerful and versatile multilingual LLMs suitable for a wide range of applications.