The California-based AI company Sesame has released its base model CSM-1B (Conversational Speech Model) as open source under the Apache-2.0 license. This step allows for broad commercial use with minimal restrictions and marks another milestone in the development of freely accessible AI models for speech generation.
CSM-1B is a transformer-based model with roughly one billion parameters. It uses a two-stage architecture: a larger transformer backbone handles fundamental language processing, and a smaller decoder generates audio. Across the CSM model family, backbones range from 1 to 8 billion parameters and decoders from 100 to 300 million; the released CSM-1B sits at the small end of that range. The model represents speech with semantic tokens for linguistic content and phonetics, and with acoustic tokens for sound properties such as pitch and stress. This combination allows CSM-1B to reproduce nuances of human speech, such as micro-pauses, variations in emphasis, and even laughter, in the generated output.
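The two-stage split can be sketched in a few lines of Python. This is purely illustrative, not Sesame's implementation: the frame structure, vocabulary size, codebook count, and function names are all assumptions, and the real backbone and decoder are autoregressive transformers, not lookup functions.

```python
# Illustrative sketch of a two-stage speech model (hypothetical, not Sesame's code):
# a large "backbone" predicts one semantic token per audio frame, and a small
# "decoder" expands each semantic token into several residual acoustic tokens.

from dataclasses import dataclass
from typing import List

VOCAB = 1024        # assumed token vocabulary size
N_CODEBOOKS = 8     # assumed number of acoustic codebooks

@dataclass
class Frame:
    semantic: int        # linguistic content (what is said)
    acoustic: List[int]  # sound properties (pitch, stress, timbre)

def backbone(text_tokens: List[int]) -> List[int]:
    # Stand-in for the large transformer: one semantic token per input token.
    return [t % VOCAB for t in text_tokens]

def decoder(semantic: int) -> List[int]:
    # Stand-in for the small decoder: refine one semantic token into
    # one acoustic token per codebook.
    return [(semantic * (k + 1)) % VOCAB for k in range(N_CODEBOOKS)]

def generate(text_tokens: List[int]) -> List[Frame]:
    return [Frame(s, decoder(s)) for s in backbone(text_tokens)]

frames = generate([101, 202, 303])
print(len(frames), len(frames[0].acoustic))  # prints: 3 8
```

The point of the split is efficiency: only the small decoder has to run once per codebook, while the expensive backbone runs once per frame.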
The model was trained on one million hours of English audio over five epochs. CSM-1B processes sequences of up to 2,048 tokens (approximately two minutes of audio) in a single end-to-end architecture, modeling text and audio jointly rather than in the separate stages of traditional text-to-speech pipelines. Particularly noteworthy is the model's ability to clone a voice from as little as one minute of source audio.
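A back-of-the-envelope budget shows how a one-minute voice-cloning reference fits inside that context window. The token rate below is derived solely from the article's own figures (2,048 tokens for roughly two minutes of audio); the helper names are hypothetical and do not correspond to Sesame's API.

```python
# Rough context-window budget for a voice-cloning prompt (illustrative only).

MAX_CONTEXT = 2048               # model's maximum sequence length in tokens
TOKENS_PER_SECOND = 2048 / 120   # implied by "2,048 tokens ~ two minutes"

def tokens_for(seconds: float) -> int:
    """Approximate token cost of a span of audio."""
    return round(seconds * TOKENS_PER_SECOND)

def build_prompt(reference_seconds: float, text_tokens: int) -> dict:
    """Check that reference audio plus new text fits in the window."""
    ref = tokens_for(reference_seconds)
    used = ref + text_tokens
    if used > MAX_CONTEXT:
        raise ValueError("prompt exceeds the model's context window")
    return {"reference": ref, "text": text_tokens, "free": MAX_CONTEXT - used}

# One minute of reference audio (enough to clone a voice, per the article)
# consumes about half the window, leaving the rest for generated speech.
budget = build_prompt(reference_seconds=60, text_tokens=50)
print(budget)  # prints: {'reference': 1024, 'text': 50, 'free': 974}
```

In other words, a one-minute cloning reference uses roughly half the 2,048-token window, which is why longer references quickly crowd out room for newly generated audio.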
Sesame developed its own phonetic benchmarks to measure the model's performance. In blind listening tests on short conversational excerpts, participants rated speech generated by CSM-1B as indistinguishable from real recordings when heard in isolation. Once the surrounding conversational context was provided, however, they still preferred the human original.
The decision to release CSM-1B as open source follows the broader trend of making AI technology accessible to a wider audience. Sesame plans to continue releasing key components of its research under the Apache-2.0 license. In the coming months, the company intends to scale up both model size and training data and to extend support to more than 20 languages. A particular focus lies on integrating pre-trained language models and developing fully duplex systems that learn conversational dynamics such as turn-taking, pauses, and pacing directly from data.
The release of CSM-1B also raises questions about security and ethics. The model's ability to clone voices carries potential for misuse, such as creating deepfakes or conducting voice-phishing attacks. Sesame urges developers and users to deploy the model responsibly and to refrain from unauthorized voice cloning, the creation of misleading content, or other harmful activities. Responsibility for the ethical handling of the technology ultimately lies with its users.
The open-source release of CSM-1B by Sesame is a significant step in the development of AI-powered speech generation. The model offers impressive results and opens up new possibilities for applications in various fields. At the same time, the release underscores the need for responsible use and an open discussion about the ethical implications of this technology.
Sources:
- https://the-decoder.de/sesame-veroeffentlicht-ki-stimmengenerator-csm-1b-als-open-source/
- https://www.reddit.com/r/singularity/comments/1jb2pnk/sesame_open_sources_their_csm1b_voice_generation/
- https://the-decoder.com/sesame-releases-csm-1b-ai-voice-generator-as-open-source/
- https://github.com/isaiahbjork/csm-voice-cloning
- https://autogpt.net/sesame-releases-its-base-ai-model-and-its-open-source/
- https://www.youtube.com/watch?v=ULV6cXgnkAo
- https://github.com/SesameAILabs/csm
- https://techcrunch.com/2025/03/13/sesame-the-startup-behind-the-viral-virtual-assistant-maya-releases-its-base-ai-model/
- https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
- https://www.threads.net/@luokai/post/DHKNHzJzSFI/sesame-csms-open-source-conversational-speech-model-is-out-this-ones-a-1b-parame