Voice communication is our most intimate medium as humans, carrying meaning not just through words but through countless variations in tone, pitch, rhythm, and emotion. Unfortunately, today’s voice assistants remain trapped in an ‘uncanny valley’ of flat, emotionless speech that quickly becomes exhausting once the initial novelty wears off.

The team at Sesame, led by Brendan Iribe and Ankit Kumar, has been tackling this challenge by developing what they call ‘voice presence’: the quality that makes spoken interactions feel genuine, understood, and valued. Their goal isn’t just processing requests but creating true conversational partners that build trust over time. Their approach centers on four key components:

1. Emotional intelligence – reading and responding appropriately to emotional contexts
2. Conversational dynamics – natural timing, pauses, and emphasis
3. Contextual awareness – adjusting tone and style to match the situation
4. Consistent personality – maintaining a coherent and reliable presence

To address these needs, they’ve developed the Conversational Speech Model (CSM), which frames speech generation as an end-to-end multimodal learning task using transformers. Unlike traditional text-to-speech models, CSM leverages the history of the conversation to produce more natural and coherent speech.

Traditional speech generation approaches suffer from what researchers call the ‘one-to-many problem’: there are countless valid ways to speak a sentence, but only some fit a given context. Without considering tone, rhythm, and conversation history, models lack the information to make the right choice. CSM tackles this through a novel technical approach:

– It operates as a single-stage model for improved efficiency and expressivity
– It uses a multimodal, text-and-speech architecture that operates directly on both text and audio tokens
– It splits processing between two transformers: a multimodal backbone for semantic processing and an audio decoder for acoustic details (a minimal sketch of this split appears later in this post)

This approach helps overcome the limitations of previous methods, which either relied on a bottleneck of semantic tokens that couldn’t fully capture prosody or required multiple sequential stages that introduced latency.

The results are promising. In subjective evaluations where listeners compared CSM-generated speech to human recordings without conversational context, evaluators showed no clear preference, suggesting the model’s speech sounds remarkably natural. When context was included, however, humans still preferred the original recordings, indicating that while CSM has advanced significantly, crossing the uncanny valley of conversational voice remains a work in progress.

What’s particularly exciting about Sesame’s work is their commitment to open-sourcing key components under an Apache 2.0 license, enabling collaborative improvement. The team acknowledges several current limitations, including primarily English-language training data and a lack of integration with pre-trained language models. Their roadmap includes scaling up model size, expanding support to over 20 languages, and exploring ways to leverage pre-trained language models. The ultimate goal is moving toward fully duplex models that implicitly learn conversation dynamics from data, a development that will require fundamental changes across the entire AI stack.

This research represents a significant step toward more natural, emotionally intelligent voice assistants.
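For readers who want a more concrete picture of the backbone/decoder split described above, here is a minimal PyTorch sketch of the idea: a multimodal backbone attends over text and audio tokens from the conversation history, and a smaller audio decoder turns each backbone state into per-codebook logits for codec audio tokens. This is not Sesame’s released code; the class name, layer counts, vocabulary sizes, and codebook count are illustrative assumptions, and the real model interleaves tokens frame by frame rather than simply concatenating the two streams.

```python
import torch
import torch.nn as nn


class ConversationalSpeechSketch(nn.Module):
    """Toy two-transformer layout: a semantic backbone over combined
    text/audio tokens, followed by a small audio decoder that emits
    per-codebook logits for the next frame of audio tokens."""

    def __init__(self, text_vocab=32_000, audio_vocab=1024, n_codebooks=8,
                 d_model=512, n_heads=8, backbone_layers=6, decoder_layers=2):
        super().__init__()
        # Text and audio tokens share one embedding space so a single
        # backbone can attend over the whole conversation history.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.audio_embed = nn.Embedding(audio_vocab, d_model)

        def stack(n_layers):
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, n_layers)

        self.backbone = stack(backbone_layers)      # semantic processing
        self.audio_decoder = stack(decoder_layers)  # acoustic details

        # One projection head per codec codebook.
        self.codebook_heads = nn.ModuleList(
            nn.Linear(d_model, audio_vocab) for _ in range(n_codebooks))

    def forward(self, text_tokens, audio_tokens):
        # Concatenate embedded text and audio history into one causally
        # masked sequence (a stand-in for true per-frame interleaving).
        x = torch.cat([self.text_embed(text_tokens),
                       self.audio_embed(audio_tokens)], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

        h = self.backbone(x, mask=causal)       # conversation-level semantics
        h = self.audio_decoder(h, mask=causal)  # refine into acoustic detail
        return [head(h) for head in self.codebook_heads]


if __name__ == "__main__":
    model = ConversationalSpeechSketch()
    text = torch.randint(0, 32_000, (1, 16))   # tokenized conversation text
    audio = torch.randint(0, 1024, (1, 32))    # codec tokens from prior audio
    logits = model(text, audio)
    print(len(logits), logits[0].shape)        # 8 heads, each (1, 48, 1024)
```

The intuition behind the split is that the heavier backbone only has to model conversation-level semantics, while the lightweight decoder handles fine-grained acoustic detail, which is roughly how a single-stage model can stay both expressive and responsive.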
By addressing the uncanny valley of conversational voice, Sesame is helping create AI companions that can engage in genuinely meaningful dialogue rather than just responding to commands with synthetic-sounding speech. For those interested in the technical details, I recommend checking out the Sesame GitHub, where they’ll be releasing their models and code. If you’re passionate about voice technology, this is definitely a project worth following.