Sesame AI: Crossing the Uncanny Valley of Conversational Voice

Sesame

Type: Website
Last Updated: 2025/10/06
Description: Sesame AI aims to achieve 'voice presence' in AI, making spoken interactions feel real and understood. Explore their Conversational Speech Model (CSM) for natural dialogue.
Tags: conversational speech, speech generation, multimodal AI, text-to-speech, AI companion

Overview of Sesame

What is Sesame AI? Sesame AI is dedicated to achieving "voice presence" in artificial intelligence, aiming to make spoken interactions feel real, understood, and valued. Their research focuses on creating conversational partners that engage in genuine dialogue, building confidence and trust over time.

How does Sesame AI work? Sesame AI introduces the Conversational Speech Model (CSM), which frames speech generation as an end-to-end multimodal learning task using transformers. CSM conditions on the history of the conversation to produce more natural and coherent speech.

Key Components:

  • Emotional intelligence: reading and responding to emotional contexts.
  • Conversational dynamics: natural timing, pauses, interruptions, and emphasis.
  • Contextual awareness: adjusting tone and style to match the situation.
  • Consistent personality: maintaining a coherent, reliable, and appropriate presence.

Technical Details of CSM:

  • CSM operates as a single-stage model, improving efficiency and expressivity.
  • It uses two autoregressive transformers, both based on the Llama architecture: a multimodal backbone and a separate audio decoder.
  • The backbone processes interleaved text and audio tokens to model the zeroth codebook.
  • The audio decoder uses a distinct linear head for each codebook and models the remaining codebooks to reconstruct speech from the backbone’s representations.
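The split between the two transformers can be illustrated with a toy sketch. Everything below is hypothetical: the functions `toy_backbone` and `toy_decoder_step` are simple stand-ins for the real transformers, the sizes are made up, and the sketch assumes the decoder fills in the codebooks that follow the zeroth one. It only shows the control flow of one frame: the backbone picks the zeroth code, then a separate linear head per remaining codebook picks the rest.

```python
import math
import random

NUM_CODEBOOKS = 4   # assumed number of codebooks; not given in the text
CODEBOOK_SIZE = 8   # assumed number of entries per codebook
HIDDEN = 6          # assumed hidden width

def make_linear(rng, n_in, n_out):
    """Random weights standing in for a trained linear head."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, vec):
    """Matrix-vector product: one logit per codebook entry."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def toy_backbone(tokens):
    """Hypothetical stand-in for the backbone over interleaved text/audio tokens."""
    total = sum((pos + 1) * tok for pos, tok in enumerate(tokens))
    return [math.sin(total * (i + 1) * 0.1) for i in range(HIDDEN)]

def toy_decoder_step(hidden, codes_so_far):
    """Hypothetical stand-in for the decoder: uses the backbone state and earlier codes."""
    shift = sum((level + 1) * code for level, code in enumerate(codes_so_far))
    return [math.tanh(h + 0.05 * shift * (i + 1)) for i, h in enumerate(hidden)]

def generate_frame(tokens, backbone_head, decoder_heads):
    """Produce one audio frame as a list of codes, one per codebook."""
    hidden = toy_backbone(tokens)
    # The backbone models the zeroth codebook.
    codes = [argmax(apply_linear(backbone_head, hidden))]
    # The decoder uses a distinct linear head for each later codebook.
    for head in decoder_heads:
        state = toy_decoder_step(hidden, codes)
        codes.append(argmax(apply_linear(head, state)))
    return codes

rng = random.Random(0)
backbone_head = make_linear(rng, HIDDEN, CODEBOOK_SIZE)
decoder_heads = [make_linear(rng, HIDDEN, CODEBOOK_SIZE)
                 for _ in range(NUM_CODEBOOKS - 1)]

history = [3, 1, 4, 1, 5]  # toy interleaved text/audio token ids
frame = generate_frame(history, backbone_head, decoder_heads)
print(frame)
```

In a real system the stand-in functions would be trained transformers and the codes would be sampled rather than chosen greedily; the sketch is only meant to make the division of labor concrete.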

Compute Amortization:

To ease the memory bottleneck during training, Sesame AI uses a compute amortization scheme that preserves the fidelity of the full residual vector quantization (RVQ) codebooks. The audio decoder is trained on only a random 1/16 subset of the audio frames, while the zeroth codebook is trained on every frame.
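The frame-subsampling idea can be sketched roughly as follows. This is an illustration under assumptions, not Sesame AI's training code: the helper names (`sample_decoder_frames`, `amortized_loss`), the toy per-frame losses, and the choice to add the two loss terms together are all hypothetical. Only the 1/16 fraction and the "every frame for the zeroth codebook" rule come from the text.

```python
import random

DECODER_FRACTION = 1 / 16  # fraction of frames used for the decoder (from the text)

def sample_decoder_frames(num_frames, rng, fraction=DECODER_FRACTION):
    """Pick a random subset of frame indices for the audio-decoder loss."""
    keep = max(1, round(num_frames * fraction))
    return sorted(rng.sample(range(num_frames), keep))

def amortized_loss(backbone_losses, decoder_loss_fn, rng):
    """Combine a per-frame zeroth-codebook loss with a subsampled decoder loss.

    backbone_losses: one zeroth-codebook loss per frame (all frames are used).
    decoder_loss_fn: computes the decoder loss for a single frame index; it is
                     only called for the sampled frames, which is where the
                     memory saving would come from.
    """
    num_frames = len(backbone_losses)
    backbone_term = sum(backbone_losses) / num_frames
    chosen = sample_decoder_frames(num_frames, rng)
    decoder_term = sum(decoder_loss_fn(t) for t in chosen) / len(chosen)
    # Assumption: the two terms are simply added.
    return backbone_term + decoder_term, chosen

rng = random.Random(0)
num_frames = 64
backbone_losses = [1.0 + 0.01 * t for t in range(num_frames)]  # toy values

decoder_calls = []
def toy_decoder_loss(t):
    decoder_calls.append(t)  # record which frames the decoder actually sees
    return 2.0               # toy constant loss

loss, chosen = amortized_loss(backbone_losses, toy_decoder_loss, rng)
print(len(chosen), "of", num_frames, "frames used for the decoder loss")
```

With 64 frames, the decoder loss is evaluated on 4 of them while the zeroth-codebook loss still averages over all 64.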

Experiments and Results:

Sesame AI trained three model sizes (Tiny, Small, and Medium) on a large dataset of publicly available audio. Evaluation included objective metrics such as Word Error Rate (WER) and Speaker Similarity (SIM), as well as new phonetic transcription-based benchmarks for homograph disambiguation and pronunciation consistency.
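For readers unfamiliar with the two objective metrics, here is a minimal sketch of how they are commonly computed. WER is the word-level edit distance between a reference transcript and a transcript of the generated audio, divided by the number of reference words. Speaker similarity is shown here as cosine similarity between two speaker-embedding vectors, which is a common convention and an assumption on our part; the text does not say how SIM was computed, and the embeddings below are made-up numbers.

```python
import math

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])  # hypothetical embeddings
print(f"WER: {wer:.3f}  SIM: {sim:.3f}")
```

Lower WER and higher SIM are better; in the example, one substituted word out of six reference words gives a WER of 1/6.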

Subjective evaluation, based on Comparative Mean Opinion Score (CMOS) studies on the Expresso dataset, indicated that naturalness is saturated but that a gap remains between generated and human prosody in conversational speech.
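A CMOS study asks listeners to compare two samples and rate which one they prefer; the score is the average of those ratings. The sketch below assumes a seven-point scale from -3 to +3 where positive values favor the generated sample, and uses a normal-approximation 95% interval. The scale, the sign convention, and the ratings themselves are assumptions for illustration and are not taken from the study.

```python
import math
import statistics

def cmos(ratings):
    """Return the mean comparative rating and a 95% confidence half-width."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical listener ratings on an assumed -3..+3 scale.
ratings = [1, 0, -1, 2, 1, 0, 1, 2]
score, half_width = cmos(ratings)
print(f"CMOS: {score:+.2f} (95% CI half-width {half_width:.2f})")
```

A score near zero with an interval that spans zero would mean listeners show no clear preference, which is one way to read a claim such as "naturalness is saturated".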

Why choose Sesame AI? Sesame AI's approach offers a promising path toward more natural and engaging AI conversations. By focusing on emotional intelligence, contextual awareness, and conversational dynamics, Sesame AI aims to create digital companions that truly understand and respond to human needs.

How to use Sesame AI? Try the conversational speech preview on the Sesame AI website to experience the potential of their approach. The models will be available under an Apache 2.0 license.

Who is Sesame AI for? Sesame AI is for researchers, developers, and anyone interested in advancing the field of conversational AI. Their work has applications in various areas, including:

  • AI assistants
  • Customer service
  • Education
  • Entertainment

Open-Sourcing and Future Work:

Sesame AI is committed to open-sourcing key components of their research, enabling the community to experiment, build upon, and improve their approach. Future work includes scaling up model size, increasing dataset volume, expanding language support, and exploring ways to utilize pre-trained language models.

Best Alternative Tools to "Sesame"

Valossa
Valossa is an AI-powered video analysis platform that converts video to text, enabling search, caption generation, and highlight clipping. It automates video workflows, saving time and resources.
Tags: video transcription

mistral.rs
mistral.rs is a blazingly fast LLM inference engine written in Rust, supporting multimodal workflows and quantization. Offers Rust, Python, and OpenAI-compatible HTTP server APIs.
Tags: LLM inference engine, Rust

ChatGPT
ChatGPT is OpenAI's conversational AI system that helps with writing, learning, brainstorming, and productivity through natural language interactions.
Tags: conversational AI, writing assistant

DaveAI
DaveAI is a Conversational Experience Cloud using AI agents, avatars, and visualizations to personalize customer journeys and boost engagement across web, kiosks, WhatsApp, and edge deployments.
Tags: Conversational AI, AI Agents

Twinning
Twinning empowers influencers to create personalized AI twins for fan chats via text and audio. Join the waitlist, record a short audio, and start monetizing interactions with no monthly fees.
Tags: AI twin creation, voice cloning

Pal Chat
Discover Pal Chat, the lightweight yet powerful AI chat client for iOS. Access GPT-4o, Claude 3.5, and more models with full privacy (no data collected). Generate images, edit prompts, and enjoy seamless AI interactions on your iPhone or iPad.
Tags: multi-model AI chat, image generation

Famulor
Famulor is a leading AI phone assistant that automates your business calls with human-like, intelligent AI agents available 24/7. GDPR compliant and hosted in the EU.
Tags: AI call center, virtual assistant

Orga AI
Orga AI is a conversational and multimodal AI platform for businesses, enhancing customer service and boosting productivity with human-like interactions.
Tags: conversational AI, multimodal agents

Bearly AI
Bearly AI is a private AI chat platform that offers the power of ChatGPT with complete privacy protection. Works with OpenAI, Anthropic, Gemini, and Grok.
Tags: AI Chat, Privacy, Security

InstaLM
InstaLM: Chat with Claude, GPT, Gemini & more directly on your macOS & iOS device. Enjoy voice interaction, file attachments & custom assistants with a privacy-first design.
Tags: AI chat app, AI assistant

Google Gemini
Google Gemini is a multimodal AI assistant that integrates with Google's ecosystem to provide advanced writing assistance, planning, brainstorming, and productivity tools through text, voice, and visual interactions.
Tags: multimodal AI, Google assistant

AI Voice Generator
AI Voice Generator is a tool that transforms text into natural-sounding voices. It offers voice cloning, text-to-speech, sound effects, and dialogue generation, trusted by over 10,000 creators.
Tags: text to speech, voice cloning

Cognigy
Cognigy provides generative AI-powered customer service agents for enterprises, offering voice and chat solutions with high automation rates and real-time translation capabilities.
Tags: conversational AI

Fotol AI
Fotol AI provides a gateway to AGI, offering powerful AI solutions for video, image, speech, music, 3D asset generation, and conversation. Dream it, make it!
Tags: AI video, AI image, AI music