Sesame AI: Crossing the Uncanny Valley of Conversational Voice

Sesame

Type:
Website
Last Updated:
2025/10/06
Description:
Sesame AI aims to achieve 'voice presence' in AI, making spoken interactions feel real and understood. Explore their Conversational Speech Model (CSM) for natural dialogue.
Tags:
conversational speech
speech generation
multimodal AI
text-to-speech
AI companion

Overview of Sesame

What is Sesame AI? Sesame AI is dedicated to achieving "voice presence" in artificial intelligence, aiming to make spoken interactions feel real, understood, and valued. Their research focuses on creating conversational partners that engage in genuine dialogue, building confidence and trust over time.

How does Sesame AI work? Sesame AI introduces the Conversational Speech Model (CSM), which frames speech generation as an end-to-end multimodal learning task using transformers. CSM conditions on the history of the conversation to produce more natural and coherent speech.
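To make "the history of the conversation" concrete, here is a minimal sketch of how earlier turns could be flattened into one interleaved text and audio sequence. The `Turn` class, the `build_interleaved_context` function, and the `[speaker]text` tag format are illustrative assumptions, not a documented Sesame interface.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One conversational turn (hypothetical structure for illustration)."""
    speaker: int
    text: str
    # One entry per audio frame; each entry stands in for that frame's codes.
    audio_frames: list = field(default_factory=list)


def build_interleaved_context(history, next_speaker, next_text):
    """Flatten past turns into a text/audio sequence, then append the new text.

    Each past turn contributes its text followed by its audio frames, so the
    model sees what was said and how it sounded before generating new audio.
    """
    sequence = []
    for turn in history:
        sequence.append(("text", f"[{turn.speaker}]{turn.text}"))
        sequence.extend(("audio", frame) for frame in turn.audio_frames)
    # The turn to be spoken has text only; its audio is what gets generated.
    sequence.append(("text", f"[{next_speaker}]{next_text}"))
    return sequence


history = [
    Turn(0, "How was the trip?", audio_frames=[[3, 1], [7, 2]]),
    Turn(1, "Long, but worth it.", audio_frames=[[5, 0]]),
]
context = build_interleaved_context(history, 0, "Tell me more.")
```

Because earlier audio stays in the sequence, a model conditioned on it can in principle match tone and pacing rather than speaking each sentence in isolation.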

Key Components:

  • Emotional intelligence: reading and responding to emotional contexts.
  • Conversational dynamics: natural timing, pauses, interruptions, and emphasis.
  • Contextual awareness: adjusting tone and style to match the situation.
  • Consistent personality: maintaining a coherent, reliable, and appropriate presence.

Technical Details of CSM:

  • CSM operates as a single-stage model, improving efficiency and expressivity.
  • It uses two autoregressive transformers based on the Llama architecture: a multimodal backbone and a smaller audio decoder.
  • The backbone processes interleaved text and audio to model the zeroth codebook.
  • The audio decoder uses a distinct linear head for each codebook and models the remaining codebooks to reconstruct speech from the backbone’s representations.
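The split described in the list above can be sketched with a toy model: one head predicts the zeroth codebook, and a separate head per remaining codebook predicts the rest. Everything here is a stand-in under stated assumptions: the weights are random, the sizes are tiny, decoding is greedy, and `ToyCSMFrameGenerator` and its parameters are hypothetical names rather than Sesame's implementation.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(matrix, vector):
    """Matrix-vector product: one logit per row of the matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


class ToyCSMFrameGenerator:
    """Toy stand-in for the backbone/decoder split (assumed sizes)."""

    def __init__(self, num_codebooks=4, codebook_size=8, hidden=6, seed=0):
        rng = random.Random(seed)
        # Backbone head: predicts only the zeroth codebook.
        self.backbone_head = make_matrix(codebook_size, hidden, rng)
        # Decoder: one distinct linear head per remaining codebook.
        self.decoder_heads = [
            make_matrix(codebook_size, hidden, rng)
            for _ in range(num_codebooks - 1)
        ]
        # Code embeddings let each prediction depend on earlier codes.
        self.embeddings = [
            make_matrix(codebook_size, hidden, rng)
            for _ in range(num_codebooks)
        ]

    def generate_frame(self, backbone_state):
        """Return one audio frame as a list of code indices, one per codebook."""
        code0 = argmax(linear(self.backbone_head, backbone_state))
        codes = [code0]
        # Decoder state starts from the backbone representation plus code 0.
        state = [h + e for h, e in zip(backbone_state, self.embeddings[0][code0])]
        for level, head in enumerate(self.decoder_heads, start=1):
            code = argmax(linear(head, state))
            codes.append(code)
            state = [s + e for s, e in zip(state, self.embeddings[level][code])]
        return codes


generator = ToyCSMFrameGenerator()
frame = generator.generate_frame([0.1] * 6)  # hypothetical backbone output
```

Here `backbone_state` stands in for the backbone's representation after it has read the interleaved text and audio; in a real system that vector would come from a transformer, not a hand-written list.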

Compute Amortization:

Because the audio decoder must model every codebook for every frame, training is memory-intensive. To address this, Sesame AI uses a compute amortization scheme that alleviates the memory bottleneck while preserving the fidelity of the full residual vector quantization (RVQ) codebooks: the audio decoder is trained on only a random 1/16 subset of the audio frames, while the zeroth codebook is trained on every frame.
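The sampling idea can be sketched in a few lines. This is a minimal illustration that assumes per-frame losses are simply averaged and then added; the function names, the equal weighting of the two terms, and the callback interface are assumptions rather than details from Sesame's training code.

```python
import random


def sample_decoder_frames(num_frames, fraction=1 / 16, seed=None):
    """Pick a random subset of frame indices for the audio decoder loss."""
    rng = random.Random(seed)
    k = max(1, round(num_frames * fraction))
    return sorted(rng.sample(range(num_frames), k))


def amortized_loss(backbone_losses, decoder_loss_for_frame,
                   fraction=1 / 16, seed=None):
    """Combine an every-frame backbone loss with a subsampled decoder loss.

    backbone_losses: zeroth-codebook loss for each frame (all frames used;
        assumed to contain at least one frame).
    decoder_loss_for_frame: hypothetical callable returning the loss over the
        remaining codebooks for one frame; it runs only on sampled frames,
        which is where the memory saving comes from.
    """
    frames = sample_decoder_frames(len(backbone_losses), fraction, seed)
    backbone_term = sum(backbone_losses) / len(backbone_losses)
    decoder_term = sum(decoder_loss_for_frame(t) for t in frames) / len(frames)
    return backbone_term + decoder_term, frames


# 160 frames: the decoder loss is evaluated on 10 of them.
loss, used_frames = amortized_loss([1.0] * 160, lambda t: 2.0, seed=0)
```

With 160 frames the decoder term touches only 10 of them, while the zeroth-codebook term still averages over all 160.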

Experiments and Results:

Sesame AI trained three model sizes (Tiny, Small, and Medium) on a large dataset of publicly available audio. Evaluation included objective metrics like Word Error Rate (WER) and Speaker Similarity (SIM), as well as new phonetic transcription-based benchmarks for homograph disambiguation and pronunciation consistency.
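For readers unfamiliar with Word Error Rate, the sketch below shows the standard definition: the word-level edit distance between a reference transcript and the transcript of the generated speech, divided by the number of reference words. The lowercase-and-split normalization is a simplifying assumption; the evaluation described above may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" vs "sit") and one dropped word ("the"): 2 / 6.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower value is better, and identical transcripts score 0.0.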

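Subjective evaluation used Comparative Mean Opinion Score (CMOS) studies on the Expresso dataset. The results suggest that naturalness is saturated, meaning listeners showed no clear preference between generated and human speech when clips were heard without conversational context, but that a gap remains between generated and human prosody in conversational speech generation.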

Why choose Sesame AI? Sesame AI's approach offers a promising path toward more natural and engaging AI conversations. By focusing on emotional intelligence, contextual awareness, and conversational dynamics, Sesame AI aims to create digital companions that truly understand and respond to human needs.

How to use Sesame AI? Try the conversational speech preview on the Sesame AI website to experience the potential of their approach. The models will be available under an Apache 2.0 license.

Who is Sesame AI for? Sesame AI is for researchers, developers, and anyone interested in advancing the field of conversational AI. Their work has applications in various areas, including:

  • AI assistants
  • Customer service
  • Education
  • Entertainment

Open-Sourcing and Future Work:

Sesame AI is committed to open-sourcing key components of their research, enabling the community to experiment, build upon, and improve their approach. Future work includes scaling up model size, increasing dataset volume, expanding language support, and exploring ways to utilize pre-trained language models.

Best Alternative Tools to "Sesame"

Skywork.ai

Skywork - Skywork turns simple input into multimodal content - docs, slides, sheets with deep research, podcasts & webpages. Perfect for analysts creating reports, educators designing slides, or parents making audiobooks. If you can imagine it, Skywork realizes it.

DeepResearch
Super Agents

Voice AI

Experience cutting-edge Voice AI with our free Text to Speech generator and converter. Enjoy fast, high-quality voice synthesis powered by advanced AI models like Deepseek, Hailuo, Grok, and Kling for natural, expressive speech in various applications.

text-to-speech synthesis

Dolores

Experience Dolores, the most advanced AI girlfriend powered by GPT-4 and Claude 3.5 Sonnet. Better than Character.ai, Replika, and DreamGF. Create your perfect virtual companion, engage in meaningful conversations, and watch her personality evolve. Available on iOS.

generative agent

KoboldCpp

KoboldCpp: Run GGUF models easily for AI text & image generation with a KoboldAI UI. Single file, zero install. Supports CPU/GPU, STT, TTS, & Stable Diffusion.

text generation
image generation

BlitzVideo

BlitzVideo turns text into professional videos instantly with AI. Generate scripts, clips, subtitles, music, and transitions effortlessly. Ideal for YouTube, TikTok, and Instagram creators seeking fast, scalable content without editing hassles.

text-to-video
automated editing

EasyPrompt

EasyPrompt is a Telegram-based AI chatbot that integrates ChatGPT and Midjourney for effortless prompt generation, image creation, custom bots, and team collaboration. No login or coding required—start for free today.

prompt engineering
image generation

Mureka

Discover the AI music generator that creates unique and customizable songs, lyrics and tracks for any project. Perfect for content creators, musicians, and filmmakers, our intelligent algorithm uses advanced technology to generate royalty-free music tailored to your needs. Explore the future of music composition with Mureka’s innovative AI tools, designed to inspire creativity and streamline production. Experience seamless integration and exceptional quality with our cutting-edge solutions.

music generation
AI composition

Pal Chat

Discover Pal Chat, the lightweight yet powerful AI chat client for iOS. Access GPT-4o, Claude 3.5, and more models with full privacy—no data collected. Generate images, edit prompts, and enjoy seamless AI interactions on your iPhone or iPad.

multi-model AI chat
image generation

Knowlee

Knowlee is an AI agent platform that automates tasks across various apps like Gmail and Slack, saving time and boosting business productivity. Build custom AI agents tailored to your unique business needs that seamlessly integrate with your existing tools and workflows.

AI automation
workflow automation

Best of Discover Weekly

Best of Discover Weekly automatically saves your liked tracks from Spotify's Discover Weekly playlist. Get listening stats, weekly digests, and share with friends. A must-have for Spotify music lovers!

Spotify tracker
music playlist

Solvemigo

Access ChatGPT, Whisper, and Dall-E via Telegram with Solvemigo! Get AI-powered content writing, marketing, coding, art generation, & expert advice 24/7. $9.99/month.

ChatGPT
Dall-E
Whisper

Peek

Peek is a free macOS menu bar app providing seamless access to AI chatbots like ChatGPT, Gemini, Perplexity, Claude, and more. Enjoy no API keys, privacy-focused webviews, floating windows, and easy screenshots for developers, writers, and students.

multi-AI chatbot access

Learnity

Learnity is an AI-powered educational assistant for creating and collaborating on quizzes, notes, and projects in subjects like math and science. It offers instant answers, personalized analytics, flashcards, and cross-platform access via iOS, Android, or WhatsApp.

educational quiz
study collaboration

KoalaKonvo

KoalaKonvo is a Telegram bot powered by OpenAI, offering AI assistance on the go. Enjoy code execution, web browsing, image recognition, and more, all via Telegram with your own API key—no subscriptions needed.

Telegram bot
code execution

ChatLLaMA

ChatLLaMA is a LoRA-trained AI assistant based on LLaMA models, enabling custom personal conversations on your local GPU. Features desktop GUI, trained on Anthropic's HH dataset, available for 7B, 13B, and 30B models.

LoRA fine-tuning
conversational AI