Last week we released our speech-to-text stack. This week, we’re going the other direction. Today we’re releasing a text-to-speech stack for R: native inference, API wrappers, Docker containers, and a Shiny app.
The full stack
We’re releasing five components that together allow you to do TTS in R:
- chatterbox - Native R torch implementation of Resemble AI’s Chatterbox model
- tts.api - Unified R interface for multiple TTS backends
- cornfab - Shiny app for point-and-click speech generation
- chatterbox-tts-api - Docker container for Chatterbox
- qwen3-tts-api - Docker container for Qwen3-TTS
Why so many pieces? Because no single deployment model works everywhere. Sometimes you want everything running locally in R so you can inspect tensors and debug behavior. Sometimes you want a container that stays warm in VRAM. Sometimes you just want to call an API and move on.
The goal is, again, to give us options without drama.
chatterbox: native R torch TTS
The chatterbox package is an R implementation of Resemble AI’s Chatterbox model. Like our whisper package, it’s built entirely using torch for R. No Python.
library(chatterbox)

model <- chatterbox("cuda")
model <- load_chatterbox(model)

result <- tts(model, "Hello from R!", voice = "reference.wav")
write_audio(result$audio, result$sample_rate, "hello.wav")
Voice cloning works out of the box. Give it a few seconds of reference audio, and it’ll generate speech in that voice. The model handles prosody, emotion, and natural speech patterns surprisingly well.
Is it as fast as the Python implementation? Not quite; it's about 3x slower in float32. But with traced inference enabled (the default), you get JIT-compiled execution that closes much of that gap. Exploring that gap shows us exactly where torch for R needs to get up to speed. And you get something the Python version doesn't offer: the ability to read and step through the entire inference pipeline in R.
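If you want to measure that gap on your own hardware, here's a minimal timing sketch using only the calls shown above (reference.wav is a placeholder clip; your numbers will differ):

library(chatterbox)

# Load the model once; the reference clip is whatever voice you want to clone
model <- load_chatterbox(chatterbox("cuda"))

# Rough wall-clock timing of a single generation; results vary with GPU,
# precision, and text length
timing <- system.time(
  result <- tts(model, "Benchmarking text-to-speech from R.", voice = "reference.wav")
)
timing["elapsed"]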
For the curious, the implementation includes the full architecture: S3Gen encoder, T3 transformer decoder, CFG-based generation, etc. It’s all there in R if you want to understand how modern TTS actually works.
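One low-tech way to start poking around is to inspect the loaded model object directly; a sketch, with the caveat that the component names you'll see are whatever the package uses internally, not anything promised here:

library(chatterbox)

# Load for inspection; "cpu" as a device is an assumption (the example above uses "cuda")
model <- load_chatterbox(chatterbox("cpu"))

# Top-level structure of the model object (sub-modules, config, etc.)
str(model, max.level = 1)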
tts.api: one function, many backends
Not everyone needs native inference. Sometimes you just want speech and don’t care how it happens.
tts.api provides a single tts() function that routes to whatever backend makes sense:
library(tts.api)

# Auto mode - uses native if available, else containers/APIs
tts("Hello world", voice = "reference.wav", file = "hello.wav")

# Explicit backends
tts("Hello", voice = "reference.wav", file = "out.wav", backend = "native")
tts("Hello", voice = "Vivian", file = "out.wav", backend = "qwen3")
tts("Hello", voice = "nova", file = "out.wav", backend = "openai")
Supported backends:
- native - R chatterbox package (no Docker needed)
- chatterbox - Local Chatterbox container
- qwen3 - Local Qwen3-TTS container (9 voices, 10 languages, voice design)
- openai - OpenAI TTS API
- elevenlabs - ElevenLabs API
Again, the goal is optionality without ceremony. The same code works whether you're on a laptop, a GPU server, or just using an API key.
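As a sketch of that flexibility (assuming the relevant containers are running and API keys are set; the voice names follow the examples above), you can swap backends in a loop and compare the results side by side:

library(tts.api)

# Generate the same sentence with several backends for a side-by-side listen;
# each backend expects a different kind of voice identifier
jobs <- list(
  list(backend = "native", voice = "reference.wav"),
  list(backend = "qwen3",  voice = "Vivian"),
  list(backend = "openai", voice = "nova")
)

for (job in jobs) {
  tts("The quick brown fox jumps over the lazy dog.",
      voice   = job$voice,
      file    = paste0("fox-", job$backend, ".wav"),
      backend = job$backend)
}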
The containers
For production use or when you want pre-loaded models and fast inference, we’re releasing two Docker containers.
chatterbox-tts-api
Our fork of Resemble AI’s Chatterbox with an OpenAI-compatible API:
docker run -d --gpus all --network=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e PORT=7810 \
  chatterbox-tts-api
Voice cloning, exaggeration control, and roughly 2.2 seconds to generate 6 seconds of audio once the model is loaded and warm in VRAM.
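Because the API is OpenAI-compatible, you can also hit the container directly from R without tts.api. A minimal sketch, assuming the standard /v1/audio/speech route and the port configured above; the body fields mirror OpenAI's TTS request and may need adjusting for this container:

library(httr2)

# POST to the container's OpenAI-style speech endpoint
resp <- request("http://localhost:7810/v1/audio/speech") |>
  req_body_json(list(
    input = "Hello from the container.",
    voice = "reference"  # placeholder; use whatever voice id the container expects
  )) |>
  req_perform()

# The response body is raw audio bytes
writeBin(resp_body_raw(resp), "hello-container.wav")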
qwen3-tts-api
Alibaba’s Qwen3-TTS is impressive. Nine built-in voices, ten languages, and two features that Chatterbox doesn’t have: voice design and voice cloning from just 3 seconds of audio. We’ll be adding voice design next week, but didn’t want to wait any longer to put all this out there.
docker run -d --gpus all --network=host \
  -v ~/.cache/huggingface:/cache \
  -e PORT=7811 \
  qwen3-tts-api:blackwell
Voice design lets you describe the voice you want in natural language:
speech_design(
  "Welcome to our podcast",
  voice_description = "A warm, professional female voice with slight enthusiasm",
  file = "intro.wav"
)
Both containers support Blackwell GPUs (RTX 50xx) with the appropriate Dockerfile.
cornfab: when you don’t want to write code
cornfab is a Shiny app that puts all of this behind a simple UI. Select a backend, pick a voice, type some text, click generate.
library(cornfab)
run_app()  # Runs on port 7803
It supports all the backends, voice uploads for cloning, parameter tweaking, and persistent history. Custom voices are stored in ~/.cornfab/voices/ (but only after you approve, in the CRAN-compliant fashion ;) and work across all backends that support cloning.
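Those cloned voices are just files on disk, so you can list or manage them from R as well; a small sketch (the .wav extension is an assumption on my part):

# List the reference clips cornfab has saved
list.files(path.expand("~/.cornfab/voices"), pattern = "\\.wav$")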
One note: Right now, cornfab doesn’t auto-start containers. You’ll need to start them yourself before selecting those backends. That feature is coming, but needs more testing.
Getting started
# Native R TTS
remotes::install_github("cornball-ai/chatterbox")

# Unified API
remotes::install_github("cornball-ai/tts.api")

# Shiny app
remotes::install_github("cornball-ai/cornfab")
Then pick your entry point:
- chatterbox::tts() if you want full control and transparency
- tts.api::tts() if you want flexibility across backends
- cornfab::run_app() if you want a UI
All packages are MIT licensed. Contributions welcome.
What’s next
There’s still a lot to tighten up. Performance gaps between R and Python implementations are visible, and that’s useful data for improving torch in R. Voice design support for Qwen3-TTS is landing next, and container auto-startup from inside cornfab needs more testing before I trust it.
For now, this is an MVP that’s already useful. As with the speech-to-text stack, the larger goal is simple: make generative AI something R users can run, read, and modify.
