Speech-to-text is not the most exciting thing in the world of AI these days, but today we’re announcing the release of three R packages that together provide a complete speech-to-text solution using just R. No APIs, no Python, no binaries.
Why bother with native R at all?
Why STT in R? Because at cornball.ai, we’ve also been overwhelmed (in the best way) by the change in our coding and productivity using CLI coding tools like Claude Code, Codex, and OpenCode (we’ve been using Claude Code). After a few weeks of building piece after piece of R software, some long on the roadmap and others just a sidequest to see why not, we decided it’s past time we started sharing.
I firmly believe that R users deserve first-class generative AI, not just wrappers, glue code, or API clients bolted onto someone else’s ecosystem. R users have a lot to bring to the world of AI research and applications, and there’s no good reason we should be working on the periphery while the interesting model work happens elsewhere.
That said, if speed is your primary concern, whisper.cpp exists, and so does an R wrapper, audio.whisper, and they’re excellent… although audio.whisper is not yet on CRAN. The process of submitting our packages to CRAN… Hold on, let me implement streaming in the app I’m about to tell you about so I can do the dishes while I write this ;) Okay, thanks, I’m back. While we work on CRAN submissions, you’ll have to install the packages directly from GitHub for now.
After 15+ years of preferring R to Python, I simply read R better. Maybe it’s because Python is deeply object-oriented, and that style doesn’t match how I, and many other ML/data science/research-oriented R users, think about problems. Add the occasional Python strangulation (i.e., dependency hell, up to and including the occasional destruction of an Ubuntu install) and the appeal of an R (and hopefully soon, CRAN-native) approach becomes obvious. Moreover, local inference is private, cheaper, and more flexible; pick all three!
This STT MVP will serve as a blueprint for more of our forthcoming generative-model work in R. Hopefully it’s accessible enough, and good enough, to be genuinely useful.
To that end, today we’re releasing three packages that together form a complete speech-to-text stack for R: whisper, stt.api, and earshot. They cover everything from local, native inference to a point-and-click Shiny app. All in R.
The shape of the problem
You want transcription in R. Maybe you want to know how it works, or you need it for another Shiny app. Maybe you want the warm-fuzzy that it’s local and will keep working for a long time, or maybe you want the ease of an API key you’re already using.
STT isn’t new, but I wanted something that behaved like an R package should: installable with standard tooling, debuggable in R, and flexible enough to serve as both a research tool and a practical utility.
whisper: a native R implementation
The whisper package is a pure R implementation of OpenAI’s Whisper model, built on top of the torch package. No Python. No system calls. No background processes.
library(whisper)

result <- transcribe("recording.wav", model = "small")
result$text
Under the hood, this includes the full pipeline: audio preprocessing, mel spectrograms, the encoder-decoder transformer, tokenization, decoding, timestamps, and translation. If you’ve ever wondered how Whisper actually works, you can now read through it as R code.
Performance is reasonable, and GPU acceleration works automatically when CUDA is available. Is it as fast as a hand-tuned C++ implementation? No. But speed isn’t the only thing that matters.
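You can check what torch sees from your R session; this is the torch package’s own API, and presumably the same check that drives the automatic acceleration:

# TRUE when torch was installed with CUDA support and a GPU is visible
torch::cuda_is_available()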
What you get instead is transparency. You can inspect tensors, extract intermediate representations, or experiment with alternative strategies without leaving the language.
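To make that concrete, here’s a minimal sketch of a Whisper-style log-mel front end written directly in R torch. The constants (16 kHz audio, n_fft = 400, hop length 160, 80 mel bins) come from the Whisper paper; the helper names below are ours for illustration and are not the whisper package’s actual internals (which also rescale the result):

library(torch)

hz_to_mel <- function(hz) 2595 * log10(1 + hz / 700)
mel_to_hz <- function(mel) 700 * (10^(mel / 2595) - 1)

# Triangular mel filterbank mapping n_fft/2 + 1 FFT bins onto n_mels bins
mel_filterbank <- function(n_mels = 80, n_fft = 400, sr = 16000) {
  fft_freqs <- seq(0, sr / 2, length.out = n_fft %/% 2 + 1)
  mel_pts <- mel_to_hz(seq(hz_to_mel(0), hz_to_mel(sr / 2),
                           length.out = n_mels + 2))
  fb <- matrix(0, n_mels, length(fft_freqs))
  for (m in seq_len(n_mels)) {
    up   <- (fft_freqs - mel_pts[m])     / (mel_pts[m + 1] - mel_pts[m])
    down <- (mel_pts[m + 2] - fft_freqs) / (mel_pts[m + 2] - mel_pts[m + 1])
    fb[m, ] <- pmax(0, pmin(up, down))
  }
  torch_tensor(fb, dtype = torch_float())
}

# audio: a 1-D float tensor of 16 kHz samples
log_mel <- function(audio, n_fft = 400, hop = 160) {
  stft  <- torch_stft(audio, n_fft = n_fft, hop_length = hop,
                      window = torch_hann_window(n_fft),
                      return_complex = TRUE)
  power <- torch_abs(stft)^2                      # magnitude-squared spectrogram
  mel   <- torch_matmul(mel_filterbank(), power)  # project onto mel bins
  logm  <- torch_log10(torch_clamp(mel, min = 1e-10))
  torch_maximum(logm, logm$max() - 8)             # clamp dynamic range, Whisper-style
}

None of this is something you’d hand-tune for speed, but being able to read and run each stage interactively is exactly what a native implementation buys you.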
A Claude Code aside
We’ve been doing all of the above with Claude Code, and it’s been a game-changer for understanding how these models work from 30,000 feet, because we all know the nitty-gritty typically ends up being something you learned in Calc II ;) But seriously, at the current rate of development in AI, it’s difficult to keep up and get things done at the same time!
In the coming days and weeks, we’ll be talking more about how we worked with Claude Code to transcribe PyTorch to R torch, among other R and gen AI-related things. We didn’t even intend to build an STT app, but when it’s this easy to get things done, you just go with it. Again, stay tuned!
stt.api: one function, many backends
Not everyone wants, or needs, a native implementation. Sometimes you just want transcription to work, and you already have an OpenAI key on the machine in front of you.
That’s where stt.api comes in.
It provides a single stt() function that can route requests to different backends:
library(stt.api)

result <- stt("audio.mp3")
By default, it tries local options first and falls back to APIs only if needed. You can also be explicit:
1stt("audio.mp3", backend = "whisper") # Native R torch
2stt("audio.mp3", backend = "audio.whisper") # whisper.cpp
3stt("audio.mp3", backend = "openai") # API
The goal here is optionality without ceremony. You shouldn’t have to rewrite your code just because you moved from a laptop to a server, or from offline work to an API-backed workflow.
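For intuition, the local-first fallback can be thought of as something like the sketch below. This is illustrative of the pattern only, not stt.api’s actual internals; the ordering and availability checks may differ in the real package, and the OPENAI_API_KEY check is our assumption about how an API backend would typically detect credentials:

# Illustrative sketch of local-first backend selection (not stt.api's real code)
pick_backend <- function(preferred = NULL) {
  candidates <- if (is.null(preferred)) {
    c("audio.whisper", "whisper", "openai")  # local options first, API last
  } else {
    preferred
  }
  for (b in candidates) {
    available <- switch(b,
      "audio.whisper" = requireNamespace("audio.whisper", quietly = TRUE),
      "whisper"       = requireNamespace("whisper", quietly = TRUE),
      "openai"        = nzchar(Sys.getenv("OPENAI_API_KEY")),
      FALSE
    )
    if (isTRUE(available)) return(b)
  }
  stop("No usable speech-to-text backend found")
}

pick_backend()  # e.g. "whisper" on a machine with that package installed

Whatever the real implementation looks like, the point is the local-first ordering: nothing about your calling code changes when a backend disappears.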
earshot: when you don’t want to write code
Some people just want to transcribe audio. They don’t want to think about models, devices, or backends. Or they want to see how front-end voice-to-text is supposed to work in Shiny (there’s a sketch of that pattern at the end of this section).
earshot is a Shiny app that sits on top of stt.api and provides a simple UI for transcribing recorded or uploaded audio. It supports microphone input, file uploads, backend selection, model downloads, and audio preview… and now, just before Claude went down at 4:20pm CST, earshot supports streaming. I didn’t even get to the end of this blog post! 🤯
library(earshot)
run_app()
It’s intentionally boring in the best way: open it, click record, get text.
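For the curious, the browser side of mic capture in Shiny generally follows the standard MediaRecorder pattern: record in the browser, base64 the bytes, ship them to R, and hand them to the transcriber. Here’s a minimal, self-contained sketch of that pattern; it is not earshot’s actual implementation, and the input name audio_b64 is made up for illustration:

library(shiny)

ui <- fluidPage(
  actionButton("rec", "Record 3 seconds"),
  verbatimTextOutput("status"),
  # Standard MediaRecorder pattern: capture mic audio, base64 it, send to R
  tags$script(HTML("
    document.getElementById('rec').addEventListener('click', async () => {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const rec = new MediaRecorder(stream);
      const chunks = [];
      rec.ondataavailable = (e) => chunks.push(e.data);
      rec.onstop = async () => {
        const buf = await new Blob(chunks).arrayBuffer();
        const b64 = btoa(String.fromCharCode(...new Uint8Array(buf)));
        Shiny.setInputValue('audio_b64', b64, { priority: 'event' });
      };
      rec.start();
      setTimeout(() => rec.stop(), 3000);
    });
  "))
)

server <- function(input, output, session) {
  output$status <- renderPrint({
    req(input$audio_b64)
    tmp <- tempfile(fileext = ".webm")
    writeBin(jsonlite::base64_dec(input$audio_b64), tmp)
    # From here you would hand tmp to stt.api::stt(tmp)
    paste("received", file.size(tmp), "bytes of audio")
  })
}

shinyApp(ui, server)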
Getting started
# install.packages("remotes")  # if you don't have it yet
remotes::install_github("cornball-ai/whisper")
remotes::install_github("cornball-ai/stt.api")
remotes::install_github("cornball-ai/earshot")
Then pick your entry point:
• whisper::transcribe() if you want full control
• stt.api::stt() if you want flexibility
• earshot::run_app() if you want a UI
All three packages are MIT licensed, and contributions are welcome.
