Speech to Text
Description
Convert an audio file to text using an OpenAI-compatible API or local audio.whisper backend.
Usage
stt(
  file,
  model = NULL,
  language = NULL,
  response_format = c("json", "text", "verbose_json"),
  backend = c("auto", "whisper", "audio.whisper", "openai", "fal"),
  prompt = NULL
)
Arguments
file: Path to the audio file to convert.

model: Model name to use for transcription. For API backends, this is passed directly (e.g., "whisper-1"). For audio.whisper, this is the model size (e.g., "tiny", "base", "small", "medium", "large"). If NULL, uses the backend's default.

language: Language code (e.g., "en", "es", "fr"). Optional hint to improve transcription accuracy.

response_format: Response format for API backends. One of "json", "text", or "verbose_json". Ignored for the audio.whisper backend.

backend: Which backend to use: "auto" (default), "whisper", "audio.whisper", "openai", or "fal". Auto mode tries native whisper first, then audio.whisper, then the OpenAI API (if configured), then the fal API.

prompt: Optional text to guide the transcription. For API backends, this is passed as initial_prompt to help with the spelling of names, acronyms, or domain-specific terms. Ignored for the audio.whisper backend (not supported by the underlying library).
Value
A list with components:
- text: The transcribed text as a single string.
- segments: A data.frame of segments with timing info, or NULL.
- language: The detected or specified language code.
- backend: Which backend was used (“api” or “audio.whisper”).
- raw: The raw response from the backend.
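Because the return value is a plain list, its components can be inspected with standard R tools. A minimal sketch, assuming a successful call; a mock result stands in here for a real backend response, so the field values are illustrative only:

```r
# Mock of the list stt() returns, standing in for a real transcription
result <- list(
  text = "hello world",
  segments = data.frame(
    start = c(0, 1.2),
    end = c(1.2, 2.5),
    text = c("hello", "world")
  ),
  language = "en",
  backend = "api",
  raw = NULL
)

# The full transcript as a single string
cat(result$text, "\n")

# Per-segment timing, when the backend provides it
if (!is.null(result$segments)) {
  print(result$segments[, c("start", "end", "text")])
}
```

Checking `is.null(result$segments)` before use matters because segments are only populated when the backend returns timing information (e.g., with response_format = "verbose_json").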
Examples
# Using OpenAI API
set_stt_base("https://api.openai.com")
set_stt_key(Sys.getenv("OPENAI_API_KEY"))
result <- stt("speech.wav", model = "whisper-1")
result$text

# Using local server
set_stt_base("http://localhost:4123")
result <- stt("speech.wav")

# Using audio.whisper directly
result <- stt("speech.wav", backend = "audio.whisper")