How It Works
The process is straightforward:
- Connect: Establish a WebSocket connection to the /stream endpoint.
- Configure: Send an initial JSON message containing your desired configuration, including audio format and VAD settings.
- Stream: Send raw audio data as binary messages.
- Receive: Listen for JSON messages from the server containing the final transcript for each utterance.
- Close: Send an eos (end-of-stream) message to gracefully terminate the connection.
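Below is a minimal sketch of this flow using the third-party websockets Python package. The host name, the message field names, and the chunk size are assumptions for illustration only, and authentication is omitted; see the full examples later on this page for the exact protocol.

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def stream_audio(path: str) -> None:
    # 1. Connect: open a WebSocket to the /stream endpoint
    #    (host is a placeholder; authentication is omitted here).
    async with websockets.connect("wss://api.example.com/stream") as ws:
        # 2. Configure: send the initial JSON message with audio format and VAD settings.
        #    The field names in this message are assumptions for illustration.
        await ws.send(json.dumps({
            "type": "start",
            "audio": {"sample_rate": 16000, "encoding": "pcm_s16le"},
            "vad": {"threshold": 0.5, "min_silence_ms": 500},
        }))

        # 3. Stream: send raw audio as binary messages.
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # ~100 ms of 16 kHz, 16-bit mono PCM
                await ws.send(chunk)

        # 4. Close: send the eos message so the server finalizes the last utterance.
        await ws.send(json.dumps({"type": "eos"}))

        # 5. Receive: print each finalized transcript until the server closes the socket.
        async for message in ws:
            print(json.loads(message))


# asyncio.run(stream_audio("speech.raw"))
```

A real-time application would receive transcripts concurrently while still streaming audio, as the microphone examples below do; this sketch sends everything first only to keep the steps readable.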
Get your API Key
Your API key is used to authenticate your requests. Follow this link to grab your first API key.
The following example uses your machine’s microphone as the audio source.
Install the SDK mic addon
Python SDK Example
An SDK Example (mic_ws_continuous_sdk.py)
Code Samples
A Full Example (mic_ws_continuous.py)
Understanding VAD
Voice Activity Detection acts as the Fennec’s “ears,” listening for speech and silence to intelligently segment the continuous audio stream into meaningful chunks, or “utterances.” Properly tuning the VAD settings is the most critical step in tailoring the transcription behavior to your specific application. All VAD settings are passed inside the vad dictionary within your initial start message.
Core VAD Parameters
These parameters control the fundamental timing and sensitivity of the speech detection.
VAD Sensitivity. This float between 0.0 and 1.0 determines how loud a sound must be to be considered speech. Lower values (e.g., 0.3) are more sensitive and can pick up quieter speakers or whispers, but may also misclassify background noise as speech. Higher values (e.g., 0.7) are less sensitive and are ideal for ignoring background noise in loud environments. 0.5 is a balanced starting point.
End-of-Speech Trigger. This is the most important parameter for controlling the latency-vs-completeness tradeoff. It defines the duration of silence (in milliseconds) the system will wait for before finalizing an utterance and sending the transcript. A low value (e.g., 250) results in fast, short transcripts. A high value (e.g., 1000) results in longer, more complete sentences but increases the perceived delay.
Pre-Speech Audio Buffer. This setting captures a small amount of audio (in milliseconds) from before the VAD officially detected the start of speech. It is crucial for preventing the first syllable or word of an utterance from being cut off (e.g., ensuring “Okay, let’s start” isn’t transcribed as “kay, let’s start”).
Post-Utterance Silence. This specifies an extra duration of silence (in seconds) to append to the end of a finalized utterance before it’s sent for transcription. This can sometimes provide the ASR model with more non-speech context, which can help in correctly punctuating the very end of a sentence, but it will directly add to the overall latency. Use with caution.
Minimum Speech to Start. An utterance will not begin until the VAD detects at least this many milliseconds of continuous speech. This helps prevent very short vocal tics or brief background noises from incorrectly starting a new transcription segment.
Minimum Voiced Audio per Utterance. After an utterance is segmented, the system checks if it contains at least this much voiced audio. If not, the entire segment is discarded. This is a powerful tool for filtering out non-speech sounds like coughs, door slams, or clicks that might otherwise be long enough to form a segment.
Minimum Transcript Character Length. After a segment is transcribed, the system checks the character count of the resulting text. If it’s less than this value, the transcript is discarded. This is a post-processing filter useful for eliminating transcripts of filler words like “a” or “uh”.
Minimum Transcript Word Count. Similar to min_chars, but works on the word count. Setting this to 2 would discard single-word utterances (e.g., “Okay”, “Right”).
Non-Speech Utterance Extension. This allows a non-speech sound (below the threshold) to extend a currently active speech utterance for up to this many milliseconds. However, such a sound cannot start a new utterance. This is useful for cases where a speaker’s voice trails off or a quiet background noise occurs mid-sentence, preventing the utterance from cutting off prematurely.
Maximum Utterance Duration. If this is set to a value greater than 0 (e.g., 15000 for 15 seconds), it acts as a safety net. The system will automatically finalize and transcribe an utterance after it reaches this duration, even if the speaker has not paused. This prevents infinitely long transcripts from users who don’t pause naturally.
Enable Debug Logging. If set to true, the server will send additional diagnostic messages along with the transcripts. This is useful for development and fine-tuning VAD settings, but should be disabled in production.
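Taken together, an initial start message might look like the sketch below. The vad keys shown are the parameter names used in the scenarios later in this guide, mapped to the settings described above; the surrounding message structure and the default values are illustrative assumptions, not an authoritative reference.

```python
# Illustrative sketch only: the values and the non-"vad" fields are assumptions.
start_message = {
    "type": "start",
    "audio": {"sample_rate": 16000, "encoding": "pcm_s16le"},
    "vad": {
        "threshold": 0.5,         # VAD Sensitivity (0.0-1.0)
        "min_silence_ms": 500,    # End-of-Speech Trigger, in ms
        "start_trigger_ms": 100,  # Minimum Speech to Start, in ms
        "min_voiced_ms": 200,     # Minimum Voiced Audio per Utterance, in ms
        "min_chars": 0,           # Minimum Transcript Character Length
        "min_words": 0,           # Minimum Transcript Word Count
        "amp_extend": 300,        # Non-Speech Utterance Extension, in ms
        "force_decode_ms": 0,     # Maximum Utterance Duration (0 = disabled)
    },
}
```

Settings not referenced again below (the pre-speech buffer, post-utterance silence, and debug flag) follow the same pattern; their exact key names are omitted here rather than guessed.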
Scenario 1: General Purpose Dictation
This profile is balanced for users dictating notes or emails. It favors creating more complete sentences over instant responsiveness.
Rationale:
- min_silence_ms is high (800) to give users time to think between clauses without breaking the sentence.
- force_decode_ms is enabled as a fallback, ensuring that even if a user talks for a long time, they will get a transcript back every 20 seconds.
- threshold is slightly more sensitive (0.45) for more natural-sounding dictation.
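As a sketch, the vad block for this profile might look like the following. The three values come directly from the rationale above; settings not listed are left at their defaults.

```python
dictation_vad = {
    "threshold": 0.45,         # slightly more sensitive for quiet, natural dictation
    "min_silence_ms": 800,     # allow thinking pauses without splitting the sentence
    "force_decode_ms": 20000,  # fallback: finalize an utterance at least every 20 s
}
```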
Scenario 2: Transcription in a Noisy Environment
This profile is hardened for use cases like a call center or public kiosk where significant background noise is expected. The goal is to reject as much non-speech audio as possible.
Rationale:
- threshold is set high (0.65) to make the VAD less sensitive, effectively ignoring lower-volume background sounds.
- start_trigger_ms and min_voiced_ms are both significantly increased to ensure only strong, clear speech from the primary speaker creates an utterance.
- min_chars and min_words are increased to filter out partial words or phrases picked up from background conversations.
- amp_extend is disabled (0) to prevent background hum or noise from incorrectly keeping an utterance alive.
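A corresponding vad block might look like the sketch below. Only threshold (0.65) and amp_extend (0) are stated explicitly in the rationale; the remaining values are illustrative interpretations of “significantly increased,” not recommended defaults.

```python
noisy_environment_vad = {
    "threshold": 0.65,        # less sensitive, so quieter background sounds are ignored
    "start_trigger_ms": 250,  # require a longer run of clear speech to open an utterance
    "min_voiced_ms": 500,     # discard segments without substantial voiced audio
    "min_chars": 10,          # drop stray fragments picked up from background talk
    "min_words": 2,           # drop single-word utterances (e.g., "Okay")
    "amp_extend": 0,          # never let non-speech noise keep an utterance alive
}
```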