For applications requiring immediate feedback, such as live captioning, voice commands, or AI voice agents, the WebSocket endpoint provides real-time transcription. You can stream raw audio and receive transcripts back as soon as a speaker pauses, all with sub-250 ms latency. This endpoint’s behavior is primarily controlled by Voice Activity Detection (VAD), which intelligently segments the audio stream into utterances based on speech and silence.

How It Works

The process is straightforward, and a raw-protocol sketch follows these steps:
  1. Connect: Establish a WebSocket connection to the /stream endpoint.
  2. Configure: Send an initial JSON message containing your desired configuration, including audio format and VAD settings.
  3. Stream: Send raw audio data as binary messages.
  4. Receive: Listen for JSON messages from the server containing the final transcript for each utterance.
  5. Close: Send an eos (end-of-stream) message to gracefully terminate the connection.
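
For context, here is a minimal sketch of that lifecycle over a raw WebSocket, using the third-party websockets package. The /stream path and the start/eos message types come from this page; the full endpoint URL, auth header, and exact message schemas are assumptions, so verify them against the API reference before relying on this.

import asyncio
import json
import websockets  # pip install websockets

async def transcribe(audio_chunks):
    # Hypothetical host and auth header -- confirm both before use.
    uri = "wss://api.example.com/stream"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    # Note: this keyword is named additional_headers in websockets >= 14.
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Steps 1-2: connect, then send the initial JSON configuration.
        await ws.send(json.dumps({
            "type": "start",
            "sample_rate": 16000,
            "channels": 1,
            "vad": {"threshold": 0.5, "min_silence_ms": 400},
        }))
        # Step 3: stream raw audio as binary frames.
        for chunk in audio_chunks:
            await ws.send(chunk)
        # Step 5: signal end-of-stream (exact message shape assumed).
        await ws.send(json.dumps({"type": "eos"}))
        # Step 4: collect JSON transcripts until the server closes.
        async for message in ws:
            print(json.loads(message))
    # In production, run the send and receive loops concurrently
    # (e.g., with asyncio.gather) so transcripts arrive while you stream.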

Get your API Key

Your API key is used to authenticate your requests. Follow this link to grab your first API key.
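
Rather than pasting the key into source code, a common pattern is to export it as an environment variable and read it at startup; the snippet below (and the SDK example further down) assumes the variable is named FENNEC_API_KEY.

import os

# In your shell first: export FENNEC_API_KEY="your-key-here"
API_KEY = os.getenv("FENNEC_API_KEY")
if not API_KEY:
    raise RuntimeError("Set FENNEC_API_KEY")
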
The following example uses your machine’s microphone as the audio source.

Install the SDK mic addon

pip install "fennec-asr[mic]"

Python SDK Example

An SDK Example (mic_ws_continuous_sdk.py)
import os, asyncio
from fennec_asr import Realtime
from fennec_asr.mic import stream_microphone

API_KEY = os.getenv("FENNEC_API_KEY")  # set this in your shell before running
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_MS = 32
SINGLE_UTTERANCE = False
DETECT_THOUGHTS = False

# Note: these are very aggressive VAD settings, tuned for high-speed transcription.
VAD = {
    "threshold": 0.5,  # VAD sensitivity. Lower is more sensitive. 0.5 is a good baseline.
    "min_silence_ms": 100,  # Milliseconds of silence to trigger an End-of-Speech event.
    "speech_pad_ms": 200,  # Milliseconds of audio to include before speech starts.
    "final_silence_s": 0.1, # Additional silence to wait for at the end.
    "start_trigger_ms": 36,  # How many ms of speech are needed to start transcribing.
    "min_voiced_ms": 48,  # Utterance is discarded if it has less than this much voiced audio.
    "min_chars": 1,  # Discard transcript if it has fewer characters than this.
    "min_words": 1,  # Discard transcript if it has fewer words than this.
    "amp_extend": 1200,  # A non-speech noise can extend utterance, but not start it.
    "force_decode_ms": 0,  # Forces a transcription after a set amount of ms have elapsed
    "debug": False,   # show debugging messages
}

async def main():
    if not API_KEY:
        raise RuntimeError("Set FENNEC_API_KEY")

    rt = (
        Realtime(API_KEY, sample_rate=SAMPLE_RATE, channels=CHANNELS, detect_thoughts=DETECT_THOUGHTS)
        .on("open",  lambda: print("✅ ready"))
        .on("final", lambda t: print("📝", t))
        .on("error", lambda e: print("❌", e))
    )

    # Apply the utterance mode and VAD overrides to the initial start message.
    rt._start_msg["single_utterance"] = SINGLE_UTTERANCE
    rt._start_msg["vad"] = VAD

    async with rt:
        await stream_microphone(rt, samplerate=SAMPLE_RATE, channels=CHANNELS, chunk_ms=CHUNK_MS)

if __name__ == "__main__":
    asyncio.run(main())
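
To adapt the script for one-shot interactions such as voice commands, flip the mode flag. This page does not spell out the server's behavior after the first final transcript, so the comment below is an assumption to verify:

# One-shot mode for voice commands: presumably the server finalizes the
# first utterance and ends the session rather than segmenting continuously.
SINGLE_UTTERANCE = True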

Understanding VAD

Voice Activity Detection acts as Fennec’s “ears,” listening for speech and silence to intelligently segment the continuous audio stream into meaningful chunks, or “utterances.” Tuning the VAD settings properly is the most critical step in tailoring transcription behavior to your specific application. All VAD settings are passed inside the vad dictionary within your initial start message.
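
For reference, a minimal start message looks like the sketch below. The field names match the full examples later on this page; any vad key you omit is assumed to fall back to a server-side default.

START_MSG = {
    "type": "start",
    "sample_rate": 16000,
    "channels": 1,
    "single_utterance": False,
    "vad": {
        "threshold": 0.5,
        "min_silence_ms": 400,
    },
}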

Core VAD Parameters

These parameters control the fundamental timing and sensitivity of the speech detection.
vad.threshold
float
VAD Sensitivity. This float between 0.0 and 1.0 determines how loud a sound must be to be considered speech. Lower values (e.g., 0.3) are more sensitive and can pick up quieter speakers or whispers, but may also misclassify background noise as speech. Higher values (e.g., 0.7) are less sensitive and are ideal for ignoring background noise in loud environments. 0.5 is a balanced starting point.
vad.min_silence_ms
integer
End-of-Speech Trigger. This is the most important parameter for controlling the latency-vs-completeness tradeoff. It defines the duration of silence (in milliseconds) the system will wait for before finalizing an utterance and sending the transcript. A low value (250) results in fast, short transcripts. A high value (1000) results in longer, more complete sentences but increases the perceived delay.
vad.speech_pad_ms
integer
Pre-Speech Audio Buffer. This setting captures a small amount of audio (in milliseconds) from before the VAD officially detected the start of speech. It is crucial for preventing the first syllable or word of an utterance from being cut off (e.g., ensuring “Okay, let’s start” isn’t transcribed as “kay, let’s start”).
vad.final_silence_s
float
Post-Utterance Silence. This specifies an extra duration of silence (in seconds) to append to the end of a finalized utterance before it’s sent for transcription. This can sometimes provide the ASR model with more non-speech context, which can help in correctly punctuating the very end of a sentence, but it will directly add to the overall latency. Use with caution.
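
To make the latency-vs-completeness tradeoff concrete, here are two illustrative vad blocks built only from the four core parameters above; the values are examples, not recommendations.

# Snappy: finalize quickly after each pause (voice agents, commands).
LOW_LATENCY_VAD = {
    "threshold": 0.5,
    "min_silence_ms": 150,   # finalize after 150 ms of silence
    "speech_pad_ms": 200,    # keep the first syllable intact
    "final_silence_s": 0.0,  # no extra trailing silence
}

# Patient: wait out mid-sentence pauses (dictation, note-taking).
COMPLETE_SENTENCE_VAD = {
    "threshold": 0.45,
    "min_silence_ms": 1000,  # give the speaker a full second to think
    "speech_pad_ms": 200,
    "final_silence_s": 0.2,  # a little trailing context for punctuation
}
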
Advanced Filtering and Control Parameters

These parameters provide fine-grained control over what constitutes a valid utterance, helping to filter out noise and unwanted segments.
vad.start_trigger_ms
integer
Minimum Speech to Start. An utterance will not begin until the VAD detects at least this many milliseconds of continuous speech. This helps prevent very short vocal tics or brief background noises from incorrectly starting a new transcription segment.
vad.min_voiced_ms
integer
Minimum Voiced Audio per Utterance. After an utterance is segmented, the system checks if it contains at least this much voiced audio. If not, the entire segment is discarded. This is a powerful tool for filtering out non-speech sounds like coughs, door slams, or clicks that might otherwise be long enough to form a segment.
vad.min_chars
integer
Minimum Transcript Character Length. After a segment is transcribed, the system checks the character count of the resulting text. If it’s less than this value, the transcript is discarded. This is a post-processing filter useful for eliminating transcripts of filler words like “a” or “uh”.
vad.min_words
integer
Minimum Transcript Word Count. Similar to min_chars, but works on the word count. Setting this to 2 would discard single-word utterances (e.g., “Okay”, “Right”).
vad.amp_extend
integer
Non-Speech Utterance Extension. This allows a non-speech sound (below the threshold) to extend a currently active speech utterance for up to this many milliseconds. However, such a sound cannot start a new utterance. This is useful for cases where a speaker’s voice trails off or a quiet background noise occurs mid-sentence, preventing the utterance from cutting off prematurely.
vad.force_decode_ms
integer
Maximum Utterance Duration. If this is set to a value greater than 0 (e.g., 15000 for 15 seconds), it acts as a safety net. The system will automatically finalize and transcribe an utterance after it reaches this duration, even if the speaker has not paused. This prevents infinitely long transcripts from users who don’t pause naturally.
vad.debug
boolean
Enable Debug Logging. If set to true, the server will send additional diagnostic messages along with the transcripts. This is useful for development and fine-tuning VAD settings, but should be disabled in production.
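
These filters compose naturally. For example, while tuning you might pair the 15-second safety net mentioned above with debug logging; the values below are illustrative only.

TUNING_VAD = {
    "threshold": 0.5,
    "min_silence_ms": 400,
    "start_trigger_ms": 50,    # ignore very brief vocal tics
    "min_voiced_ms": 100,      # drop segments with little voiced audio
    "min_words": 2,            # discard single-word utterances like "Okay"
    "force_decode_ms": 15000,  # never let an utterance run past 15 s
    "debug": True,             # turn off before production
}
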
VAD Configuration Examples

Here are three fully-configured START_MSG examples tailored for different real-world situations.

Scenario 1: General Purpose Dictation

This profile is balanced for users dictating notes or emails. It favors creating more complete sentences over instant responsiveness.
START_MSG = {
    "type": "start",
    "sample_rate": 16000,
    "channels": 1,
    "single_utterance": False,
    "vad": {
        "threshold": 0.45,
        "min_silence_ms": 800,      # Longer silence to allow for pauses mid-sentence
        "speech_pad_ms": 200,
        "start_trigger_ms": 36,
        "min_voiced_ms": 100,
        "min_chars": 2,
        "min_words": 1,
        "force_decode_ms": 60000   # Force a transcript after 60s for long-winded speakers
    }
}
Rationale:
  • min_silence_ms is high (800) to give users time to think between clauses without breaking the sentence.
  • force_decode_ms is enabled as a fallback, ensuring that even if a user talks for a long time without pausing, they will get a transcript back at least every 60 seconds.
  • threshold is slightly more sensitive (0.45) so softly spoken dictation is still captured.

Scenario 2: Transcription in a Noisy Environment

This profile is hardened for use cases like a call center or public kiosk where significant background noise is expected. The goal is to reject as much non-speech audio as possible.
START_MSG = {
    "type": "start",
    "sample_rate": 16000,
    "channels": 1,
    "single_utterance": False,
    "vad": {
        "threshold": 0.65,          # High threshold to ignore background chatter
        "min_silence_ms": 400,
        "speech_pad_ms": 150,
        "start_trigger_ms": 100,    # Requires a clear, intentional start to speech
        "min_voiced_ms": 250,       # Very high requirement for voiced audio
        "min_chars": 1,
        "min_words": 1,
        "amp_extend": 0,            # Disable extension from quiet sounds
        "force_decode_ms": 0
    }
}
Rationale:
  • threshold is set high (0.65) to make the VAD less sensitive, effectively ignoring lower-volume background sounds.
  • start_trigger_ms and min_voiced_ms are both significantly increased to ensure only strong, clear speech from the primary speaker creates an utterance.
  • min_chars and min_words are left at their minimums here; raise them (e.g., min_words: 2) to also filter out partial words or phrases picked up from background conversations.
  • amp_extend is disabled (0) to prevent background hum or noise from incorrectly keeping an utterance alive.
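
Scenario 3: Low-Latency Voice Agent

This profile mirrors the aggressive VAD settings from the SDK example near the top of this page. It is tuned for conversational agents where responsiveness matters more than perfectly complete sentences.
START_MSG = {
    "type": "start",
    "sample_rate": 16000,
    "channels": 1,
    "single_utterance": False,
    "vad": {
        "threshold": 0.5,
        "min_silence_ms": 100,     # Finalize almost immediately after a pause
        "speech_pad_ms": 200,
        "final_silence_s": 0.1,
        "start_trigger_ms": 36,
        "min_voiced_ms": 48,
        "min_chars": 1,
        "min_words": 1,
        "amp_extend": 1200,
        "force_decode_ms": 0
    }
}
Rationale:
  • min_silence_ms is very low (100) so the agent can respond almost as soon as the user stops speaking.
  • start_trigger_ms and min_voiced_ms are kept small so short confirmations like “Okay” still produce transcripts.
  • amp_extend (1200) lets a trailing breath or a voice that trails off extend the current utterance without starting a new one.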
Keep in mind that tuning VAD is an iterative process, and exposing simplified controls that map onto these parameters can let your end users adapt the experience to their specific situation.