How It Works
The process is straightforward:
- Connect: Establish a WebSocket connection to the /stream endpoint.
- Configure: Send an initial JSON message containing your desired configuration, including audio format and VAD settings.
- Stream: Send raw audio data as binary messages.
- Receive: Listen for JSON messages from the server containing the final transcript for each utterance.
- Close: Send an eos (end-of-stream) message to gracefully terminate the connection.
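Below is a minimal sketch of this flow using the third-party websockets Python package. The host name, the message field names, and the chunk size are assumptions for illustration only, and authentication is omitted; see the full examples later on this page for the exact protocol.

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def stream_audio(path: str) -> None:
    # 1. Connect: open a WebSocket to the /stream endpoint
    #    (host is a placeholder; authentication is omitted here).
    async with websockets.connect("wss://api.example.com/stream") as ws:
        # 2. Configure: send the initial JSON message with audio format and VAD settings.
        #    The field names in this message are assumptions for illustration.
        await ws.send(json.dumps({
            "type": "start",
            "audio": {"sample_rate": 16000, "encoding": "pcm_s16le"},
            "vad": {"threshold": 0.5, "min_silence_ms": 500},
        }))

        # 3. Stream: send raw audio as binary messages.
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # ~100 ms of 16 kHz, 16-bit mono PCM
                await ws.send(chunk)

        # 4. Close: send the eos message so the server finalizes the last utterance.
        await ws.send(json.dumps({"type": "eos"}))

        # 5. Receive: print each finalized transcript until the server closes the socket.
        async for message in ws:
            print(json.loads(message))


# asyncio.run(stream_audio("speech.raw"))
```

A real-time application would receive transcripts concurrently while still streaming audio, as the microphone examples below do; this sketch sends everything first only to keep the steps readable.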
Get your API Key
Your API key is used to authenticate your requests. Follow this link to grab your first API key.
The following example uses your machine’s microphone as the audio source.
Install the SDK mic addon
Python SDK Example
An SDK Example (mic_ws_continuous_sdk.py)
Code Samples
A Full Example (mic_ws_continuous.py)
Understanding VAD
Voice Activity Detection acts as the Fennec’s “ears,” listening for speech and silence to intelligently segment the continuous audio stream into meaningful chunks, or “utterances.” Properly tuning the VAD settings is the most critical step in tailoring the transcription behavior to your specific application. All VAD settings are passed inside the vad dictionary within your initial start message.
Core VAD Parameters
These parameters control the fundamental timing and sensitivity of the speech detection.
VAD Sensitivity. This float between 0.0 and 1.0 determines how loud a sound must be to be considered speech. Lower values (e.g., 0.3) are more sensitive and can pick up quieter speakers or whispers, but may also misclassify background noise as speech. Higher values (e.g., 0.7) are less sensitive and are ideal for ignoring background noise in loud environments. 0.5 is a balanced starting point.
End-of-Speech Trigger. This is the most important parameter for controlling the latency-vs-completeness tradeoff. It defines the duration of silence (in milliseconds) the system will wait for before finalizing an utterance and sending the transcript. A low value (e.g., 250) results in fast, short transcripts. A high value (e.g., 1000) results in longer, more complete sentences but increases the perceived delay.
Pre-Speech Audio Buffer. This setting captures a small amount of audio (in milliseconds) from before the VAD officially detected the start of speech. It is crucial for preventing the first syllable or word of an utterance from being cut off (e.g., ensuring “Okay, let’s start” isn’t transcribed as “kay, let’s start”).
Post-Utterance Silence. This specifies an extra duration of silence (in seconds) to append to the end of a finalized utterance before it’s sent for transcription. This can sometimes provide the ASR model with more non-speech context, which can help in correctly punctuating the very end of a sentence, but it will directly add to the overall latency. Use with caution.
Minimum Speech to Start. An utterance will not begin until the VAD detects at least this many milliseconds of continuous speech. This helps prevent very short vocal tics or brief background noises from incorrectly starting a new transcription segment.
Minimum Voiced Audio per Utterance. After an utterance is segmented, the system checks if it contains at least this much voiced audio. If not, the entire segment is discarded. This is a powerful tool for filtering out non-speech sounds like coughs, door slams, or clicks that might otherwise be long enough to form a segment.
Minimum Transcript Character Length. After a segment is transcribed, the system checks the character count of the resulting text. If it’s less than this value, the transcript is discarded. This is a post-processing filter useful for eliminating transcripts of filler words like “a” or “uh”.
Minimum Transcript Word Count. Similar to min_chars, but works on the word count. Setting this to 2 would discard single-word utterances (e.g., “Okay”, “Right”).
Non-Speech Utterance Extension. This allows a non-speech sound (below the threshold) to extend a currently active speech utterance for up to this many milliseconds. However, such a sound cannot start a new utterance. This is useful for cases where a speaker’s voice trails off or a quiet background noise occurs mid-sentence, preventing the utterance from cutting off prematurely.
Maximum Utterance Duration. If this is set to a value greater than 0 (e.g., 15000 for 15 seconds), it acts as a safety net. The system will automatically finalize and transcribe an utterance after it reaches this duration, even if the speaker has not paused. This prevents infinitely long transcripts from users who don’t pause naturally.
Enable Debug Logging. If set to true, the server will send additional diagnostic messages along with the transcripts. This is useful for development and fine-tuning VAD settings, but should be disabled in production.
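Taken together, an initial start message might look like the sketch below. The vad keys shown are the parameter names used in the scenarios later in this guide, mapped to the settings described above; the surrounding message structure and the default values are illustrative assumptions, not an authoritative reference.

```python
# Illustrative sketch only: the values and the non-"vad" fields are assumptions.
start_message = {
    "type": "start",
    "audio": {"sample_rate": 16000, "encoding": "pcm_s16le"},
    "vad": {
        "threshold": 0.5,         # VAD Sensitivity (0.0-1.0)
        "min_silence_ms": 500,    # End-of-Speech Trigger, in ms
        "start_trigger_ms": 100,  # Minimum Speech to Start, in ms
        "min_voiced_ms": 200,     # Minimum Voiced Audio per Utterance, in ms
        "min_chars": 0,           # Minimum Transcript Character Length
        "min_words": 0,           # Minimum Transcript Word Count
        "amp_extend": 300,        # Non-Speech Utterance Extension, in ms
        "force_decode_ms": 0,     # Maximum Utterance Duration (0 = disabled)
    },
}
```

Settings not referenced again below (the pre-speech buffer, post-utterance silence, and debug flag) follow the same pattern; their exact key names are omitted here rather than guessed.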
Scenario 1: General Purpose Dictation
This profile is balanced for users dictating notes or emails. It favors creating more complete sentences over instant responsiveness.
Rationale:
- min_silence_ms is high (800) to give users time to think between clauses without breaking the sentence.
- force_decode_ms is enabled as a fallback, ensuring that even if a user talks for a long time, they will get a transcript back every 20 seconds.
- threshold is slightly more sensitive (0.45) for more natural-sounding dictation.
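As a sketch, the vad block for this profile might look like the following. The three values come directly from the rationale above; settings not listed are left at their defaults.

```python
dictation_vad = {
    "threshold": 0.45,         # slightly more sensitive for quiet, natural dictation
    "min_silence_ms": 800,     # allow thinking pauses without splitting the sentence
    "force_decode_ms": 20000,  # fallback: finalize an utterance at least every 20 s
}
```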
Scenario 2: Transcription in a Noisy Environment
This profile is hardened for use cases like a call center or public kiosk where significant background noise is expected. The goal is to reject as much non-speech audio as possible.
Rationale:
- threshold is set high (0.65) to make the VAD less sensitive, effectively ignoring lower-volume background sounds.
- start_trigger_ms and min_voiced_ms are both significantly increased to ensure only strong, clear speech from the primary speaker creates an utterance.
- min_chars and min_words are increased to filter out partial words or phrases picked up from background conversations.
- amp_extend is disabled (0) to prevent background hum or noise from incorrectly keeping an utterance alive.
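A corresponding vad block might look like the sketch below. Only threshold (0.65) and amp_extend (0) are stated explicitly in the rationale; the remaining values are illustrative interpretations of “significantly increased,” not recommended defaults.

```python
noisy_environment_vad = {
    "threshold": 0.65,        # less sensitive, so quieter background sounds are ignored
    "start_trigger_ms": 250,  # require a longer run of clear speech to open an utterance
    "min_voiced_ms": 500,     # discard segments without substantial voiced audio
    "min_chars": 10,          # drop stray fragments picked up from background talk
    "min_words": 2,           # drop single-word utterances (e.g., "Okay")
    "amp_extend": 0,          # never let non-speech noise keep an utterance alive
}
```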