# Live Transcription

> Stream audio directly from any source and receive low-latency transcripts in real time over our WebSocket endpoint, with fine-grained VAD controls.

For applications that require immediate feedback, such as AI voice agents, the WebSocket endpoint provides real-time transcription. You can stream raw audio and receive transcripts as soon as a speaker pauses, with sub-250 ms latency.

This endpoint's behavior is primarily controlled by Voice Activity Detection (VAD), which intelligently segments the audio stream into utterances based on speech and silence.

## How It Works

1. **Get a streaming token (required):**
   Make an HTTP POST to `/api/v1/transcribe/streaming-token` with your API key in the `X-API-Key` header.
   The response is `{ "token": "<JWT>" }`. Tokens are short-lived; fetch a new one for each connection.

2. **Connect:**
   Open a WebSocket to
   `wss://api.fennec-asr.com/api/v1/transcribe/stream?streaming_token=<JWT>`

3. **Configure:**
   Send a single JSON `{"type":"start", ...}` message (audio + VAD + optional features).

4. **Stream:**
   Send raw audio as binary messages.

5. **Receive:**
   Read JSON messages (finalized transcripts, and optionally VAD/utterance events).

6. **Close:**
   Send `{"type":"eos"}` to gracefully end the stream.
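
The connection URL and the two control messages from the steps above can be assembled with the standard library alone. This is a minimal sketch: the helper names `build_stream_url`, `build_start_message`, and `EOS_MESSAGE` are illustrative and not part of any Fennec SDK, and the field names mirror the start message used in the examples on this page.

```python
import json
from typing import Optional
from urllib.parse import urlencode

WS_BASE = "wss://api.fennec-asr.com/api/v1/transcribe/stream"

# Sent as a text message to end the stream gracefully (step 6).
EOS_MESSAGE = json.dumps({"type": "eos"})


def build_stream_url(token: str) -> str:
    """Return the WebSocket URL with the token URL-encoded in the query string (step 2)."""
    return f"{WS_BASE}?{urlencode({'streaming_token': token})}"


def build_start_message(sample_rate: int = 16000, channels: int = 1,
                        vad: Optional[dict] = None) -> str:
    """Serialize the one-time configuration message (step 3)."""
    msg = {"type": "start", "sample_rate": sample_rate, "channels": channels}
    if vad:
        msg["vad"] = vad
    return json.dumps(msg)


print(build_stream_url("abc.def"))
print(build_start_message(vad={"min_silence_ms": 300}))
```

URL-encoding the token keeps the query string valid even if the token contains characters such as `+`, `/`, or `=`.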

### Get your API Key

Your API key is used to obtain a streaming token. Follow [this link](https://app.fennec-asr.com) to create your first API key.

***

The following example uses your machine's microphone as the audio source.

### Install the SDK mic addon

```bash theme={null}
pip install fennec-asr[mic]
```

### Python SDK Example

```python An SDK Example (mic_ws_continuous_sdk.py) theme={null}
import os, asyncio
from fennec_asr import Realtime
from fennec_asr.mic import stream_microphone

API_KEY = os.getenv("FENNEC_API_KEY")  # or replace with your key as a string
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_MS = 32
SINGLE_UTTERANCE = False
DETECT_THOUGHTS = False

# Note: This is a very aggressive VAD setting for high speed transcription.
VAD = {
    "threshold": 0.5,  # VAD sensitivity. Lower is more sensitive. 0.5 is a good baseline.
    "min_silence_ms": 100,  # Milliseconds of silence to trigger an End-of-Speech event.
    "speech_pad_ms": 200,  # Milliseconds of audio to include before speech starts.
    "final_silence_s": 0.1, # Additional silence to wait for at the end.
    "start_trigger_ms": 36,  # How many ms of speech are needed to start transcribing.
    "min_voiced_ms": 48,  # Utterance is discarded if it has less than this much voiced audio.
    "min_chars": 1,  # Discard transcript if it has fewer characters than this.
    "min_words": 1,  # Discard transcript if it has fewer words than this.
    "amp_extend": 1200,  # A non-speech noise can extend utterance, but not start it.
    "force_decode_ms": 0,  # Forces a transcription after a set amount of ms have elapsed
    "debug": False,   # show debugging messages
}

async def main():
    if not API_KEY:
        raise RuntimeError("Set FENNEC_API_KEY")

    rt = (
        Realtime(API_KEY, sample_rate=SAMPLE_RATE, channels=CHANNELS, detect_thoughts=DETECT_THOUGHTS)
        .on("open",  lambda: print("✅ ready"))
        .on("final", lambda t: print("📝", t))
        .on("error", lambda e: print("❌", e))
    )

    rt._start_msg["single_utterance"] = SINGLE_UTTERANCE
    rt._start_msg["vad"] = VAD

    async with rt:
        await stream_microphone(rt, samplerate=SAMPLE_RATE, channels=CHANNELS, chunk_ms=CHUNK_MS)

if __name__ == "__main__":
    asyncio.run(main())

```
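
As a quick check on the audio settings above, the size of each binary message follows from the sample rate, channel count, and chunk duration. The sketch below assumes 16-bit (2-byte) PCM samples, as used by the raw WebSocket example on this page; `chunk_size` is an illustrative helper, not an SDK function.

```python
def chunk_size(sample_rate: int, channels: int, chunk_ms: int,
               bytes_per_sample: int = 2) -> tuple[int, int]:
    """Return (frames, bytes) for one audio chunk lasting `chunk_ms` milliseconds."""
    frames = sample_rate * chunk_ms // 1000
    return frames, frames * channels * bytes_per_sample


print(chunk_size(16000, 1, 32))   # (512, 1024)
print(chunk_size(16000, 1, 100))  # (1600, 3200)
```

Smaller chunks reach the server sooner but produce more messages per second; larger chunks do the opposite.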

<Accordion icon="language" title="Code Samples">
  <CodeGroup dropdown>
    ```python A Full Example (mic_ws_continuous.py) theme={null}
    import asyncio
    import json
    import os
    import signal

    import httpx  # HTTP client used to fetch the streaming token
    import sounddevice as sd
    import websockets

    # --- API Configuration ---
    # 1) Fetch a short-lived streaming token over HTTPS using your API key.
    # 2) Use that token to authenticate the WebSocket connection via the URL query string.
    HTTP_TOKEN_URL = "https://api.fennec-asr.com/api/v1/transcribe/streaming-token"
    WS_BASE = "wss://api.fennec-asr.com/api/v1/transcribe/stream"
    API_KEY = os.getenv("FENNEC_API_KEY") or "YOUR_API_KEY_HERE"

    # --- Audio Configuration ---
    SAMPLE_RATE = 16_000
    CHANNELS = 1
    DTYPE = "int16"
    CHUNK_MS = 100
    FRAMES_PER_CHUNK = int(SAMPLE_RATE * (CHUNK_MS / 1000.0))

    # --- WebSocket Start Message ---
    # This dictionary is sent once upon connection to configure the VAD.
    # Note: This is a very aggressive VAD setting for high speed transcription.
    START_MSG = {
        "type": "start",
        "sample_rate": 16000,
        "channels": 1,
        "single_utterance": False,  # Keep connection open after each transcript
        "vad": {
            "threshold": 0.5,        # VAD sensitivity. Lower is more sensitive. 0.5 is a good baseline.
            "min_silence_ms": 100,   # Milliseconds of silence to trigger an End-of-Speech event.
            "speech_pad_ms": 200,    # Milliseconds of audio to include before speech starts.
            "final_silence_s": 0.1,  # Additional silence to wait for at the end.
            "start_trigger_ms": 36,  # How many ms of speech are needed to start transcribing.
            "min_voiced_ms": 48,     # Utterance is discarded if it has less than this much voiced audio.
            "min_chars": 1,          # Discard transcript if it has fewer characters than this.
            "min_words": 1,          # Discard transcript if it has fewer words than this.
            "amp_extend": 1200,      # A non-speech noise can extend utterance, but not start it.
            "force_decode_ms": 0,    # Forces a transcription after a set amount of ms have elapsed
            "debug": False,          # show debugging messages
        }
    }

    # --- Graceful Shutdown ---
    shutdown_event = asyncio.Event()

    def _handle_sigint(*_):
        print("\nStopping…", flush=True)
        shutdown_event.set()

    async def fetch_streaming_token() -> str:
        """
        Exchanges your API key for a short-lived streaming token.
        Send the token via ?streaming_token=... in the WS URL.
        """
        if not API_KEY or API_KEY == "YOUR_API_KEY_HERE":
            raise RuntimeError("Set FENNEC_API_KEY (or replace API_KEY).")
        async with httpx.AsyncClient(timeout=10) as client:
            # The ASR API reads your key from the X-API-Key header and returns {"token": "<jwt>"}
            resp = await client.post(
                HTTP_TOKEN_URL,
                headers={"X-API-Key": API_KEY, "content-type": "application/json"},
                json={},  # body not required; header is what matters
            )
            resp.raise_for_status()
            data = resp.json()
            token = data.get("token")
            if not token:
                raise RuntimeError(f"Token endpoint returned no token: {data}")
            return token

    async def audio_sender(ws, stream):
        """Reads from the mic and sends audio chunks to the WebSocket."""
        print("\n🎙️  Streaming mic… speak, and pause to see the transcript. (Ctrl+C to stop)", flush=True)
        while not shutdown_event.is_set():
            try:
                data, _ = await asyncio.to_thread(stream.read, FRAMES_PER_CHUNK)
                await ws.send(bytes(data))
            except websockets.exceptions.ConnectionClosed:
                break

        try:
            await ws.send('{"type":"eos"}')
        except websockets.exceptions.ConnectionClosed:
            pass

    async def message_receiver(ws):
        """Listens for messages from the server and prints them."""
        async for msg in ws:
            try:
                data = json.loads(msg)
                if text := data.get("text"):
                    print(text, flush=True)
            except Exception:
                # Ignore non-JSON messages
                pass

    async def main():
        """Main function to connect and stream microphone audio."""
        print("Fetching streaming token…")
        token = await fetch_streaming_token()
        WEBSOCKET_URL = f"{WS_BASE}?streaming_token={token}"

        print(f"Connecting to: {WEBSOCKET_URL}")
        try:
            async with websockets.connect(
                WEBSOCKET_URL,
                max_size=None,
                ping_interval=5,
            ) as ws:
                # Send the configuration message.
                await ws.send(json.dumps(START_MSG))
                print("✅ WebSocket connected. Sent VAD configuration.")

                with sd.RawInputStream(
                        samplerate=SAMPLE_RATE,
                        channels=CHANNELS,
                        dtype=DTYPE,
                        blocksize=FRAMES_PER_CHUNK,
                ) as stream:
                    # Run the sender and receiver concurrently.
                    sender_task = asyncio.create_task(audio_sender(ws, stream))
                    receiver_task = asyncio.create_task(message_receiver(ws))
                    await asyncio.wait([sender_task, receiver_task], return_when=asyncio.FIRST_COMPLETED)

        except Exception as e:
            print(f"❌ An unexpected error occurred: {e}")

    if __name__ == "__main__":
        if not sd.query_devices(kind='input'):
            print("\n❌ No input microphone found.")
        else:
            signal.signal(signal.SIGINT, _handle_sigint)
            asyncio.run(main())
    ```

    ```ts theme={null}
    #!/usr/bin/env node
    "use strict";

    /**
     * Streams microphone audio to the ASR WebSocket API with VAD configuration,
     * printing finalized transcripts as they arrive. (Ctrl+C to stop)
     *
     * Auth flow:
     *   1) POST your API key to the HTTPS token endpoint to get a short-lived streaming token.
     *   2) Connect to the WebSocket using ?streaming_token=... in the URL (no header needed).
     *
     * Requires:
     *   - Node 18+ (for global fetch)
     *   - npm i ws mic dotenv
     *   - A working system recorder (arecord/sox/CoreAudio depending on OS)
     *   - .env with FENNEC_API_KEY=your_key
     */

    require("dotenv").config();

    const HTTP_TOKEN_URL = "https://api.fennec-asr.com/api/v1/transcribe/streaming-token";
    const WS_BASE = "wss://api.fennec-asr.com/api/v1/transcribe/stream";
    const KEY = process.env.FENNEC_API_KEY;
    if (!KEY) {
      console.error("Set FENNEC_API_KEY in your environment (or .env).");
      process.exit(1);
    }

    // --- Audio Configuration ---
    const SAMPLE_RATE = 16_000;
    const CHANNELS = 1;
    const CHUNK_MS = 100;
    const DTYPE_BYTES = 2; // int16 -> 2 bytes/sample
    const FRAMES_PER_CHUNK = Math.floor(SAMPLE_RATE * (CHUNK_MS / 1000));
    const CHUNK_BYTES = FRAMES_PER_CHUNK * CHANNELS * DTYPE_BYTES;

    // --- Start Message (matches the Python example) ---
    const START_MSG = {
      type: "start",
      sample_rate: 16000,
      channels: 1,
      single_utterance: false, // Keep connection open after each transcript
      vad: {
        threshold: 0.5,        // Lower => more sensitive
        min_silence_ms: 100,   // Silence to trigger end-of-speech
        speech_pad_ms: 200,    // Audio to include before speech starts
        final_silence_s: 0.1,  // Extra silence to wait at the end
        start_trigger_ms: 36,  // How many ms of speech to start transcribing
        min_voiced_ms: 48,     // Discard utterances with less voiced audio
        min_chars: 1,
        min_words: 1,
        amp_extend: 1200,      // Non-speech noise can extend utterance, not start it
        force_decode_ms: 0,    // Force a transcription after N ms
        debug: false,
      },
    };

    (async () => {
      const WebSocket = require("ws");
      const mic = require("mic");

      async function fetchStreamingToken() {
        // The ASR API reads your key from the X-API-Key header and returns { token: "<jwt>" }
        const resp = await fetch(HTTP_TOKEN_URL, {
          method: "POST",
          headers: {
            "X-API-Key": KEY,
            "content-type": "application/json",
          },
          body: JSON.stringify({}), // body not required; header is what matters
        });
        if (!resp.ok) {
          const text = await resp.text().catch(() => "");
          throw new Error(`Token request failed: ${resp.status} ${text}`);
        }
        const data = await resp.json();
        if (!data?.token) throw new Error("Token endpoint returned no token");
        return data.token;
      }

      console.log("Fetching streaming token…");
      const token = await fetchStreamingToken();
      const WEBSOCKET_URL = `${WS_BASE}?streaming_token=${encodeURIComponent(token)}`;
      console.log(`Connecting to: ${WEBSOCKET_URL}`);

      const ws = new WebSocket(WEBSOCKET_URL, {
        perMessageDeflate: false,
        maxPayload: 0,
      });

      // Graceful shutdown helpers
      let shuttingDown = false;
      const shutdown = (micInstance) => {
        if (shuttingDown) return;
        shuttingDown = true;
        console.log("\nStopping…");
        try {
          if (ws.readyState === WebSocket.OPEN) ws.send('{"type":"eos"}');
        } catch {}
        try {
          micInstance?.stop();
        } catch {}
        try {
          ws.close();
        } catch {}
      };

      process.on("SIGINT", () => shutdown());

      ws.on("open", () => {
        (async () => {
          try {
            await ws.send(JSON.stringify(START_MSG));
            console.log("✅ WebSocket connected. Sent VAD configuration.");

            // Configure mic capture (raw signed 16-bit little-endian, mono, 16 kHz)
            const micInstance = mic({
              rate: String(SAMPLE_RATE),
              channels: String(CHANNELS),
              bitwidth: "16",
              encoding: "signed-integer",
              endian: "little",
              device: process.env.MIC_DEVICE || undefined, // optional override
              fileType: "raw",
              exitOnSilence: 0,
            });

            const micStream = micInstance.getAudioStream();

            let buffer = Buffer.alloc(0);

            console.log(
              "\n🎙️  Streaming mic… speak, and pause to see the transcript. (Ctrl+C to stop)"
            );

            micStream.on("data", (chunk) => {
              // Accumulate mic bytes and send fixed-size chunks
              buffer = Buffer.concat([buffer, chunk]);
              while (buffer.length >= CHUNK_BYTES && ws.readyState === WebSocket.OPEN) {
                const slice = buffer.subarray(0, CHUNK_BYTES);
                ws.send(slice); // binary audio payload
                buffer = buffer.subarray(CHUNK_BYTES);
              }
            });

            micStream.on("error", (err) => {
              console.error("Mic error:", err?.message || err);
            });

            ws.on("message", (msg) => {
              try {
                const data = JSON.parse(msg.toString("utf8"));
                if (data?.text) {
                  console.log(data.text);
                }
              } catch {
                // Ignore non-JSON messages
              }
            });

            ws.on("close", () => shutdown(micInstance));
            ws.on("error", (e) => {
              console.error("WebSocket error:", e?.message || e);
              shutdown(micInstance);
            });

            micInstance.start();
          } catch (e) {
            console.error("Startup error:", e?.message || e);
            shutdown();
          }
        })();
      });

      ws.on("error", (e) => {
        console.error("WebSocket error (pre-open):", e?.message || e);
      });
    })();

    ```
  </CodeGroup>
</Accordion>

## Understanding VAD

Voice Activity Detection acts as Fennec's "ears," listening for speech and silence to segment the continuous audio stream into meaningful chunks, or "utterances." Tuning the VAD settings is the most important step in tailoring transcription behavior to your application.

All VAD settings are passed inside the `vad` dictionary within your initial `start` message.

### Core VAD Parameters

These parameters control the fundamental timing and sensitivity of the speech detection.

<ParamField body="vad.threshold" type="float">
  VAD Sensitivity. This float between 0.0 and 1.0 determines how loud a sound must be to be considered speech. Lower values (e.g., 0.3) are more sensitive and can pick up quieter speakers or whispers, but may also misclassify background noise as speech. Higher values (e.g., 0.7) are less sensitive and are ideal for ignoring background noise in loud environments. 0.5 is a balanced starting point.
</ParamField>

<ParamField body="vad.min_silence_ms" type="integer">
  End-of-Speech Trigger. This is the most important parameter for controlling the latency-vs-completeness tradeoff. It defines the duration of silence (in milliseconds) the system waits for before finalizing an utterance and sending the transcript. A low value (e.g., 250) results in fast, short transcripts. A high value (e.g., 1000) results in longer, more complete sentences but increases the perceived delay.
</ParamField>
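
To make the `vad.min_silence_ms` tradeoff concrete, here is a toy segmenter over per-frame speech flags. It only illustrates how a silence timeout splits a stream into utterances and is not Fennec's server-side implementation; the function name `segment` and the 20 ms frame size are assumptions.

```python
def segment(frames: list[bool], min_silence_ms: int,
            frame_ms: int = 20) -> list[tuple[int, int]]:
    """Split per-frame speech flags into (start_ms, end_ms) utterances.

    An utterance ends once `min_silence_ms` of continuous silence has elapsed.
    """
    utterances = []
    start = None  # start time of the open utterance, if any
    silence = 0   # continuous silence accumulated inside the open utterance
    for i, is_speech in enumerate(frames):
        t = i * frame_ms
        if is_speech:
            if start is None:
                start = t
            silence = 0
        elif start is not None:
            silence += frame_ms
            if silence >= min_silence_ms:
                # Close the utterance at the end of the last speech frame.
                utterances.append((start, t + frame_ms - silence))
                start, silence = None, 0
    if start is not None:
        # Stream ended while an utterance was still open.
        utterances.append((start, len(frames) * frame_ms - silence))
    return utterances


# 200 ms speech, 300 ms pause, 200 ms speech, 1000 ms silence
flags = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 50
print(segment(flags, min_silence_ms=250))   # [(0, 200), (500, 700)]
print(segment(flags, min_silence_ms=1000))  # [(0, 700)]
```

With the lower timeout the 300 ms pause splits the speech into two utterances; with the higher timeout the pause is absorbed and a single, longer utterance is produced, at the cost of waiting a full second after the speaker stops.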

<ParamField body="vad.speech_pad_ms" type="integer">
  Pre-Speech Audio Buffer. This setting captures a small amount of audio (in milliseconds) from before the VAD officially detected the start of speech. It is crucial for preventing the first syllable or word of an utterance from being cut off (e.g., ensuring "Okay, let's start" isn't transcribed as "kay, let's start").
</ParamField>

<ParamField body="vad.final_silence_s" type="float">
  Post-Utterance Silence. This specifies an extra duration of silence (in seconds) to append to the end of a finalized utterance before it's sent for transcription. This can sometimes provide the ASR model with more non-speech context, which can help in correctly punctuating the very end of a sentence, but it will directly add to the overall latency. Use with caution.
</ParamField>

### Advanced Filtering and Control Parameters

These parameters provide fine-grained control over what constitutes a valid utterance, helping to filter out noise and unwanted segments.

<ParamField body="vad.start_trigger_ms" type="integer">
  Minimum Speech to Start. An utterance will not begin until the VAD detects at least this many milliseconds of continuous speech. This helps prevent very short vocal tics or brief background noises from incorrectly starting a new transcription segment.
</ParamField>

<ParamField body="vad.min_voiced_ms" type="integer">
  Minimum Voiced Audio per Utterance. After an utterance is segmented, the system checks if it contains at least this much voiced audio. If not, the entire segment is discarded. This is a powerful tool for filtering out non-speech sounds like coughs, door slams, or clicks that might otherwise be long enough to form a segment.
</ParamField>

<ParamField body="vad.min_chars" type="integer">
  Minimum Transcript Character Length. After a segment is transcribed, the system checks the character count of the resulting text. If it's less than this value, the transcript is discarded. This is a post-processing filter useful for eliminating transcripts of filler words like "a" or "uh".
</ParamField>

<ParamField body="vad.min_words" type="integer">
  Minimum Transcript Word Count. Similar to `min_chars`, but works on the word count. Setting this to 2 would discard single-word utterances (e.g., "Okay", "Right").
</ParamField>
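
The two transcript filters, `min_chars` and `min_words`, can be pictured as a simple check on the decoded text. The sketch below is an approximation for experimenting client-side: `keep_transcript` is a hypothetical helper, and how the server actually counts characters (for example, whether whitespace or punctuation is included) is an assumption here.

```python
def keep_transcript(text: str, min_chars: int = 1, min_words: int = 1) -> bool:
    """Return True if the transcript passes both length filters."""
    stripped = text.strip()
    # Characters are counted after trimming surrounding whitespace (assumption);
    # words are whitespace-separated tokens.
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words


print(keep_transcript("uh", min_chars=3))                              # False
print(keep_transcript("Okay", min_words=2))                            # False
print(keep_transcript("Okay, let's start", min_chars=3, min_words=2))  # True
```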

<ParamField body="vad.amp_extend" type="integer">
  Non-Speech Utterance Extension. This allows a non-speech sound (below the threshold) to extend a currently active speech utterance for up to this many milliseconds. However, such a sound cannot start a new utterance. This is useful for cases where a speaker's voice trails off or a quiet background noise occurs mid-sentence, preventing the utterance from cutting off prematurely.
</ParamField>

<ParamField body="vad.force_decode_ms" type="integer">
  Maximum Utterance Duration. If this is set to a value greater than 0 (e.g., 15000 for 15 seconds), it acts as a safety net. The system will automatically finalize and transcribe an utterance after it reaches this duration, even if the speaker has not paused. This prevents infinitely long transcripts from users who don't pause naturally.
</ParamField>

<ParamField body="vad.debug" type="boolean">
  Enable Debug Logging. If set to true, the server will send additional diagnostic messages along with the transcripts. This is useful for development and fine-tuning VAD settings, but should be disabled in production.
</ParamField>
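
A mistyped key or out-of-range value in the `vad` dictionary is easy to miss, so a small client-side check before sending the start message can help. This is a sketch only: `check_vad` is a hypothetical helper, the types come from the parameter list above, and the rules for unknown keys and negative values are assumptions, since this page does not say how the server treats them.

```python
VAD_TYPES = {
    "threshold": float, "min_silence_ms": int, "speech_pad_ms": int,
    "final_silence_s": float, "start_trigger_ms": int, "min_voiced_ms": int,
    "min_chars": int, "min_words": int, "amp_extend": int,
    "force_decode_ms": int, "debug": bool,
}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_vad(vad: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict looks sane."""
    problems = []
    for key, value in vad.items():
        expected = VAD_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key}")
        elif expected is float:
            if not _is_number(value):
                problems.append(f"{key} should be a number")
        elif expected is int:
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{key} should be an integer")
            elif value < 0:  # assumption: negative durations/counts are invalid
                problems.append(f"{key} should not be negative")
        elif not isinstance(value, bool):
            problems.append(f"{key} should be a boolean")
    threshold = vad.get("threshold")
    if _is_number(threshold) and not 0.0 <= threshold <= 1.0:
        problems.append("threshold should be between 0.0 and 1.0")
    return problems


print(check_vad({"threshold": 0.5, "min_silence_ms": 100, "debug": False}))
print(check_vad({"threshold": 1.5, "min_silence": 100}))
```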

### VAD Configuration Examples

Here are two fully configured `START_MSG` examples tailored for different real-world situations.

<CardGroup cols={1}>
  <Card title="Scenario 1: General Purpose Dictation" icon="keyboard">
    This profile is balanced for users dictating notes or emails. It favors creating more complete sentences over instant responsiveness.

    ```python theme={null}
    START_MSG = {
        "type": "start",
        "sample_rate": 16000,
        "channels": 1,
        "single_utterance": False,
        "vad": {
            "threshold": 0.45,
            "min_silence_ms": 800,      # Longer silence to allow for pauses mid-sentence
            "speech_pad_ms": 200,
            "start_trigger_ms": 36,
            "min_voiced_ms": 100,
            "min_chars": 2,
            "min_words": 1,
            "force_decode_ms": 60000   # Force a transcript after 60s for long-winded speakers
        }
    }
    ```

    **Rationale:**

    * `min_silence_ms` is high (`800`) to give users time to think between clauses without breaking the sentence.
    * `force_decode_ms` is enabled as a fallback (`60000`), ensuring that even if a user talks for a long time without pausing, they get a transcript back at least every 60 seconds.
    * `threshold` is slightly more sensitive (`0.45`) so that quieter speech, which is common during relaxed dictation, is still detected.
  </Card>

  <Card title="Scenario 2: Transcription in a Noisy Environment" icon="volume-high">
    This profile is hardened for use-cases like a call center or public kiosk where significant background noise is expected. The goal is to reject as much non-speech audio as possible.

    ```python theme={null}
    START_MSG = {
        "type": "start",
        "sample_rate": 16000,
        "channels": 1,
        "single_utterance": False,
        "vad": {
            "threshold": 0.65,          # High threshold to ignore background chatter
            "min_silence_ms": 400,
            "speech_pad_ms": 150,
            "start_trigger_ms": 100,    # Requires a clear, intentional start to speech
            "min_voiced_ms": 250,       # Very high requirement for voiced audio
            "min_chars": 1,
            "min_words": 1,
            "amp_extend": 0,            # Disable extension from quiet sounds
            "force_decode_ms": 0
        }
    }
    ```

    **Rationale:**

    * `threshold` is set high (`0.65`) to make the VAD less sensitive, effectively ignoring lower-volume background sounds.
    * `start_trigger_ms` and `min_voiced_ms` are both significantly increased to ensure only strong, clear speech from the primary speaker creates an utterance.
    * `min_chars` and `min_words` are left at `1` here; raise them if partial words or phrases from background conversations still make it into your transcripts.
    * `amp_extend` is disabled (`0`) to prevent background hum or noise from incorrectly keeping an utterance alive.
  </Card>

  Tuning VAD is an iterative process. Giving your end users indirect control over these parameters can improve the experience for their specific situation.
</CardGroup>
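
One way to give end users indirect control is to expose a single, friendly setting and translate it into VAD values yourself. The sketch below maps a hypothetical 0.0 to 1.0 "patience" slider onto `min_silence_ms`, interpolating between the low (250) and high (1000) values mentioned for that parameter; the slider, the function name, and the linear mapping are all assumptions to adapt to your application.

```python
def vad_for_patience(patience: float, base: dict) -> dict:
    """Map a 0.0-1.0 'patience' slider onto min_silence_ms, leaving other keys untouched.

    0.0 -> 250 ms (fast, short transcripts); 1.0 -> 1000 ms (longer, more complete sentences).
    """
    patience = min(1.0, max(0.0, patience))  # clamp out-of-range input
    vad = dict(base)  # copy so the caller's base profile is not mutated
    vad["min_silence_ms"] = round(250 + patience * (1000 - 250))
    return vad


base = {"threshold": 0.5, "min_silence_ms": 400, "speech_pad_ms": 200}
print(vad_for_patience(0.0, base)["min_silence_ms"])  # 250
print(vad_for_patience(0.5, base)["min_silence_ms"])  # 625
print(vad_for_patience(1.0, base)["min_silence_ms"])  # 1000
```

The returned dictionary can be placed under the `vad` key of the start message for the next connection.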
