Get Started

This short tutorial will teach you the basics of using the Streaming Speech-to-Text API. It demonstrates how to produce a transcript of an audio stream in real time.

Assumptions

This tutorial assumes that you have an access token.

Establishing a WebSocket Connection

To connect to the real-time endpoint, you must use a WebSocket client and establish a connection with wss://sttgw.voiceloft.com/ws/v1/upload.

Authentication is handled via the authorization header. The value of this header must be your access token.

Once your request is authorized and the connection established, your client will receive the following JSON data:

{
   "session_id":"<ID>",
   "type":"connected"
}
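The connection steps above can be sketched in Python. This is a minimal sketch, not a definitive client: it assumes the third-party websockets library (the header keyword is additional_headers in recent versions, extra_headers in older ones), and the placeholder token is yours to supply.

```python
# Sketch: open the WebSocket and wait for the "connected" handshake message.
# Assumes the "websockets" library (pip install websockets).
import asyncio
import json

WS_URL = "wss://sttgw.voiceloft.com/ws/v1/upload"

def parse_handshake(raw):
    """Return the session ID from the initial 'connected' JSON message."""
    msg = json.loads(raw)
    if msg.get("type") != "connected":
        raise RuntimeError("unexpected handshake message: %s" % raw)
    return msg["session_id"]

async def open_session(token):
    # Import here so the helper above stays usable without the library.
    import websockets
    async with websockets.connect(
        WS_URL,
        additional_headers={"authorization": token},  # use extra_headers on older versions
    ) as ws:
        session_id = parse_handshake(await ws.recv())
        print("connected, session:", session_id)
        return ws
```

To run the sketch, call asyncio.run(open_session("<YOUR_ACCESS_TOKEN>")) with your actual token.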

You must establish a new connection each time you send an audio file.

Submit an Audio Stream for Transcription and Retrieve the Result

When sending audio over the WebSocket connection, you should send a JSON payload with one of the following parameters.

{
   "audio_data": "..."
}

audio_data is a base64-encoded audio file. Base64 encoding is a simple way to encode your raw audio data so that it can be included as a JSON parameter in your WebSocket message.
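Building this payload takes only the standard library. A small sketch, assuming you have the raw audio bytes in hand:

```python
# Sketch: wrap raw audio bytes in the audio_data JSON message described above.
import base64
import json

def audio_payload(audio_bytes):
    """Base64-encode raw audio and serialize it as the JSON message body."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"audio_data": encoded})
```

You would then send the result over the open connection, e.g. await ws.send(audio_payload(open("clip.wav", "rb").read())) (the file name here is illustrative).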

To send audio via URL, clients should send a JSON message with the following field:

{
   "url": ".."
}

url is the URL of your own audio file.
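The URL variant is even simpler to construct; a minimal sketch (the example URL is a placeholder):

```python
# Sketch: build the url JSON message described above.
import json

def url_payload(audio_url):
    """Serialize a remote audio file URL as the JSON message body."""
    return json.dumps({"url": audio_url})
```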

Responses

All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis.

While clients are streaming audio data, the API processes it and returns partial hypotheses. A partial hypothesis is the model's best guess of what was said up to that moment in time.
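A receive loop can separate partial from final hypotheses as they arrive. This sketch assumes the response schema: the field names "type" and "text" are hypothetical here, so check the API's response reference for the actual keys.

```python
# Sketch: classify incoming transcript messages as partial or final.
# The "type" and "text" field names are assumptions, not confirmed schema.
import json

def classify(raw):
    """Return (kind, text) for a serialized transcript response."""
    msg = json.loads(raw)
    return msg.get("type"), msg.get("text")

async def receive_transcripts(ws):
    # Iterate over incoming text messages until the socket closes.
    async for raw in ws:
        kind, text = classify(raw)
        if kind == "final":
            print("final:", text)
        else:
            print("partial:", text)
```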