The following features are available in the Asynchronous Speech-to-Text and Streaming Speech-to-Text APIs.

Custom vocabularies

To improve the accuracy of automatic speech recognition (ASR) on words or terms that are not found in a standard English dictionary, submit them as a custom vocabulary. A custom vocabulary is submitted as a list of phrases. A phrase can be one word or several words, usually describing a single object or concept. Here is an example of submitting a custom vocabulary containing the made-up word sparkletini to the API:

curl -X POST "https://sttgw.voiceloft.com/api/v1/vocabularies" \
    -H "Authorization: <TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{ "custom_vocabularies": [{ "phrases": ["sparkletini"] }] }'
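
When the phrase list comes from user input or a file, it can help to build the request body programmatically rather than by hand. The following is a minimal sketch that produces the same JSON shape as the curl example above. The helper name build_vocabulary_payload is hypothetical, and the whitespace trimming and de-duplication are assumptions about sensible input hygiene, not documented API requirements.

```python
import json


def build_vocabulary_payload(phrases):
    """Return a JSON request body in the shape used by the curl example.

    Hypothetical helper: trimming and de-duplicating phrases is an
    assumption for illustration, not a documented API requirement.
    """
    cleaned = []
    for phrase in phrases:
        phrase = phrase.strip()
        # Skip empty strings and phrases that were already added.
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    return json.dumps({"custom_vocabularies": [{"phrases": cleaned}]})


body = build_vocabulary_payload(["sparkletini", " sparkletini ", ""])
print(body)
```

The resulting string can be passed as the request body (the -d argument in the curl example).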

Punctuation and inverse text normalization

Voiceloft automatically adds punctuation and performs inverse text normalization (ITN) on all processed audio. ITN is the process of converting spoken-form text to written-form text, for example in dates, times, and phone numbers.

Examples:

ITN is performed on all audio submitted to the Asynchronous Speech-to-Text API. For audio submitted to the Streaming Speech-to-Text API, ITN is only performed on Final Hypotheses.

Here is an example of a transcript containing punctuation:

{
  "monologues": [
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Hello",
          "confidence": 1
        },
        {
          "type": "punct",
          "value": " "
        },
        {
          "type": "text",
          "value": "World",
          "confidence": 0.8
        },
        {
          "type": "punct",
          "value": "."
        }
      ]
    },
    {
      ...
    }
  ]
}
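
Because punctuation and spacing arrive as separate "punct" elements, readable text can be recovered by concatenating element values in order. The sketch below does this for the example transcript above; the function name monologue_text is hypothetical, and it assumes every element carries a "value" field as in the example.

```python
def monologue_text(monologue):
    """Join the values of all elements (text and punct) in order."""
    return "".join(element["value"] for element in monologue["elements"])


# The first monologue from the example transcript above.
transcript = {
    "monologues": [
        {
            "speaker": 1,
            "elements": [
                {"type": "text", "value": "Hello", "confidence": 1},
                {"type": "punct", "value": " "},
                {"type": "text", "value": "World", "confidence": 0.8},
                {"type": "punct", "value": "."},
            ],
        }
    ]
}

for monologue in transcript["monologues"]:
    print(monologue["speaker"], monologue_text(monologue))
```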

Disfluency or filler word removal

Disfluencies, or filler words, break the flow of speech and are especially distracting in written text. When this setting is enabled, disfluencies do not appear in the transcription output. The APIs currently filter only "ums" and "uhs".

Profanity filtering

The current profanity dictionary contains approximately 600 profane words and phrases. When this feature is enabled, any transcribed word or phrase on this list is replaced with asterisks, except for its first and last characters.
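
The masking rule can be illustrated with a short sketch. This is not the service's implementation: the function name mask_word is hypothetical, and two details are assumptions the description does not specify, namely that each hidden character becomes exactly one asterisk and that words of one or two characters are left unchanged.

```python
def mask_word(word):
    """Keep the first and last characters and mask everything between.

    Assumptions (not confirmed by the documentation): one asterisk per
    hidden character, and words of length 2 or less are returned as-is.
    """
    if len(word) <= 2:
        return word
    return word[0] + "*" * (len(word) - 2) + word[-1]


# A mild placeholder stands in for an entry from the profanity list.
print(mask_word("darn"))
```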

Timestamps

The JSON transcription output includes timestamps for every transcribed word. Each word has a start timestamp (ts) and an end timestamp (end_ts) that mark when it is spoken within the audio. Timestamps can be used for alignment, analytics, live captions, and similar tasks.

Here is an example of a transcript with timestamps:

{
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hello",
          "ts": 0.66,
          "end_ts": 0.84,
          "confidence": 0.84
        },
        {
          "type": "text",
          "value": "World",
          "ts": 0.84,
          "end_ts": 1.05,
          "confidence": 0.99
        },
        {
          ...
        }
      ]
    }
  ]
}
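
As one way to use these fields, the sketch below extracts per-word timings from a monologue and computes how long each word lasts. The function name word_timings is hypothetical. It assumes that ts and end_ts are expressed in the same unit (the example values suggest seconds, but the unit is not stated here) and that only "text" elements carry timestamps.

```python
def word_timings(monologue):
    """Return (word, start, end, duration) for each text element."""
    timings = []
    for element in monologue["elements"]:
        # Assumption: only "text" elements carry ts and end_ts.
        if element["type"] != "text":
            continue
        start, end = element["ts"], element["end_ts"]
        # Round to avoid floating-point noise in the subtraction.
        timings.append((element["value"], start, end, round(end - start, 2)))
    return timings


# The two words from the example transcript above.
monologue = {
    "speaker": 0,
    "elements": [
        {"type": "text", "value": "Hello", "ts": 0.66, "end_ts": 0.84, "confidence": 0.84},
        {"type": "text", "value": "World", "ts": 0.84, "end_ts": 1.05, "confidence": 0.99},
    ],
}

for word, start, end, duration in word_timings(monologue):
    print(f"{word}: {start} to {end} (duration {duration})")
```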