Speech To Text With JavaScript

I've recently been playing with some options to automatically convert speech into text. My original plan was to do it on the back-end in the ASP.NET Core server, but testing turned up a few issues which led me to look for alternatives.

Euan T

21 Jan 2024

When testing System.Speech.Recognition.SpeechRecognitionEngine, I found that:

- The API is only available on Windows.
- The accuracy seemed to be pretty poor.

I therefore went looking for alternative options and stumbled across the Web Speech API. This API provides support for both speech recognition (speech to text) and speech synthesis (text to speech) in the web browser. My thinking was that I could convert the speech to text on the client side, write it into a hidden input and post that back to the back-end.
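A minimal sketch of that hand-off, assuming the form contains a hidden input. The storeTranscript helper and the "transcript" field name are hypothetical; in the browser the field would be the real hidden input element, whereas here it is a plain object so the snippet runs anywhere:

```javascript
// Hypothetical helper: copies recognised text into a form field so that it is
// submitted with the rest of the form. `field` is anything with a `value`
// property, e.g. an <input type="hidden" name="transcript"> element.
function storeTranscript(field, transcript) {
   // Collapse runs of whitespace so the server receives a tidy string.
   field.value = transcript.replace(/\s+/g, ' ').trim();

   return field.value;
}

// Stand-in for the hidden input so the sketch also runs outside a browser.
const hiddenInput = { name: 'transcript', value: '' };

storeTranscript(hiddenInput, '  hello   world ');
console.log(hiddenInput.value); // "hello world"
```

The back-end then reads the value like any other posted form field.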

In this post I'm most interested in the SpeechRecognition class. It has an easy-to-use API that supports custom grammars, specifying the language to use for recognition, and both interim and final results, each with a confidence score.

It's probably best to demonstrate with a quick code sample. You can preview it on CodePen too. Unfortunately I can't embed the CodePen here as the API is blocked when embedded in an iframe. This code sample uses React for state and event handling:

import React from "https://esm.sh/react";
import ReactDOM from "https://esm.sh/react-dom/client";

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition || null;

(function(document, undefined) {
   const targetElement = document.getElementById("root");

   if (!targetElement) {
      console.error('failed to find root');

      return;
   }

   const root = ReactDOM.createRoot(targetElement);

   root.render(
      <React.StrictMode>
         <SpeechRecognizerDemo />
      </React.StrictMode>
   );
})(document);

function SpeechRecognizerDemo({}) {
   const [recognizeSpeech, setRecognizeSpeech] = React.useState(false);
   const [recognizedSpeech, setRecognizedSpeech] = React.useState('');

   React.useEffect(() => {
      let speechRecognizer = null;

      if (recognizeSpeech) {
         setRecognizedSpeech('');

         speechRecognizer = new SpeechRecognition();

         speechRecognizer.continuous = true;
         speechRecognizer.interimResults = false;
         speechRecognizer.maxAlternatives = 1;

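         speechRecognizer.onerror = (event) => {
            // event.error holds a short error code such as "not-allowed" or "no-speech".
            console.error('Speech recognition error:', event.error);
         };

         speechRecognizer.onresult = (event) => {
            // event.results contains every result for this session,
            // so rebuild the whole transcript from the final results each time.
            const transcript = Array.from(event.results)
               .filter(result => result.isFinal)
               .map(result => result[0].transcript.trim())
               .join(' ');

            // Capitalise the first non-whitespace character.
            setRecognizedSpeech(transcript.replace(/\S/, c => c.toUpperCase()));
         };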

         speechRecognizer.start();
      }

      return () => {
         if (speechRecognizer) {
            speechRecognizer.stop();
         }
      };
   }, [recognizeSpeech]);

   if (!SpeechRecognition) {
      return <div className="alert alert-danger">
         Your web browser doesn't support speech recognition. Please try in a Chromium based browser or Safari.
      </div>
   }

   return <div className="container-fluid">
      <div className="row">
         <div className="col">
            <button type="button"
               className={`btn btn-sm btn-${recognizeSpeech ? 'danger' : 'success'}`}
               onClick={evt => {
                  evt.preventDefault();

                  setRecognizeSpeech(oldValue => !oldValue);

                  return false;
               }}>
               {recognizeSpeech ? 'Stop' : 'Start'}
            </button>
         </div>
      </div>

      <div className="row mt-2">
         <div className="col">
            <textarea className="form-control"
               readOnly
               value={recognizedSpeech}></textarea>
         </div>
      </div>
   </div>
}
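The demo only uses final results and a single alternative. As a rough sketch of the other options mentioned earlier (recognition language, interim results and confidence scores), something like the following could work. The configureRecognizer and summarizeResults helpers, the 'en-GB' default and the fake results are my own illustrative assumptions; only the property names (lang, continuous, interimResults, maxAlternatives, isFinal, transcript, confidence) come from the Web Speech API:

```javascript
// Applies recognition options to a recogniser instance.
// `recognizer` would be `new SpeechRecognition()` in a supporting browser.
function configureRecognizer(recognizer, { lang = 'en-GB', interim = true } = {}) {
   recognizer.lang = lang;              // BCP 47 language tag (default is an assumption)
   recognizer.continuous = true;        // keep listening after the first result
   recognizer.interimResults = interim; // also report results that may still change
   recognizer.maxAlternatives = 1;

   return recognizer;
}

// Splits the results of a `result` event into final and interim text.
// Each result is array-like: result[0] is the best alternative, carrying
// `transcript` and `confidence` (a number between 0 and 1).
function summarizeResults(results) {
   const finals = [];
   let interim = '';

   for (const result of Array.from(results)) {
      const best = result[0];

      if (result.isFinal) {
         finals.push({ text: best.transcript.trim(), confidence: best.confidence });
      } else {
         interim += best.transcript;
      }
   }

   return { finals, interim: interim.trim() };
}

// Hand-written stand-in for `event.results`, for illustration only.
const fakeResults = [
   Object.assign([{ transcript: 'hello world', confidence: 0.92 }], { isFinal: true }),
   Object.assign([{ transcript: ' how are', confidence: 0.4 }], { isFinal: false }),
];

console.log(summarizeResults(fakeResults));
```

In the React component above, summarizeResults(event.results) would be called from the onresult handler, with the interim text shown separately from the confirmed text.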

My initial testing looked pretty positive, but there are some (fairly serious) downsides:

- Browser support is fairly limited. It works in most Chromium-based browsers and in Safari, but not in Firefox at all. Where it is supported, it's supported with a vendor prefix (webkitSpeechRecognition). I've tested on Google Chrome, Safari and Microsoft Edge, and it worked on all three.
- I've found some references saying that Chromium-based browsers send the captured speech to a Google web service to perform the speech recognition, meaning you have to rely on an external third party.

I'm hoping that Firefox will add support fairly soon. For now I'm looking at providing this as an option hidden behind a setting labelled as a preview, so that end-users can opt in if they wish to.
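The feature detection and the opt-in could be combined roughly like this. It's only a sketch: the speechPreviewEnabled setting name and both helper functions are hypothetical, and the fake global objects stand in for real browsers so the snippet runs anywhere:

```javascript
// Returns the recognition constructor if the environment provides one,
// checking the standard name first and the WebKit-prefixed name second.
function getSpeechRecognition(globalObject = globalThis) {
   return globalObject.SpeechRecognition
      || globalObject.webkitSpeechRecognition
      || null;
}

// Hypothetical opt-in check: only offer speech input when the user has enabled
// the preview setting and the browser supports recognition.
function isSpeechInputAvailable(settings, globalObject = globalThis) {
   return Boolean(settings.speechPreviewEnabled)
      && getSpeechRecognition(globalObject) !== null;
}

// Stand-ins for different browsers' global objects.
class FakeRecognition {}
const chromiumLike = { webkitSpeechRecognition: FakeRecognition };
const firefoxLike = {};

console.log(isSpeechInputAvailable({ speechPreviewEnabled: true }, chromiumLike));  // true
console.log(isSpeechInputAvailable({ speechPreviewEnabled: true }, firefoxLike));   // false
console.log(isSpeechInputAvailable({ speechPreviewEnabled: false }, chromiumLike)); // false
```

Checking the unprefixed name first means the same code keeps working if a browser drops the prefix later.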