Voice-enabled Web Apps: Introduction to the Speech Synthesis API

July 1, 2019

In my previous blog post, I looked into Web Workers and how they can be used to execute long-running tasks in the background without affecting the end-user experience. In this article I’ll be looking into another browser API and using it to build an app!


Voice-driven applications are very common these days. From digitals assistants like Siri and Alexa on our smartphones to smart homes, voice is playing a major role in modern applications. With that in mind, there are a few APIs that developers can leverage to add voice-driven functionality to their web apps, which are baked directly into the browser. They are all part of the Web Speech API.


Web Speech API

The Web Speech API adds voice capabilities like text-to-speech and speech-to-text to the browser. It is made up of two APIs: the Speech Recognition API (speech-to-text) and the Speech Synthesis API (text-to-speech). This article focuses on the latter, as the former has not yet gotten enough adoption from browser vendors.


Speech Synthesis API Interface

The Speech Synthesis API takes in text input and reads it out to the user. The voice and its properties, like pitch, volume, and language, can be altered. The basic usage involves passing a SpeechSynthesisUtterance to speechSynthesis.speak():

const speech = new SpeechSynthesisUtterance('Hello world');
window.speechSynthesis.speak(speech);

If the browser supports speech synthesis, you will hear "Hello world". The speech can also be altered by changing parameters like volume, pitch, rate, and voice:

const voices = window.speechSynthesis.getVoices();
speech.voice = voices[3]; // pick one of the available voices
speech.volume = 0.8; // 0 to 1
speech.rate = 1; // 0.1 to 10
speech.pitch = 2; // 0 to 2
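
One caveat: in some browsers, getVoices() returns an empty array until the voice list has loaded asynchronously, so it is safer to wait for the voiceschanged event before picking a voice. Here is a minimal sketch (pickVoice is a hypothetical helper, not part of the API):

// getVoices() can be empty before the browser has loaded its voice data,
// so pick a voice once the voiceschanged event fires.
function pickVoice(lang) {
  const voices = window.speechSynthesis.getVoices();
  return voices.find((v) => v.lang.startsWith(lang)) || voices[0];
}

window.speechSynthesis.addEventListener('voiceschanged', () => {
  speech.voice = pickVoice('en');
});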

The speech can also be paused and resumed:

window.speechSynthesis.pause(); // pause speech
window.speechSynthesis.resume(); // resume speech
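
Each utterance also fires events that can be used to track playback. A small sketch of the ones I find most useful:

speech.onstart = () => console.log('started speaking');
speech.onend = () => console.log('finished speaking');
speech.onboundary = (e) => console.log(`reached character ${e.charIndex}`);
speech.onerror = (e) => console.error('speech error:', e.error);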

This is a general overview of the API and covers most of the basics. How can it be used in a web app? I built a PDF file reader using the Speech Synthesis API and Web Workers to see how the two can work together. The app converts any PDF file to text that is read aloud. The UI is built with Preact because I wanted to keep the size down: the main app is under 30kb, plus the PDF processing library, PDFJS, which is just under 300kb compressed. It's a PWA (Progressive Web App), so it works fully offline on repeat visits. The structure of the app is explained below.


Assumptions

I made a few assumptions about the PDF input:

  • Text-only PDFs. The app only works for the text in a PDF, so if it encounters an image, nothing happens.
  • Image PDFs. Some PDFs contain images that themselves contain text. I did not handle this case either, as it would require an image processing (OCR) library, which is outside the scope of my use case.


Components of the App

https://res.cloudinary.com/codehacks/image/upload/v1561983653/kncdjnmle.png

Different app components


As can be seen in the image above, the main thread contains the app logic, built with Preact. There are two main components: <Intro /> and <Podcast />. <Intro /> contains the logic to drag and drop or select a PDF file. When a PDF file is selected, the file data is read using the FileReader API and stored as a base64 string in localStorage. The <Intro /> component then routes to <Podcast />. A sketch of what that hand-off could look like is below (storePdf is a hypothetical helper; the routing call depends on the router in use):
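
// Hypothetical sketch: read a dropped/selected PDF file and persist it.
function storePdf(file) {
  const reader = new FileReader();
  reader.onload = () => {
    // reader.result is a data URL: "data:application/pdf;base64,...".
    // Strip the prefix and keep only the base64 payload.
    const base64 = reader.result.split(',')[1];
    localStorage.setItem('pdf', base64);
    route('/podcast'); // e.g. with preact-router (assumption)
  };
  reader.readAsDataURL(file);
}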


In <Podcast />, the base64 string is read from localStorage and passed to a function that converts it to a Uint8Array, since PDFJS accepts either a URL or a Uint8Array. Processing and manipulating PDF files can consume CPU resources and take time, especially on low-end devices, so it is a good fit for web workers. Fortunately, PDFJS does its core operations in a web worker. Once the PDF file has been processed, the response is an array of arrays containing the contents of the file. The result is looped through to get each page's content, which is passed to another web worker. The new worker loops through it, extracts the raw text, and returns it as a string to the main thread. Communication between the main thread and the web worker is done through postMessage(). Here is a sketch of that pipeline, assuming a recent PDFJS build (loaded as pdfjsLib) where getDocument() returns a loading task with a .promise; base64ToUint8Array, text-worker.js, and speak() are my own hypothetical names:
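
// Convert the stored base64 string back into binary for PDFJS.
function base64ToUint8Array(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Extract the text items of every page; PDFJS runs its parsing
// in its own web worker behind this API.
async function extractPages(base64) {
  const pdf = await pdfjsLib.getDocument({ data: base64ToUint8Array(base64) }).promise;
  const pages = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    pages.push((await page.getTextContent()).items);
  }
  return pages; // array of arrays: one list of text items per page
}

// Hand the page contents to a second worker that flattens them to raw text.
extractPages(localStorage.getItem('pdf')).then((pages) => {
  const worker = new Worker('text-worker.js');
  worker.postMessage(pages);
  worker.onmessage = (e) => speak(e.data); // speak() is sketched further below
});

// text-worker.js: join each page's text items into one big string.
self.onmessage = (e) => {
  const text = e.data
    .map((items) => items.map((item) => item.str).join(' '))
    .join('\n');
  self.postMessage(text);
};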


On the main thread, the string is used to create a SpeechSynthesisUtterance and read to the user. The Speech Synthesis API uses a queue, so any new SpeechSynthesisUtterance is added to the queue and read when its turn comes.
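
Very long strings can be unwieldy as a single utterance, so one approach is to split the text into sentences and let the queue play them back in order. A sketch of the hypothetical speak() helper used above:

// Queue one utterance per sentence; the Speech Synthesis queue
// plays them back in order automatically.
function speak(text) {
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [text];
  for (const sentence of sentences) {
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(sentence));
  }
}

Here is the app in action: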

https://res.cloudinary.com/codehacks/image/upload/v1561983667/cdnkmd.gif


You can find the live demo on Netlify and the source code on GitHub. Thanks for reading!