Losing your larynx does not have to mean losing your own voice

Those who are at risk of losing their voice, or know in advance that they will lose it, will now be able to benefit from an automated voice banking and reconstruction system. This is currently being developed by researchers at the New Technologies for the Information Society (NTIS) Research Centre, which is part of the Faculty of Applied Sciences, University of West Bohemia in Plzeň. As early as next year, a computer program could be able to learn to reproduce patients’ speech, says project leader Jindřich Matoušek in this interview.

Total laryngectomy is a cancer treatment procedure that involves removing the whole of the larynx, including the vocal cords, which means that patients lose their voice. While there are special devices, such as an electrolarynx, that can give them their voice back, the voice these devices produce sounds artificial and impersonal. The Automatic Voice Banking and Reconstruction project aims to help these patients by allowing them to conserve or “bank” their voice using voice recordings that can be made in the comfort of their own home. The rest will be done by an automated software program.

The researchers from Plzeň are currently working on the project together with the First Faculty of Medicine at Charles University in Prague and the commercial companies Certicon and SpeechTech, while the funding is provided by the Technology Agency of the Czech Republic. Jindřich Matoušek, the project leader and a member of the research team at the Department of Cybernetics and the NTIS Research Centre, says that while it is already possible to create a reproduction of anybody’s voice, the process still requires input from human experts. The project is scheduled to end in 2020, by which time the process should be fully automated.

Why did you decide to start a voice banking project?
While commercial speech synthesis solutions that use the voices of professional speakers are now widely available, it’s just not the same for many people. We wanted to give them the option to retain their own voice, even though the quality might not be as good as with a professional speaker. People who are not experts in the field don’t mind that, and they are happy when they hear their own voice and their family and friends can recognise it as theirs. I think this can give a huge mental boost during the difficult time immediately after the operation. Speech synthesis as such is nothing new. We moved from mechanical synthesisers to electronic and computer synthesisers, but what remained was that you had to put in a lot of manual work to produce a good-sounding voice. It was a difficult and time-consuming process that could take months or even years, so you could not really use it to create a voice for just anybody. Our program will only need a few hours.

The project started in 2017 and ends in 2020. How much progress have you made?
At the moment, we are essentially able to create an individualised voice for anyone and we are experimenting with automatic processing, which is the main objective of the project. If you came to us now and said you want your voice recorded, we could do this in our anechoic chamber, where we have the right acoustic conditions and equipment, or, if you were pressed for time, we could give you instructions on how to record your voice at home. We would explain that you should use a neutral rather than emotional voice but should not sound monotonous or robotic – just imagine you are passing on a piece of information. If you could successfully record 500 to 1,000 sentences, this would be an excellent foundation for us to work with and create a package with your voice that would be easily recognisable. However, it would be a job for several people who would have to go through and manually process the files with the recordings, so it would take days rather than hours.

If you want to have your voice banked, what sentences do you have to record? Do you give people a standard text?
If they just read any text, it might lack some of the important speech phenomena and we could not guarantee a good result, so we had to find a way around that. Speech synthesis is designed to turn any text into speech, even a text that makes no sense, so you need a data package that includes as many elements of speech, phonetics and intonation as possible. For this reason, the sentences are chosen by a special algorithm to include a wide variety of phonetic and prosodic features. Prosodic features can change the meaning of a sentence, for example, whether it is a statement or a question, so it is not only about recording things such as the phone “a” but recording it properly at the beginning and the end of a sentence, where the sound is different. Plus, at the end of a sentence, there is a difference between a statement and a question – in Czech, the intonation would normally fall at the end of a statement and rise at the end of a question. We started with a million sentences and developed a scoring algorithm that helped us arrive at 3,500 sentences, which we think is the maximum number that our patients, who are not professional speakers, would be able to record. In other scenarios that use professional speakers, it is quite normal to have 10,000 or 20,000 sentences, sometimes even more.
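The interview does not describe the project’s actual scoring algorithm, but the idea of whittling a million sentences down to a small set with wide phonetic coverage can be sketched as a greedy coverage search. In this illustrative Python sketch, letter pairs stand in for phonetic units (a real system would use grapheme-to-phoneme conversion and prosodic labels), and the corpus and sentence counts are invented:

```python
def units(sentence):
    """Approximate a sentence's phonetic content by its letter pairs.
    A real system would run grapheme-to-phoneme conversion and also
    track prosodic context, e.g. whether a unit is sentence-final."""
    s = "#" + "".join(ch for ch in sentence.lower() if ch.isalpha()) + "#"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def select_sentences(corpus, target_count):
    """Greedily keep the sentence that adds the most uncovered units."""
    covered, chosen, remaining = set(), [], list(corpus)
    while remaining and len(chosen) < target_count:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen

corpus = ["ahoj svete", "dobry den", "jak se mas", "ahoj jak se mas"]
print(select_sentences(corpus, 2))  # ['ahoj jak se mas', 'dobry den']
```

Starting from a million candidates and stopping at 3,500, as described above, would use the same principle with a far richer scoring function.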

Is it because it takes time that patients do not have?
Yes. We had a patient who recorded his voice on Saturday and had his vocal cords removed on Monday. When you have cancer, time is everything and you have to move quickly, so a patient might learn that they have to undergo the procedure within a week, but they also learn that they have the option of having their voice banked. Obviously, they can record fewer than the 3,500 sentences; in fact, this is usually the case.

But that will presumably make the reproduced voice sound less than perfect.
If they can record 3,000 sentences, that’s great; 1,000 is still okay and even if they only record 300, we can still work with that.

How long does it take to make the recordings?
All the members of our team tested this on themselves. In my case, it took me three three-hour recording sessions to record 1,700 sentences. If you have no problem speaking and you have the time, you should arrive at a similar number. In other words, if you know that you are scheduled for a total laryngectomy or another throat procedure and you can devote one afternoon to this, you can record something like 600 sentences. The more, the better, but at the same time, the recording session should not be too long because your voice gets tired.

How many voices have you already recorded?
Around 50 in total, I think. Some 30 of them were voices of actual patients and the rest belonged to other non-professional speakers, such as our colleagues at the department. One of the patients who lost their voice now uses the speech synthesis of his voice in his profession – he works as a psychologist. His was one of the first recordings we made.

You mentioned that doctors inform their patients about the possibility to bank their voice.
One of the tasks of the Faculty of Medicine at Charles University in Prague is to raise awareness of our project in the medical community and educate the patients. Our project is also discussed at medical congresses and research conferences in the Czech Republic and abroad. However, some of the people interested in voice banking also contact us directly when they learn about the project from the media.

What happens once you have recorded the voice?
Until recently, speech synthesis was based on unit selection, which is a very intuitive process. Let’s say we have recorded a thousand sentences, but now we need to synthesise the speech, that is, to create a completely new sentence. In order to do this, we first have to break down the words of our new sentence into smaller units, such as phones. The algorithm will then find the corresponding “pieces” of speech and put them in a sequence. If the word we are trying to create is ahoj (the Czech for “hello”), the algorithm will find the respective phones, “a”, “h”, “o” and “j”, cut their signals out of the original recordings and join them together in a chain – the technical word for that is concatenating, which is why we call this concatenative synthesis. However, individual phones sound different in different contexts – an “a” will sound different when it follows a “p” compared to when it follows an “m”. It also depends on whether the phone is at the beginning or the end of a word, whether the word is at the beginning of a sentence, at the end of a sentence before a pause and so on. Unit selection simply means that we are looking for the best “a”, “h”, “o” and “j” out of all the recordings that we have.
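The concatenation step described above can be shown in a toy Python sketch. The “unit bank” here maps each phone to a short list of invented sample values standing in for signal fragments cut from real recordings; a real system would store many context-dependent variants per phone:

```python
# Toy "unit bank": each phone maps to a snippet of signal samples.
# The numbers are invented placeholders for cut-out speech waveforms.
phone_bank = {
    "a": [0.1, 0.3, 0.1],
    "h": [0.0, 0.05, 0.0],
    "o": [0.2, 0.4, 0.2],
    "j": [0.15, 0.25, 0.15],
}

def synthesise(word):
    """Concatenative synthesis at its simplest: look up each phone's
    stored signal and join (concatenate) the snippets end to end."""
    samples = []
    for phone in word:
        samples.extend(phone_bank[phone])
    return samples

print(len(synthesise("ahoj")))  # 4 phones x 3 samples = 12
```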

And presumably, you are also looking for an “a” at the beginning of a word.
That’s one of the criteria, but it’s even better if there is an “h” after the “a” in the word that we take it from, because that also changes how it sounds. This whole process is called unit selection synthesis and is a special case of concatenative synthesis. It is probably still the most widely used commercial method, but the trouble is that you need a very large set of high-quality acoustic data, which is a problem, especially when you are working with non-professional speakers. This is why there is now another method that can be used, statistical parametric synthesis. It is currently used almost exclusively in conjunction with neural network models and looks very promising. You don’t need as much data to arrive at synthetic speech of reasonable quality, so it is more suitable when working with patients who might not have the time or energy to record a large amount of text. When a patient only manages to record 200 or 300 sentences, we can mix their neural models with models based on other speakers, even professional ones, and use general statistical characteristics of speech to reconstruct their voice. The result will sound somewhat different from the voice produced by concatenative synthesis, and our goal is to use both methods and let the users choose between them.
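The context criteria described above are usually expressed as a target cost: each candidate unit is penalised for how much its recorded context differs from the context we need. The candidate names, contexts and cost function in this Python sketch are invented for illustration; a full system would also add a join cost between neighbouring units and search all paths with dynamic programming:

```python
# Each phone has several candidate units cut from different recordings,
# labelled with the left/right context they had in the original sentence.
# Candidate ids and contexts are invented for illustration.
candidates = {
    "a": [{"id": "a_1", "left": "m", "right": "k"},
          {"id": "a_2", "left": "#", "right": "h"}],
    "h": [{"id": "h_1", "left": "a", "right": "o"},
          {"id": "h_2", "left": "c", "right": "y"}],
}

def target_cost(unit, left, right):
    """Penalise units whose recorded context differs from the target."""
    return (unit["left"] != left) + (unit["right"] != right)

def select_units(phones):
    """For each target phone, pick the candidate whose recorded context
    best matches the context we need ('#' marks a word boundary)."""
    padded = ["#"] + phones + ["#"]
    chosen = []
    for i, ph in enumerate(phones):
        left, right = padded[i], padded[i + 2]
        best = min(candidates[ph], key=lambda u: target_cost(u, left, right))
        chosen.append(best["id"])
    return chosen

print(select_units(["a", "h"]))  # ['a_2', 'h_1']
```

Note how the word-initial “a” followed by “h” selects the candidate that was also recorded word-initially before an “h”, exactly the criterion mentioned above.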

So, once the project is finished, voice synthesising will be automated and accessible to everyone and users will also have the option to choose their preferred method. How will that work?
We are developing a web portal, uchovejhlas.cz, which will make that possible. It is currently in testing mode but the voice recording feature is not yet live. Once we make it available, users will be able to register and log in to a website that will serve as an internet recording interface with user-friendly buttons such as Record, Stop, Try Again and so on. Users will read sentences that will appear on the screen. Since everything has to be automated, the system will also review the users’ output, tell them whether the sentences were recorded correctly and let them know about any problems. The system will then automatically process the recordings without the need for any human input and the user will download the resulting data package.
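The interview does not say how the portal’s automatic review works. One plausible sketch, assuming the system obtains a speech-recognition transcript of each take, is to compare the transcript against the prompted sentence with word-level edit distance and reject takes that deviate too much; the threshold and helper names here are hypothetical:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete
                           cur[j - 1] + 1,       # insert
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def recording_ok(prompt, transcript, max_error_rate=0.2):
    """Accept a take if the transcript is close enough to the prompt."""
    p, t = prompt.lower().split(), transcript.lower().split()
    return edit_distance(p, t) / max(len(p), 1) <= max_error_rate

print(recording_ok("dobry den jak se mate", "dobry den jak se mate"))  # True
print(recording_ok("dobry den jak se mate", "dobry vecer"))            # False
```

A production system would likely also check audio-level problems such as clipping, background noise or a take that is too quiet, which a transcript comparison alone cannot catch.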

What kind of device will be required for this?
You won’t need anything more sophisticated than a computer or a smartphone with the right application. And although our system is primarily aimed at people who lose their voice, voice banking can be used by anyone else with voice issues as well as geeks who might want devices such as answering machines to speak to other people with their own voice.

Helping people who have lost their voice is only one of the applications of text-to-speech or TTS programs. Where else are they used? I know that they are used to give you the exact time over the telephone.
An interesting application of speech synthesis that we developed together with SpeechTech is the announcements that you can hear on public transport in Prague, such as on the metro. These are not the metro stop announcements – those were recorded by a human speaker – but announcements about things like incidents, road closures and replacement transportation services; in other words, announcements that cannot easily be recorded in advance. In general, TTS technology has made great progress in the past ten years and is now a standard feature of every smartphone, smart home assistant and GPS navigation system, where it is used to read street names and other text. It is also beginning to be used to produce audiobooks.

How about the text-to-speech rendering that is now available on the websites of some municipalities?
That is our system as well, and people with visual impairment are another large group who can benefit from speech synthesis. They can let the website read the text out loud to them and use speech synthesis to read emails or any content showing on the screen using a special screen reader application. There is a primary school in Plzeň for children with visual impairment where we helped create electronic textbooks as teaching aids for the children. The teachers, who know the children best, prepared lesson summaries in an interface similar to Word, and they wrote as though they were speaking to the children. We then added a specially adapted speech synthesis module. The students could then log in to ucebnice.zcu.cz, find the relevant textbook, and the speech synthesis would read the text out loud to them, describe the pictures and help them understand the content. The respective sections of the text were also highlighted so that the children would know which part was currently being read, to help them navigate the text. It was an interesting project that ended about four years ago and the aids are still being used.

In other words, speech synthesis is becoming an everyday part of our lives.
Which makes it even more important to keep on improving the quality of synthetic speech, because even though it has improved significantly, especially over the last ten years, there are still a number of applications where the current quality and, most importantly, the naturalness of synthetic speech are insufficient.