Over the past year, I've been experimenting with neural text-to-speech in various forms. I've put in hours of experimentation and research, training models and getting varying results along the way. Some of you may have heard of Piper, an open-source synthesizer and NVDA add-on that anyone can train. It is currently in active development, and I have been there from the beginning, testing and evaluating the various versions. For years, my goal has been to create a high-quality voice that is truly usable by a screen reader user, and yesterday I managed to achieve it. I'm really excited to share Alba, a female Scottish English voice. I'm considering this a beta phase, and I'm looking for feedback so I can make improvements as needed. Please note that you will most likely get an error upon installation; however, the voice should still appear in NVDA, and I'm working on a fix as soon as possible.
Link to Piper: github.com/rhasspy/piper/tree/…
Link to addon: github.com/mush42/piper-nvda?r…
Link to Alba: drive.google.com/file/d/1wZHuI… #TTS #AI #ScreenReader #Piper
in reply to Pratik Patel

@ppatel Thank you so much, that really means a lot. I completely agree with you: responsiveness is not there yet, but it is currently being worked on. I'm hoping that in time we will get to a point where it's quite usable, but even now I'm surprised it works at all, considering that ML speech synthesis was restricted to the cloud until fairly recently. I also want to stress that this was very easy to train, and I plan on creating a guide to help people make their own models in almost any language they would like.
in reply to Pratik Patel

@ppatel Absolutely, I'm excited to do it! Just to give some perspective: I took about 1,200 audio files from a dataset by CSTR, downsampled them to 22,050 Hz WAV files, transcribed them with OpenAI Whisper, put them into the correct format, and then trained for about three hours. I had no input during the training process; this is the raw result from Piper.
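The "correct format" step above can be sketched in a few lines. This is a minimal, hypothetical example of turning (filename, transcript) pairs into pipe-delimited metadata rows of the kind commonly used for TTS training data; the function name and the sample IDs are my own illustration, not taken from the Piper repo.

```python
# Sketch: formatting Whisper transcripts into pipe-delimited "id|text" rows,
# the LJSpeech-style layout many TTS training scripts consume.
# Names and sample IDs here are hypothetical.

def to_metadata_rows(entries):
    """entries: list of (wav_stem, transcript) -> list of 'id|text' lines."""
    rows = []
    for stem, text in entries:
        # Collapse stray whitespace that transcription output sometimes contains.
        text = " ".join(text.split())
        rows.append(f"{stem}|{text}")
    return rows

rows = to_metadata_rows([("alba_0001", "  Hello there.  ")])
print(rows[0])  # alba_0001|Hello there.
```

From there, the rows would just be written out to a metadata file alongside the downsampled WAVs before training starts.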
in reply to Pitermach

@pitermach Hi Piotr, I only used around 59 minutes of training audio. The entire CSTR dataset is around four hours, but I wanted to start with less just to see how it would perform. The great thing about machine learning is that, depending on the model architecture you choose, you can get away with very little data and still get good results, so it's great for low-resource languages. Unfortunately I don't have an NVIDIA GPU to take advantage of, as I'm on macOS, so I used Google Colab Pro to train the model before exporting it to the ONNX runtime to work with Piper.
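If you're curious how much audio you actually have before training, a figure like "59 minutes" is easy to check yourself. Here's a small sanity-check sketch using only Python's standard-library `wave` module; the file name is hypothetical, and Piper doesn't require this step.

```python
# Sketch: measuring the duration of WAV training data with the stdlib
# wave module. The demo file written below is hypothetical.
import wave

def wav_duration_seconds(path):
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16-bit mono silence at 22,050 Hz, then measure it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes = 16-bit samples
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050)

print(wav_duration_seconds("demo.wav"))  # 1.0
```

Summing this over every file in the dataset gives the total minutes of training audio.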
in reply to Zach Bennoui

Have you experimented with RHVoice? A few people got it to work with Polish, and it has been a huge success here. The English voices aren’t terribly amazing, but it’s nothing that can’t be overcome. It’s a pain in the ass to build but the instructions are there, and if you find/record a couple hours of high-quality audio, you can make a really decent, responsive voice.
in reply to Mikołaj Hołysz

@miki Yes, I attempted to train RHVoice models a few months ago; however, the process is quite finicky and it's hard to get good results. HMM-based TTS seems to be incredibly particular about the quality of your audio, and even if the speech is clean you can still have problems. It also seems to take a very long time to train, and I at least was unable to get anything usable out of it. I may try again in the future, but for now ML is my main focus, as I believe it is the far superior technology currently. If someone could do a demo of training RHVoice, that would be incredibly helpful, as I would like to try it again at some point.
in reply to Mikołaj Hołysz

@miki Someone got it to work. See here. github.com/rhasspy/piper/pull/…