Over the past year, I've been experimenting with neural text-to-speech in various forms. I have done hours of experimentation and research, training models and getting varying results along the way. Some of you may have heard of Piper, an open-source synthesizer and add-on for NVDA that can be trained by anyone. It is currently in active development, and I have been there from the beginning, testing and evaluating the various versions. For years, I have had a goal to create a high-quality voice that is truly usable by a screen reader user, and yesterday I managed to achieve this. I'm really excited to share Alba, a female Scottish English voice. I'm considering this a beta phase, and I'm looking for feedback to make improvements as needed. Please note that you will most likely get an error upon installation; however, the voice should still show up in NVDA, and I'm working on fixing this as soon as possible.
Link to Piper: github.com/rhasspy/piper/tree/…
Link to addon: github.com/mush42/piper-nvda?r…
Link to Alba: drive.google.com/file/d/1wZHuI… #TTS #AI #ScreenReader #Piper
in reply to Zachary Bennoui

@KaraLG84 It's an excellent new voice. I just wish the engine were a little more performant. I'd love to use this voice for my day-to-day needs.
in reply to Pratik Patel

@ppatel Thank you so much, that really means a lot. I completely agree with you: responsiveness is not there yet, but it is currently being worked on. I'm hoping in time we will get to a point where it's quite usable, but even now I'm surprised that it works at all, considering that ML speech synthesis was restricted to the cloud until fairly recently. I also want to stress that this was very easy to train, and I plan on creating a guide to help people make their own models in almost any language they would like.
in reply to Zachary Bennoui

Oh that guide would be incredibly helpful. Thank you for doing this.
in reply to Pratik Patel

@ppatel Absolutely, I'm excited to do it! Just to give some perspective: I took about 1,200 audio files from a dataset by CSTR, downsampled them to 22,050 Hz WAV files, transcribed them using OpenAI Whisper and put them into the correct format, then trained for about three hours. I did not have any input during the training process; this is the raw result from Piper.
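To make that pipeline concrete, here is a minimal Python sketch of the downsample-and-transcribe step. This is not the exact script I used: the paths, the Whisper model size, and the LJSpeech-style metadata layout are assumptions, so check them against Piper's training docs.

```python
# Sketch of the preprocessing described above. Assumptions: the directory
# layout, the Whisper model size, and the LJSpeech-style metadata format
# (id|text) that Piper's training pipeline expects.
# Requires: pip install openai-whisper librosa soundfile
from pathlib import Path

import librosa
import soundfile as sf
import whisper

SRC = Path("cstr_raw")        # original CSTR recordings (assumed layout)
DST = Path("dataset/wavs")    # downsampled 22,050 Hz WAV files
DST.mkdir(parents=True, exist_ok=True)

model = whisper.load_model("base.en")  # small English-only Whisper model
rows = []

for wav in sorted(SRC.glob("*.wav")):
    # Resample to 22,050 Hz mono, the rate mentioned above.
    audio, _ = librosa.load(wav, sr=22050, mono=True)
    out = DST / wav.name
    sf.write(out, audio, 22050)

    # Transcribe with Whisper and collect an LJSpeech-style metadata row.
    text = model.transcribe(str(out))["text"].strip()
    rows.append(f"{wav.stem}|{text}")

(DST.parent / "metadata.csv").write_text("\n".join(rows), encoding="utf-8")
```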
in reply to Zachary Bennoui

@ppatel Out of curiosity, how much material did you need for this (as in, how many hours was your training data), and what computer did you use for training? Got a friend who is interested in trying to train a Polish voice as a test, perhaps using the raw data we have from training Polish RHVoices.
in reply to Pitermach

@pitermach Hi Piotr, I only used around 59 minutes of training audio. The entire CSTR dataset is around four hours, but I wanted to start with less just to see how it would perform. The great thing about machine learning is that, depending on the model architecture you choose, you can get away with very little data and still get good results, so it's great for low-resource languages. I unfortunately do not have an Nvidia GPU to take advantage of, as I'm on macOS, so I used Google Colab Pro to train the model before exporting it to the ONNX runtime to work with Piper.
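If you want to sanity-check an exported voice outside of NVDA, you can load it with onnxruntime directly. A minimal sketch follows; the file name is hypothetical, and the input names and scale defaults are assumptions based on Piper's VITS-style models, so verify them against what the session actually reports.

```python
# Sanity check of an exported Piper ONNX voice -- a sketch, not official
# tooling. Input names ("input", "input_lengths", "scales") and the scale
# defaults are assumptions; print the real ones before relying on them.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("alba.onnx")  # hypothetical file name

# List the model's actual inputs rather than trusting the assumptions above.
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Dummy phoneme-ID sequence, just to confirm the graph runs end to end.
phoneme_ids = np.zeros((1, 16), dtype=np.int64)
audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        # noise_scale, length_scale, noise_w (assumed defaults)
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    },
)[0]
print("audio shape:", audio.shape)
```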
in reply to Zachary Bennoui

All right, thanks. Sounds like training it on RHVoice data might be very practical then, because the recording scripts they use for that add up to about 2 hours of audio. I'll definitely look forward to your guide then, because the voice you generated sounds great.
in reply to Zachary Bennoui

Have you experimented with RHVoice? A few people got it to work with Polish, and it has been a huge success here. The English voices aren’t terribly amazing, but it’s nothing that can’t be overcome. It’s a pain in the ass to build, but the instructions are there, and if you find/record a couple of hours of high-quality audio, you can make a really decent, responsive voice.
in reply to Mikołaj Hołysz

@miki Yes, I attempted to train RHVoice models a few months ago; however, the process is quite finicky and hard to get good results with. HMM-based TTS seems to be incredibly particular about the quality of your audio, and even if the speech is clean you can still have problems. It also seems to take a very long time to train, and I at least was unable to get anything usable out of it. I may try again in the future, but for now ML is my main focus, as I believe it is the far superior technology at the moment. If someone could do a demo of training RHVoice, that would be incredibly helpful, as I would like to try it again at some point.
in reply to Zachary Bennoui

This doesn’t seem to run on Apple Silicon (probably due to missing SIMD extensions, which I believe are still covered by Intel patents).
in reply to Mikołaj Hołysz

@miki Someone got it to work. See here. github.com/rhasspy/piper/pull/…
in reply to Zachary Bennoui

This is for Piper running natively on Apple Silicon (compiled for ARM), not for emulating x86.
in reply to Mikołaj Hołysz

@miki Oh, I understand now. I have an M1 Pro Mac here; however, I use my Intel machine to test, as there is no synth driver for VoiceOver yet. I'm not really sure how to help, unfortunately.
in reply to Zachary Bennoui

I mainly posted this in case others wanted to know, not to complain. It makes your NVDA go completely silent, so I think it’s worth knowing about.