Over the past year, I've been experimenting with neural text-to-speech in various forms. I've put in hours of experimentation and research, training models and getting varying results along the way. Some of you may have heard of Piper, an open-source synthesizer and NVDA add-on that anyone can train. It is currently in active development, and I have been there from the beginning, testing and evaluating the various versions. For years, my goal has been to create a high-quality voice that is truly usable by a screen reader user, and yesterday I managed to achieve it. I'm really excited to share Alba, a female Scottish English voice. I'm considering this a beta phase, and I'm looking for feedback so I can make improvements as needed. Please note that you will most likely get an error upon installation; however, the voice should still appear in NVDA, and I'm working on a fix as soon as possible.
Link to Piper: github.com/rhasspy/piper/tree/…
Link to addon: github.com/mush42/piper-nvda?r…
Link to Alba: drive.google.com/file/d/1wZHuI… #TTS #AI #ScreenReader #Piper
in reply to Pratik Patel

@ppatel Thank you so much, that really means a lot. I completely agree with you: responsiveness is not there yet, but it is currently being worked on. I'm hoping that in time we will get to a point where it's quite usable, but even now I'm surprised it works at all, considering that ML speech synthesis was restricted to the cloud until fairly recently. I also want to stress that this was very easy to train, and I plan on creating a guide to help people make their own models in almost any language they would like.
in reply to Pratik Patel

@ppatel Absolutely, I'm excited to do it! Just to give some perspective: I took about 1,200 audio files from a dataset by CSTR, downsampled them to 22,050 Hz WAV files, transcribed them with OpenAI Whisper, put them into the correct format, and then trained for about three hours. I had no input during the training process; this is the raw result from Piper.
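The "correct format" step above can be sketched in a few lines. This is a minimal, hypothetical example of turning (filename, transcript) pairs into pipe-delimited metadata rows of the kind commonly used for TTS training data; the function name and the sample IDs are my own illustration, not taken from the Piper repo.

```python
# Sketch: formatting Whisper transcripts into pipe-delimited "id|text" rows,
# the LJSpeech-style layout many TTS training scripts consume.
# Names and sample IDs here are hypothetical.

def to_metadata_rows(entries):
    """entries: list of (wav_stem, transcript) -> list of 'id|text' lines."""
    rows = []
    for stem, text in entries:
        # Collapse stray whitespace that transcription output sometimes contains.
        text = " ".join(text.split())
        rows.append(f"{stem}|{text}")
    return rows

rows = to_metadata_rows([("alba_0001", "  Hello there.  ")])
print(rows[0])  # alba_0001|Hello there.
```

From there, the rows would just be written out to a metadata file alongside the downsampled WAVs before training starts.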
in reply to Pitermach

@pitermach Hi Piotr, I only used around 59 minutes of training audio. The entire CSTR dataset is around four hours, but I wanted to start with less just to see how it would perform. The great thing about machine learning is that, depending on the model architecture you choose, you can get away with very little data and still get good results, so it's great for low-resource languages. Unfortunately I don't have an NVIDIA GPU to take advantage of, as I'm on macOS, so I used Google Colab Pro to train the model before exporting it to the ONNX runtime to work with Piper.
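If you're curious how much audio you actually have before training, a figure like "59 minutes" is easy to check yourself. Here's a small sanity-check sketch using only Python's standard-library `wave` module; the file name is hypothetical, and Piper doesn't require this step.

```python
# Sketch: measuring the duration of WAV training data with the stdlib
# wave module. The demo file written below is hypothetical.
import wave

def wav_duration_seconds(path):
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16-bit mono silence at 22,050 Hz, then measure it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes = 16-bit samples
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050)

print(wav_duration_seconds("demo.wav"))  # 1.0
```

Summing this over every file in the dataset gives the total minutes of training audio.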
in reply to Zach Bennoui

Have you experimented with RHVoice? A few people got it to work with Polish, and it has been a huge success here. The English voices aren’t terribly amazing, but it’s nothing that can’t be overcome. It’s a pain in the ass to build but the instructions are there, and if you find/record a couple hours of high-quality audio, you can make a really decent, responsive voice.
in reply to Mikołaj Hołysz

@miki Yes, I attempted to train RHVoice models a few months ago; however, the process is quite finicky and it's hard to get good results. HMM-based TTS seems to be incredibly particular about the quality of your audio, and even if the speech is clean you can still have problems. It also seems to take a very long time to train, and I at least was unable to get anything usable out of it. I may try again in the future, but for now ML is my main focus, as I believe it is the far superior technology currently. If someone could do a demo of training RHVoice, that would be incredibly helpful, as I would like to try it again at some point.
in reply to Mikołaj Hołysz

@miki Someone got it to work. See here. github.com/rhasspy/piper/pull/…