With all the neat AI stuff these days, is there a service that takes a person's voice and turns it into a SAPI5 synthesizer? I know companies like LyreBird are working on the voice part, but they never let you use the generated voice outside of their website.
in reply to Alex Hall

@queenslight RH Voice is not AI-based, but if you can set it up right, it is possible to train a model on a set of recordings matched to the sentences of a prepared script, and then generate anything RH Voice supports: SAPI 5, NVDA add-ons, Android APKs, Linux packages, etc. I'm sort of surprised this hasn't caught on more in the community. I'll try to dig up a tutorial someone in Poland wrote on how to set things up properly for it to work.
in reply to Paweł Masarczyk

@Piciok @queenslight Really? That's neat. I tried RHVoice, and it wasn't bad. I didn't know it could be trained.
in reply to Alex Hall

Here is the walkthrough on creating a new voice for #RHVoice: https://github.com/RHVoice/RHVoice/wiki/Creating-a-new-voice-for-RHVoice
If you are adding a new voice for an existing language, that might not be that difficult to do, I guess.
However, I myself am having difficulties configuring phoneme translation for Slovak, and I am only at the beginning of my ultimate goal of creating a #slovak voice.
in reply to Alex Hall

@pvagner @Piciok @queenslight They’re working on a way to do this in Colab apparently, which will be a lot simpler.
in reply to Alex Hall

@pvagner @Piciok @queenslight It’s a Google-run service for training AI models and playing with all kinds of machine learning algorithms, free as long as you don’t overuse it. It’s based on Jupyter, but as far as I’ve heard, the fork they’re running is more accessible than the open source version.
in reply to Mikołaj Hołysz

@miki @pvagner @Piciok

Correct. It is quite accessible to screen readers. Admittedly best with Chrome, though Firefox ain't too bad with it.
in reply to Alex Hall

RH Voice can do it, but it’s a bitch and a half to train, and English still needs some work on the phonetic transcription side. Works well enough for most Eastern European languages though.
in reply to Mikołaj Hołysz

@miki I didn't see anything about training RHVoice. I installed their English voices a while ago, and one is pretty good. If one can train it, how is it still bad at English?
in reply to Alex Hall

RH Voice isn’t an end-to-end model like Tacotron / Flowtron / WaveNet / whatever Apple is using. It roughly consists of three parts: text-to-phoneme transcription, which is entirely rule-based; audio parameter estimation, which takes the phonemes as input and outputs what the audio should “look like” in terms of fundamental frequencies and other such things; and audio synthesis, which takes these audio parameters and turns them into raw PCM samples, the kind you store in a wav file or send to your sound card. The first and third phases are just code, with the statistical, trained model in the middle. The first part is language-specific, the second is voice-specific, and the third is basically the same for all voices, minus some EQ settings. Training consists of transcribing your text to phonemes, calculating the parameters that the model should output (basically reversing the synthesis process and going from wav to parameters), and then making your model predict the right parameters in the right situations.
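To make the three stages a bit more concrete, here is a toy Python sketch of that shape. It is not RH Voice code, and every name and number in it (`TOY_RULES`, `TOY_VOICE`, the phoneme symbols, the pitch values) is made up for illustration. The "model" in the middle is just a lookup table standing in for the trained statistical part, and the synthesizer only produces plain sine tones.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second; an arbitrary choice for this sketch

# Stage 1 (language-specific, rule-based): a toy letter-to-phoneme table.
# Real rules are far richer; these symbols are invented.
TOY_RULES = {"a": "AA", "m": "M", "n": "N", "o": "OW"}

def text_to_phonemes(text):
    """Map each known letter to a phoneme symbol; unknown characters are skipped."""
    return [TOY_RULES[ch] for ch in text.lower() if ch in TOY_RULES]

# Stage 2 (voice-specific): a stand-in for the trained model.
# Each phoneme maps to (fundamental frequency in Hz, duration in milliseconds).
TOY_VOICE = {"AA": (220.0, 100), "M": (180.0, 50), "N": (190.0, 50), "OW": (200.0, 100)}

def estimate_parameters(phonemes, voice=TOY_VOICE):
    """Return the audio parameters the 'voice' predicts for each phoneme."""
    return [voice[p] for p in phonemes]

# Stage 3 (shared by all voices): turn parameters into raw 16-bit PCM samples.
def synthesize(parameters):
    samples = []
    for f0, duration_ms in parameters:
        count = SAMPLE_RATE * duration_ms // 1000
        for i in range(count):
            value = math.sin(2 * math.pi * f0 * i / SAMPLE_RATE)
            samples.append(int(value * 0.5 * 32767))  # half of full scale
    # Little-endian signed 16-bit, the sample format commonly used in wav files.
    return struct.pack("<%dh" % len(samples), *samples)

def speak(text):
    """Run all three stages in order and return raw PCM bytes."""
    return synthesize(estimate_parameters(text_to_phonemes(text)))
```

Swapping `TOY_VOICE` for another table changes how the output sounds without touching the other two stages, which is roughly the split described above: rules per language, a trained model per voice, shared synthesis code.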
in reply to Mikołaj Hołysz

@miki That sounds both hard and confusing, especially for someone like me who lacks any real understanding of the details.
in reply to Alex Hall

I don’t really understand the math either. Most of it is handled automatically behind the scenes. The only thing you need to know is that text processing (whether plugin is pronounced like plug in or like ploo gin) is controlled programmatically, via language rules, while whether the voice sounds male, female, old, young, like you or like me depends on the training data.
in reply to Mikołaj Hołysz

@miki Can anyone do this? It seems like it should be pretty language-agnostic, if someone can adjust things.
in reply to Alex Hall

Do what? Change the language rules? Yeah, they use something called Foma for doing the processing, so if you get the hang of the syntax it uses, you can do whatever. Getting all the linguistic knowledge to figure out what the rules should be is a different problem entirely. English is notoriously difficult in that respect: see e.g. the o in woman vs. the o in women, or the th in three vs. the th in though vs. the th in lighthouse. I haven’t used RH Voice extensively enough to know how much of that work is already done and how many elusive exceptions remain.
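A rough illustration of why the exceptions hurt, as a Python sketch. This is not Foma syntax and not how RH Voice stores its rules; the rule table, the exception lexicon and the phoneme symbols are all invented for the example. It applies a couple of general spelling rules and shows where a word-level override is needed.

```python
# Hypothetical general rules: (spelling pattern, phoneme symbol), tried in order.
RULES = [("th", "T"), ("ee", "i:")]

# Hypothetical exception lexicon: whole words the general rules get wrong.
EXCEPTIONS = {
    "woman": "w U m @ n",
    "women": "w I m I n",
    "though": "D oU",
    "lighthouse": "l aI t h aU s",
}

def apply_rules(word):
    """Scan left to right, applying the first matching rule at each position.
    Letters no rule covers are passed through unchanged."""
    out = []
    i = 0
    while i < len(word):
        for pattern, phoneme in RULES:
            if word.startswith(pattern, i):
                out.append(phoneme)
                i += len(pattern)
                break
        else:
            out.append(word[i])
            i += 1
    return " ".join(out)

def transcribe(word):
    """Check the exception lexicon first, then fall back to the general rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return apply_rules(word)
```

With only the general rules, `apply_rules("lighthouse")` treats the t and h as a single "th" sound, which is exactly the kind of mistake described above. `transcribe("lighthouse")` avoids it only because someone added that word to the exception list by hand, and English needs a lot of those.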