in reply to Alex Hall

@queenslight RH Voice is not AI-based, but if you set it up right, it is possible to train a model on a bunch of recordings paired with the sentences of a prepared script and generate a voice for anything RH Voice works on: SAPI 5, NVDA add-ons, Android APKs, Linux packages, etc. I'm sort of surprised this hasn't caught on more in the community. I'll try to dig up a tutorial someone in Poland wrote on how to set things up properly for it to work.
in reply to Alex Hall

RH Voice isn’t an end-to-end model like Tacotron, Flowtron, WaveNet, or whatever Apple is using. It roughly consists of three parts. First is text-to-phoneme transcription, which is entirely rules-based. Second is audio parameter estimation, which takes the phonemes as input and outputs what the audio should “look like” in terms of fundamental frequency and other such things. Third is audio synthesis, which takes these parameters and turns them into raw PCM samples, the kind you store in a wav file or send to your sound card. The first and third phases are just code, with the statistical, trained model in the middle. The first part is language-specific, the second is voice-specific, and the third is basically the same for all voices, minus some EQ settings. Training consists of transcribing your text to phonemes, calculating the parameters the model should output (basically reversing the synthesis process, going from wav to parameters), and then making your model predict the right parameters in the right situations.
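To make the three stages concrete, here is a toy sketch in Python. None of this is RH Voice code: the lexicon, the phoneme symbols, the lookup table standing in for the trained model, and all function names are made up for illustration, and the "synthesis" is just a sine wave per phoneme.

```python
import math
import struct

# Stage 1 (language-specific, rule-based): text -> phonemes.
# Hypothetical toy lexicon; a real system uses far richer rules.
TOY_LEXICON = {"hi": ["HH", "AY"], "me": ["M", "IY"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, []))  # unknown words are skipped
    return phonemes

# Stage 2 (voice-specific, trained): phonemes -> audio parameters.
# A fixed table stands in for the statistical model here.
# Each entry is (fundamental frequency in Hz, duration in milliseconds).
TOY_VOICE = {"HH": (0, 50), "AY": (120, 200), "M": (110, 80), "IY": (130, 150)}

def estimate_parameters(phonemes):
    return [TOY_VOICE[p] for p in phonemes]

# Stage 3 (shared by all voices): parameters -> raw 16-bit PCM samples.
def synthesize(params, sample_rate=16000):
    samples = []
    for f0, duration_ms in params:
        count = sample_rate * duration_ms // 1000
        for i in range(count):
            # Silence for unvoiced segments, a plain sine wave otherwise.
            value = math.sin(2 * math.pi * f0 * i / sample_rate) if f0 else 0.0
            samples.append(int(value * 0.5 * 32767))
    return struct.pack("<%dh" % len(samples), *samples)

pcm = synthesize(estimate_parameters(text_to_phonemes("hi me")))
```

Swapping `TOY_VOICE` for a different table changes how the output sounds without touching stages 1 or 3, which is the same split described above: language rules on one end, shared synthesis code on the other, and the voice-specific part in the middle.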
in reply to Alex Hall

I don’t really understand the math either. Most of it is handled automatically behind the scenes. The only thing you need to know is that text processing (whether plugin is pronounced like plug in or like ploo gin) is controlled programmatically, via language rules, and whether the voice sounds male, female, old, young, like you or like me is based on the training data.
in reply to Alex Hall

Do what? Change the language rules? Yeah, they use something called Foma for the processing, so if you get the hang of its syntax, you can do whatever. Getting all the linguistic knowledge to figure out what the rules should be is a different problem entirely. English is notoriously difficult in that respect; see e.g. the o in woman vs. the o in women, or the th in three vs. the th in though vs. the th in lighthouse. I haven’t used RH Voice extensively enough to know how much of that work is already done and how many elusive exceptions remain.
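To show why the exceptions are the hard part, here is a rough Python sketch of the general idea: an exception list checked first, then generic spelling rules. This is not Foma syntax and not how RH Voice actually stores its rules; the ARPAbet-style symbols, the tables, and the `transcribe` function are all assumptions for illustration.

```python
# Whole-word exceptions win over the generic rules (hypothetical entries).
EXCEPTIONS = {
    "woman": "W UH M AH N",
    "women": "W IH M AH N",
    "though": "DH OW",
    "lighthouse": "L AY T HH AW S",  # t + h across a word boundary, not "th"
}

# Generic spelling-to-sound rules, tried in order, so "th" comes before "h".
LETTER_RULES = [("th", "TH"), ("ee", "IY"), ("r", "R"), ("h", "HH")]

def transcribe(word):
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word].split()
    phonemes = []
    i = 0
    while i < len(word):
        for spelling, phoneme in LETTER_RULES:
            if word.startswith(spelling, i):
                phonemes.append(phoneme)
                i += len(spelling)
                break
        else:
            i += 1  # no rule for this letter in the toy table; skip it
    return phonemes
```

Here `transcribe("three")` goes through the generic rules, while woman, women, though, and lighthouse each need their own entry. Writing the mechanism is easy; knowing which words belong in the exception list is the linguistic work.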