Alex Hall

2 years ago


With all the neat AI stuff these days, is there a service that takes a person's voice and turns it into a SAPI5 synthesizer? I know companies like LyreBird are working on the voice part, but they never let you use the generated voice outside of their website.

in reply to Alex Hall

Paweł Masarczyk

in reply to Alex Hall 2 years ago

@queenslight RH Voice is not AI-based, but if you can set it up right, it is possible to train a model on a bunch of recordings paired with the sentences of a prepared script, and then generate a voice for anything RH Voice works on: SAPI 5, NVDA add-ons, Android APKs, Linux packages, etc. I'm sort of surprised this hasn't caught on more in the community. I'll try to dig up a tutorial someone in Poland wrote on how to set things up properly for it to work.


in reply to Paweł Masarczyk

Alex Hall

in reply to Paweł Masarczyk 2 years ago

@Piciok @queenslight Really? That's neat. I tried RHVoice, and it wasn't bad. I didn't know it could be trained.


in reply to Alex Hall

Peter Vágner

in reply to Alex Hall 2 years ago

Here is the walkthrough on creating a new voice for #RHVoice: github.com/RHVoice/RHVoice/wik…
If you are adding a new voice for an existing language, that might not be that difficult to do, I guess.
However, I myself am having difficulties configuring phoneme translation for Slovak, and I am just at the beginning of my ultimate goal of creating a #slovak voice.


in reply to Peter Vágner

Alex Hall

in reply to Peter Vágner 2 years ago

@pvagner @Piciok @queenslight Ah, thank you. This looks extremely involved. Wow.


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

@pvagner @Piciok @queenslight They’re working on a way to do this in Colab apparently, which will be a lot simpler.


in reply to miki

Alex Hall

in reply to miki 2 years ago

@miki @pvagner @Piciok @queenslight I'm not sure what that is, but if it makes the process easier, I'll just wait for that.


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

@pvagner @Piciok @queenslight It’s a Google-run service for training AI models and playing with all kinds of machine learning algorithms, free as long as you don’t overuse it. It’s based on Jupyter, but as far as I’ve heard, the fork they’re running is more accessible than the open source version.


in reply to miki

Trenton Matthews

in reply to miki 2 years ago

@miki @pvagner @Piciok

Correct. It is quite accessible to screen readers. Admittedly best with Chrome, though Firefox ain't too bad with it.


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

RH Voice can do it, but it’s a bitch and a half to train, and English still needs some work on the phonetic transcription side. Works well enough for most Eastern European languages though.

in reply to miki

Alex Hall

in reply to miki 2 years ago

@miki I didn't see anything about training RHVoice. I installed their English voices a while ago, and one is pretty good. If one can train it, how is it still bad at English?


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

RH Voice isn’t an end-to-end model like Tacotron / Flowtron / WaveNet / whatever Apple is using. It roughly consists of three parts: text-to-phoneme transcription, which is entirely rule-based; audio parameter estimation, which takes the phonemes as input and outputs what the audio should “look like” in terms of fundamental frequency and other such things; and audio synthesis, which takes these audio parameters and turns them into raw PCM samples, the kind you store in a WAV file or send to your sound card. The first and third phases are just code, with the statistical, trained model in the middle. The first part is language-specific, the second is voice-specific, and the third is basically the same for all voices, minus some EQ settings. Training consists of transcribing your text to phonemes, calculating the parameters that the model should output (basically reversing the synthesis process and going from WAV to parameters), and then making your model predict the right parameters in the right situations.
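
To make the three stages concrete, here is a toy sketch in Python. It is not RH Voice code: every name in it (`TOY_LEXICON`, `TOY_VOICE`, `text_to_phonemes`, `estimate_parameters`, `synthesize`) is made up for illustration, a fixed lookup table stands in for the trained statistical model, and a plain sine wave stands in for the real synthesis step.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second (assumed for this toy)

# Stage 1 (language-specific, rule-based): text -> phonemes.
# Hypothetical two-word lexicon; a real front end uses a large rule set.
TOY_LEXICON = {"hi": ["HH", "AY"], "you": ["Y", "UW"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, []))
    return phonemes

# Stage 2 (voice-specific, trained in a real system): phonemes -> parameters.
# This table of (fundamental frequency in Hz, duration in ms) is a stand-in
# for the statistical model that would be learned from recordings.
TOY_VOICE = {"HH": (0, 50), "AY": (140, 150), "Y": (150, 50), "UW": (130, 150)}

def estimate_parameters(phonemes):
    return [TOY_VOICE[p] for p in phonemes]

# Stage 3 (shared by all voices): parameters -> raw 16-bit mono PCM.
def synthesize(params):
    samples = []
    for f0, duration_ms in params:
        count = SAMPLE_RATE * duration_ms // 1000
        for i in range(count):
            # Unvoiced phonemes (f0 == 0) are rendered as silence here.
            value = math.sin(2 * math.pi * f0 * i / SAMPLE_RATE) if f0 else 0.0
            samples.append(int(value * 0.5 * 32767))
    return struct.pack("<%dh" % len(samples), *samples)

phonemes = text_to_phonemes("hi you")
pcm = synthesize(estimate_parameters(phonemes))
```

Swapping `TOY_VOICE` for a different table changes how the output sounds without touching stages 1 or 3, which is the sense in which only the middle part is voice-specific.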

in reply to miki

Alex Hall

in reply to miki 2 years ago

@miki That sounds both hard and confusing, especially for someone like me who lacks any real understanding of the details.


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

I don’t really understand the math either. Most of it is handled automatically behind the scenes. The only thing you need to know is that text processing (whether plugin is pronounced like plug in or like ploo gin) is controlled programmatically, via language rules, while whether the voice sounds male, female, old, young, like you or like me depends on the training data.

in reply to miki

Alex Hall

in reply to miki 2 years ago

@miki Can anyone do this? It seems like it should be pretty language-agnostic, if someone can adjust things.


in reply to Alex Hall

miki

in reply to Alex Hall 2 years ago

Do what? Change the language rules? Yeah, they use something called Foma for doing the processing, so if you get the hang of the syntax it uses, you can do whatever. Getting all the linguistic knowledge to figure out what the rules should be is a different problem entirely. English is notoriously difficult in that respect; see e.g. the o in woman vs. the o in women, or the th in three vs. the th in though vs. the th in lighthouse. I haven’t used RH Voice extensively enough to know how much of that work is already done and how many elusive exceptions remain.
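
As a rough illustration of why the exceptions are the hard part, here is a toy letter-to-sound converter in Python. It is not Foma syntax and not how RH Voice stores its rules; `RULES`, `EXCEPTIONS`, and `to_phonemes` are hypothetical names, and the phoneme symbols are only loosely ARPAbet-like.

```python
# General rewrite rules, tried in order at each position in the word.
RULES = [
    ("th", ["TH"]),  # default guess: voiceless th, as in "three"
    ("ee", ["IY"]),
    ("r", ["R"]),
]

# Words the general rules get wrong have to be listed by hand.
EXCEPTIONS = {
    "though": ["DH", "OW"],                           # voiced th
    "lighthouse": ["L", "AY", "T", "HH", "AW", "S"],  # t + h, not a th sound
}

def to_phonemes(word):
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    out, i = [], 0
    while i < len(word):
        for graph, phones in RULES:
            if word.startswith(graph, i):
                out.extend(phones)
                i += len(graph)
                break
        else:
            out.append("?")  # no rule covers this letter
            i += 1
    return out
```

With these tables, `to_phonemes("three")` goes through the general rules, while "though" and "lighthouse" only come out right because someone knew to add them to `EXCEPTIONS`.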
