
So this looks like a high-quality, fast, natural, and open-source TTS system in Python. A key candidate for an #NVDA #addon. Unfortunately, I find #nvdasr addon development super confusing. Is there a good template to start from or something? github.com/thewh1teagle/kokoro-onnx


in reply to Samuel Proulx

Here's a much longer example of the quality of speech Kokoro TTS generates. I really do think it might be a decent #NVDA addon. The weird pauses are because I'm just giving it a big long string, rather than chunking it like I should. It generates this in real time on CPU, and faster on GPU. The code to generate it is as follows:
import soundfile as sf
from kokoro_onnx import Kokoro
from onnxruntime import InferenceSession

# Try the ROCm (GPU) provider first, falling back to CPU if it's unavailable
session = InferenceSession(
    "kokoro-v0_19.onnx",
    providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
)
kokoro = Kokoro.from_session(session, "voices.json")
samples, sample_rate = kokoro.create(
    "He wasn't sleeping very well, and he knew the people around him noticed, but he didn't know what to do about it. He had quietly gone to Madame Pomfrey, who had regretfully told him that Dreamless Sleep was highly addicting and that while she could give him the occasional dose, it would have to be spread out enough to prevent it from becoming addicting – meaning he could only take it one night out of every two weeks or so. It was one night more of productive sleep than he'd be getting otherwise, so he still did it, but it didn't help the larger issue. He wasn't under the effects of any nightmare-inducing Curses, potions, or other magical ailments, so there was nothing for Madame Pomfrey to do. The nightmares were coming from his own mind, and she was not a Mind-Healer. She'd offered to try and connect Harry with one, but when Harry discovered that it involved having someone else quite literally entering his mind with magic and helping him sort out things like trauma he couldn't. If Harry couldn't even tell Hermione the extent of what he'd suffered at the Dursley's, he wasn't about to let a stranger into his mind to see it. Let alone the 'adventures' of his Hogwarts years. So the nightmares persisted, and with the poor quality of sleep serving as the first domino, everything else slowly began to fall. His grades weren't slipping yet, but he was struggling with the study schedule Hermione had set out for them and doing his homework took more effort, more energy that he didn't have.",
    voice="af_sarah",
    speed=1.0,
    lang="en-us",
)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")
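The "weird pauses" mentioned above come from feeding the whole passage to the synthesizer as one string. A minimal sketch of sentence-level chunking, using only a stdlib regex split (the `chunk_text` helper and its `max_chars` parameter are hypothetical illustrations, not part of kokoro-onnx):

```python
import re

def chunk_text(text, max_chars=300):
    """Split text into sentence-sized chunks, packing adjacent short
    sentences together up to max_chars, so each chunk can be synthesized
    and played as soon as it's ready."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = ("He wasn't sleeping very well. He had quietly gone to Madame "
        "Pomfrey. So the nightmares persisted!")
for chunk in chunk_text(text, max_chars=60):
    print(chunk)
```

Each chunk would then be passed to `kokoro.create(...)` separately and the resulting audio queued for playback, so the model never sees one giant string.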

#nvda
in reply to Samuel Proulx

@FreakyFwoof Yeah, that sounds amazing. I would love to read stuff with that synthesiser.
in reply to Andre Louis

ha. I know very little about how we could get it compiled right into the add-on. (I know there was a discussion of this earlier, so if that build process for onnxruntime into the add-on succeeded, I'd love some basic copy.) For anyone wanting to try, I think looking at something like the Brailab driver might work: it's super minimal, and in the end all you're really going to use are the getters and setters for the synth driver (the way you do speech is obviously not at all like Brailab), then craft in the code to open the audio stream. But between the latest family emergency and work at Spotify with the new year / new projects, I'm afraid I'll be too swamped for a while to give it that truly comparative look. I'd also love to see a test of how quickly it can synthesize speech on slower CPUs, especially when that speech is interrupted mid-utterance: how does it handle stopping a stream and loading a new one, and is there a lot of latency? A simple Python test that just throws lots of speech chunks at it, stops, and starts would give us an idea of whether it's worth turning into a driver just yet.
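That "simple py test" could start as a timing harness like the sketch below. Everything here is a hypothetical stand-in: `benchmark_synth` and `fake_synth` are illustration names, and in a real run the synthesize callback would wrap `kokoro.create(...)` from the example above. Interrupting speech maps to simply abandoning the remaining chunks, so per-chunk latency approximates the worst-case stop/restart delay:

```python
import time

def benchmark_synth(synthesize, chunks):
    """Synthesize each chunk in turn and record wall-clock latency in
    seconds. Returns one latency per chunk."""
    latencies = []
    for chunk in chunks:
        start = time.perf_counter()
        synthesize(chunk)  # real test: kokoro.create(chunk, voice=..., ...)
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    # Stand-in synthesizer so the harness runs without a model file
    def fake_synth(text):
        time.sleep(0.01)

    chunks = ["Hello there.", "A short chunk.", "Another one."]
    for chunk, t in zip(chunks, benchmark_synth(fake_synth, chunks)):
        print(f"{t * 1000:6.1f} ms  {chunk}")
```

On a slow CPU, the number to watch is the per-chunk latency for short chunks: that is roughly how long NVDA would stall between a cancel and the next utterance starting.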
in reply to Tamas G

Sorry to hear about family emergencies, never nice to deal with. I hope things can be sorted out for the better.

Re slow CPUs though, that's where I come in. Even right now, I am using an Intel Core i5-3570K from 2012. It runs every synth very well, apart from Piper, which it struggles with due to the neural aspect of it. If my machine can run... whatever you guys end up coming up with (hopefully), then anything else should be a breeze.

in reply to Andre Louis

I have an even slower one. Yay for countries in the middle of... well, somewhere, and computers from 2009. Haha, if something can even run on that, I'd be surprised. How's that for a slow processor? It's pretty ancient. The synth sounds nice, yeah; I don't like how it reads "hashtag", but I guess that's me. There's also something about question marks it clearly missed, but I think it needs to be fed a bigger chunk of text to see if it'll sound better. Otherwise, for the quality... bleh, either my ears or something else don't consider it great in sound terms, but for a TTS, I guess it's good. Says the person who daily drives a TTS that came out in 2001. LOL.
in reply to Andre Louis

A synth that does English people no good, haha. And I have a Dell from 2009; it still runs a 32-bit version of Windows 10, so that tells you something. :D
in reply to Winter blue tardis🇧🇬🇭🇺

I also cannot tell you the full specs; the computer's not here, sadly. It has a removable battery, though, that gave up a long time ago. Then I fell down some stairs while carrying said computer, the pixels in the screen went poof, and no screen.
in reply to Peter Vágner

@pvagner @mush42 I'm not sure. I do kind of worry about a TTS developed by and for blind people, and whether it can be kept up to date and maintained.
in reply to Samuel Proulx

@Samuel Proulx I understand @Musharraf has made very significant progress, for example as compared to Piper TTS. To me it looks like it's much lighter for both training and using the trained model, even enhancing audio quality and intelligibility in the process. This is just my guess, but with such an achievement it's fine not to limit it to a blind audience exclusively. This is how I see #optispeech. However, I haven't played with Kokoro TTS, which is why I asked how much you like it compared to something else, perhaps Piper TTS if you know that one.
in reply to Peter Vágner

@pvagner @mush42 I like Kokoro much better than Piper. It sounds more natural with fewer artifacts.