Exciting news on open-source neural voices!
Our first experiment is complete with fantastic results! Check out the audio sample attached to this post.
This month, @pneumasolutions provided the GPU resources for training. I really appreciate their contribution.
This is just the beginning. To keep training going, I'm still accepting donations. Any amount helps.
I'm happy to receive your donations via PayPal:
paypal.me/geotts
Please mention mush42/tts in the notes.
#SpeechSynthesis #AI #ML


in reply to Musharraf

Wow, that sounds way better. I hope to see this working with NVDA in the future.
in reply to Musharraf

Does this use Piper, or something else? It sounds a bit flatter than other Piper voices.
in reply to the esoteric programmer

@esoteric_programmer
It uses OptiSpeech, which I developed based on recent advances in neural TTS.
Piper is based on VITS, which dates back to 2021.
github.com/mush42/optispeech/
in reply to Musharraf

Is that capable of working in low-resource environments? Would the resulting model be able to generate samples fast enough for most screen reader use? Or is this not meant for that use case?
in reply to the esoteric programmer

@esoteric_programmer
The model is designed from the ground up to run on the CPU for use with a screen reader.
It takes a lot of experimentation to strike the right balance between model efficiency and output quality. But I'm getting there!
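
For readers wondering what "fast enough" means here: screen reader TTS is usually judged by its real-time factor (RTF), the ratio of synthesis time to audio duration. A minimal measurement sketch, assuming a hypothetical `synthesize` function that returns raw audio samples (the sample rate below is illustrative, not stated in the thread):

```python
import time

SAMPLE_RATE = 22050  # illustrative output sample rate; not stated in the thread

def real_time_factor(synthesize, text: str) -> float:
    """Return synthesis time divided by audio duration.

    `synthesize` is a stand-in for any TTS call that returns a 1-D
    sequence of audio samples. An RTF well below 1.0 on a single
    CPU core is what responsive screen reader use demands.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / SAMPLE_RATE)
```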
in reply to Musharraf

Wow, it sounds so much better than VITS! It sounds like it's not even based on HiFi-GAN stuff.
in reply to spacedoggy

@spacepup
It is not based on VITS.
While the underlying repo supports multiple model architectures, this particular run is based on the ConvNeXt-TTS architecture:
ieeexplore.ieee.org/document/1…
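
For the curious, a ConvNeXt block adapted to 1-D feature sequences typically pairs a depthwise convolution over the time axis with an inverted-bottleneck MLP. A minimal PyTorch sketch of that idea; channel sizes and kernel width here are illustrative, not taken from the paper or the OptiSpeech code:

```python
import torch
from torch import nn

class ConvNeXt1dBlock(nn.Module):
    """Sketch of a ConvNeXt-style block over 1-D feature sequences."""

    def __init__(self, dim: int = 256, expansion: int = 4, kernel_size: int = 7):
        super().__init__()
        # Depthwise conv mixes information along the time axis only.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # Pointwise layers form an inverted bottleneck (dim -> 4*dim -> dim).
        self.pw1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        residual = x
        x = self.dwconv(x)
        x = x.transpose(1, 2)  # (batch, time, channels) for LayerNorm
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        x = x.transpose(1, 2)
        return residual + x
```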
in reply to Musharraf

May I have a link to your OptiSpeech HFC female checkpoint? I'd like to test this; I am really intrigued.
in reply to spacedoggy

@spacepup
Glad to hear this!
For now, I'm not making checkpoints available because they're still unstable.
While initial results are promising, I discovered an issue with the FFT parameters. I've fixed it and am currently waiting for the server to become available so I can fine-tune with the corrected data.
Once output quality is consistent, I'll publish an online demo and make the pretrained checkpoints available via Hugging Face.
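
"FFT parameters" here refers to the spectrogram analysis settings (FFT size, hop length, window length, mel bands) used when preparing training data; if they drift between preprocessing runs or disagree with what the vocoder expects, output quality degrades. A minimal extraction sketch with illustrative values, since the actual OptiSpeech settings aren't stated in the thread:

```python
import librosa

# Illustrative analysis settings; the real training values are not
# given in the thread. The point is that they must stay consistent
# across the whole dataset and match the vocoder's expectations.
SR = 22050
N_FFT = 1024
HOP_LENGTH = 256
WIN_LENGTH = 1024
N_MELS = 80

def mel_spectrogram(path: str):
    """Extract a mel spectrogram with explicit, consistent FFT parameters."""
    y, _ = librosa.load(path, sr=SR)
    return librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH, n_mels=N_MELS)
```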