Exciting news on open-source neural voices!
Our first experiment is complete with fantastic results! Check out the audio sample attached to this post.
For this month, @pneumasolutions provided GPU resources for training. I really appreciate their contribution.
This is just the beginning. To keep training going, I'm still accepting donations. Any amount helps.
I'm happy to receive your donations via PayPal:
paypal.me/geotts
Please mention mush42/tts in the notes.
#SpeechSynthesis #AI #ML
Musharraf (in reply to the esoteric programmer):
It uses OptiSpeech, which I developed based on recent advances in neural TTS.
Piper is based on VITS, an architecture that dates back to 2021.
github.com/mush42/optispeech/
GitHub: mush42/optispeech, a lightweight end-to-end text-to-speech model.

Musharraf (in reply to the esoteric programmer):
The model is designed from the ground up to run on the CPU for use with a screen reader.
It takes a lot of experimentation to strike the right balance between model efficiency and output quality. But I'm getting there!
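For screen-reader use, the practical target is a real-time factor (RTF) well below 1.0 on a CPU, i.e. synthesis takes less time than the audio it produces. A minimal sketch of how one might measure this; the `synthesize` callable and sample rate here are illustrative assumptions, not part of OptiSpeech's API:

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int) -> float:
    """Wall-clock synthesis time divided by the duration of the generated audio.
    RTF < 1.0 means synthesis is faster than real time."""
    start = time.perf_counter()
    audio = synthesize(text)                  # expected: a sequence of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

# Example with a stand-in synthesizer that emits one second of silence:
dummy = lambda text: [0.0] * 22050
rtf = real_time_factor(dummy, "hello world", sample_rate=22050)
```

The same measurement applied to a real model on a typical laptop CPU is what "efficiency" means in practice here.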
Musharraf (in reply to spacedoggy):
It is not based on VITS. While the underlying repo supports multiple model architectures, this particular run uses the ConvNeXt-TTS architecture:
ieeexplore.ieee.org/document/1…
ConvNeXt-TTS and ConvNeXt-VC: ConvNeXt-Based Fast End-to-End Sequence-to-Sequence Text-to-Speech and Voice Conversion
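The paper above has the details; as a rough sketch of the core building block, a 1-D ConvNeXt block is a depthwise convolution followed by LayerNorm, an inverted-bottleneck MLP, and a residual connection. The following assumes PyTorch, and the hyperparameters are illustrative, not taken from the paper or from OptiSpeech:

```python
import torch
from torch import nn

class ConvNeXtBlock1d(nn.Module):
    """1-D ConvNeXt block: depthwise conv -> LayerNorm ->
    pointwise expand -> GELU -> pointwise project, plus a residual."""

    def __init__(self, dim: int, expansion: int = 3, kernel_size: int = 7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):              # x: (batch, dim, time)
        residual = x
        x = self.dwconv(x)
        x = x.transpose(1, 2)          # (batch, time, dim) for norm/linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.transpose(1, 2)
        return residual + x
```

Because the heavy lifting is depthwise and pointwise convolutions rather than attention or recurrence, blocks like this tend to be fast on CPUs, which is the appeal for this use case.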
Musharraf (in reply to spacedoggy):
Glad to hear this!
For now, I'm not releasing checkpoints because they are still unstable.
While initial results are promising, I discovered an issue with the FFT parameters. I've fixed it, and I'm currently waiting for the server to become available so I can fine-tune with the corrected data.
Once the output quality is consistent, I'll publish an online demo and make the pretrained checkpoints available on Hugging Face.
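The post doesn't say which FFT parameter was off, but here is a hedged illustration of why such parameters matter: STFT settings determine how many mel frames the features have per second, so a mismatch means the acoustic features no longer line up with what the model was trained on. A minimal sketch, assuming librosa-style centered framing:

```python
def num_stft_frames(n_samples: int, hop_length: int) -> int:
    """Frame count for a centered STFT (librosa-style edge padding):
    one frame every hop_length samples, plus one for the padded edges."""
    return 1 + n_samples // hop_length

sr = 22050                                        # assumed sample rate
frames_a = num_stft_frames(sr, hop_length=256)    # 87 frames per second
frames_b = num_stft_frames(sr, hop_length=300)    # 74 frames per second
```

Training on features extracted with one hop length and fine-tuning with another silently changes the frame rate, which is exactly the kind of subtle corruption that only shows up as degraded output quality.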