

Hey Mastodon!
#helpwanted

I've been quietly working on a fast and lightweight neural Text-To-Speech (TTS) model for NVDA/SAPI.

The next step is training the model, and that requires some serious GPU power. Unfortunately, those resources are a bit out of my reach right now.

This is where I could really use your help, if you're interested!


in reply to Musharraf :verified:

This is where I could really use your help, if you're interested!
- Help with training costs: I've been fortunate to receive a grant from Google's TRC program, but there are some additional expenses. Any contribution would be incredibly helpful.
- Donating spare GPU power or Colab credit: Even a little bit would be a huge boost!
in reply to Musharraf :verified:

All of my work on neural TTS is completely free and open-source.
github.com/mush42/sonata-nvda
github.com/mush42/sonata

Together, we can make high-quality TTS technology a reality for more people.

in reply to Musharraf :verified:

The model is designed for on-device text-to-speech.
This means it is efficient and has low latency.
It is open-source:
github.com/mush42/optispeech
in reply to Nick Giannak III

@nick
This is more efficient and lightweight. Compared to Piper, this model is more responsive and requires fewer system resources.
It is also a modern TTS implementation; I drew on papers published in 2023-2024.
in reply to Musharraf :verified:

Gotcha. I might throw you a few dollars to make it go, especially if you can come up with training data that won't have the pronunciation problems that existed with Piper.
in reply to Nick Giannak III

@nick
Currently I'm working exclusively on the model architecture, which will resolve some of those issues.
I'll train on freely available, high-quality datasets; creating a new dataset from scratch is beyond my current resources.
I'll leave that for later, and I can help anyone who wants to take up this task.
in reply to Musharraf :verified:

Hey, the devil is in the details. If we need a new dataset, then we'll see how we can go about funding it.
in reply to Nick Giannak III

@nick
A good dataset would definitely help a lot, not only me but any future developer working in this field.
Also, a high-quality dataset can build a bridge between our community and academia, where the major TTS breakthroughs happen: we give you our dataset to evaluate your models, and you let us use your great TTS model architectures.
in reply to Musharraf :verified:

@Musharraf :verified: Please, when preparing the dataset, what's the difference between 'train' and 'val'? If I have a single-speaker recording, what do I put in those folders?
in reply to Peter Vágner

@pvagner
In machine learning, a dataset is usually divided into two splits.
The 'train' split is the larger one and is used as input for model training.
The 'val' (validation) split is relatively small and is used to evaluate how the model is performing during training.
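Roughly speaking, for a single-speaker dataset it usually looks like this (the exact folder names can differ depending on the recipe you follow, so treat this only as an illustration):

train/  <- roughly 95-99% of your recordings and their transcripts
val/    <- the small held-out remainder, in the same format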
in reply to Peter Vágner

@pvagner
Assuming you have a list of wav files and their corresponding transcriptions, you first need to decide on the size of each split.
Depending on the size of your dataset, you can split it 95%-5% or 99%-1% between train and val respectively.
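If it helps, here is a rough Python sketch of one way to do that split. It assumes your audio lives in a wavs/ folder and your transcripts in an LJSpeech-style metadata.csv with "file_id|transcript" lines; those names are my assumptions for the example, so adapt them to whatever layout your training recipe actually expects.

import random
import shutil
from pathlib import Path

# Assumed inputs: wavs/<file_id>.wav plus a metadata.csv of "file_id|transcript" lines.
wav_dir = Path("wavs")
entries = [line.split("|", 1)
           for line in Path("metadata.csv").read_text(encoding="utf-8").splitlines()
           if line.strip()]

# Shuffle once so the validation set is not biased toward the end of the recording sessions.
random.seed(42)
random.shuffle(entries)

# 95% train / 5% val; use 99%/1% instead for very large datasets.
val_size = max(1, int(len(entries) * 0.05))
splits = {"val": entries[:val_size], "train": entries[val_size:]}

for name, split in splits.items():
    out_dir = Path(name)
    (out_dir / "wavs").mkdir(parents=True, exist_ok=True)
    with open(out_dir / "metadata.csv", "w", encoding="utf-8") as meta:
        for file_id, transcript in split:
            shutil.copy(wav_dir / f"{file_id}.wav", out_dir / "wavs" / f"{file_id}.wav")
            meta.write(f"{file_id}|{transcript}\n")
    print(f"{name}: {len(split)} utterances")

Run it from the folder that contains wavs/ and metadata.csv; it creates train/ and val/ folders, each with its own wavs/ subfolder and metadata.csv.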
in reply to Peter Vágner

@pvagner
I can help you with preparing your dataset.
Please DM me with any questions you have.
in reply to Tomecki

@tomecki
Simply put:
Sonata is an inference engine that can theoretically drive any TTS model.
OptiSpeech is an actual model that generates speech, and the quality of the output depends on it.
in reply to Luis Carlos

@luiscarlosgonzalez @NVAccess

I'm afraid I cannot. The whole point of this and Sonata is to create high-quality, but very efficient and lightweight, neural TTS. Coqui-TTS models are neither efficient nor lightweight.

in reply to Luis Carlos

@luiscarlosgonzalez
Tortoise TTS is too heavy even for a high-end server, let alone a standard computer or a mobile device.
This system is designed specifically for running on a standard CPU.
in reply to Musharraf :verified:

Hey, I'd like to contribute some financial support. Can you ballpark how much you'd need to spend to train a voice?
in reply to Scott

[Three replies hidden behind a "Money" content warning]

in reply to Roberto Perez

@rperez030
Thanks for your contribution. Really appreciate it!
For this month, Pneuma Solutions has provided GPU resources for training, but I'm still accepting donations to make sure we have enough resources for future training needs.
I'm happy to receive your donations via my colleague's PayPal address:
paypal.me/geotts
or by email:
info@geotts.ge
Please mention mush42/tts in the transaction note.
in reply to Musharraf :verified:

@alexhall I have a GPU that I would love to rent out for this project. How would I get started?