So I've seen this talked about here a bit, but I wanted to give more context on Kokoro TTS. This model was open-sourced back on December 25 and was trained almost entirely on synthetic data generated by ElevenLabs and OpenAI voices. Legality aside, the quality speaks for itself. This is an 82-million-parameter model, which is very small by today's standards, but that means it's incredibly fast, even on CPU.
The main dev responsible for training seems to know much more than the average open source enthusiast about how to make high-quality TTS, and the results show it. The model is under very active development and still quite young: more data is currently being collected, and a new version will likely be trained and released in the coming months. Their Discord is quite active, and I'm over there as well if you'd like to join. I think this has the potential to be a great option for blind screen reader users who may not be able to afford something like Vocalizer on Windows, but we're not quite there just yet in terms of performance.
Here is a demo of one of the voices reading about Android.
Link to model card on Huggingface: huggingface.co/hexgrad/Kokoro-…
Link to Discord: discord.gg/QuGxSWBfQy
Peter Vágner (in reply to Tamas G):
Do I understand correctly that Kokoro is an adaptation of StyleTTS2, specifically for English?
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
@Winter blue tardis🇧🇬🇭🇺 I hope you don't mind me being so curious. I am learning slowly and working on these things even more slowly, but I have already helped to train a Slovak Piper voice. The result was not that awful, however it was not great either. So for a few months now I have been working on improving the eSpeak phonemization rules for the Slovak language, trying to make sure all the language-specific features are respected as much as possible even before training. I know at least Piper and OptiSpeech use eSpeak for phonemization under the hood. These days another Piper training run of a Slovak voice with these improvements is running over here, and it turns out it sounds much better in terms of pronunciation. Only time will tell whether people will like it, though. All this work is based on previous work my friends and I did while making Slovak voices for RHVoice TTS, so we have text prompts and high-quality voice recordings for Slovak.
And now those questions:
Apart from the robotic-sounding voice, how do you like the Bulgarian eSpeak pronunciation?
Is it similar in complexity to Russian? I can't speak or understand Russian, but I do know the Russian eSpeak rules include a huge list of exceptions, and Russian-speaking people still don't like it that much.
Do you have high-quality Bulgarian recordings of a single speaker, or do you know of a public dataset that may include such recordings?
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
Perhaps I am misremembering, but I guess you might also be able to understand Hungarian, and there are some recordings for Hungarian you might be able to experiment with if you are not going to train with English data.
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
So while I can't really understand or speak other languages, I might be able to try answering some of your questions, either alone or in cooperation with the other folks helping within this team, and if you are passionate enough I think it's very likely you will be able to achieve a great result.
Again, let me repeat: @Zvonimir Stanecic is the number one language expert leading the teams working on Czech, Hungarian, Croatian, Serbian, Slovak, and other language support for RHVoice. I think @Cleverson you might find this story of mine and my friends inspiring too.
Peter Vágner (in reply to Cleverson):
Each engine I've worked with so far does at least two stages with the text it's asked to speak.
First, it transforms all the written letters into its internal representation of individual sounds, aka phonemes.
Within this part, none of the audio data is involved at all, and it does not matter whether we have formant synthesis like eSpeak or HTK-based synthesis like RHVoice; we have simply disassembled the text phrases into sounds and written that down as a code.
The engine then uses this data to produce speech according to the trained model.
So when I say we need eSpeak while training Piper or OptiSpeech, I mean we will be using its phonemizer regardless of the audio data we use for training.
Real linguists are able to apply knowledge they have acquired from the phonology and morphology of the language. It's predictable, or at least widely known, so even though we are not linguists, if we are motivated enough we can gradually improve this part and continue tweaking the phonemizer until we like the pronunciation.
So the engine won't be learning this part during training.
Programming the actual TTS signal processing is a much more involved task. I think we can't do that on our own, so we defer to the model: it trains itself to inherit characteristics like sound, intonation, inflection, and loads of other properties from the audio recordings we use for training our chosen engine.
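The two-stage split described above can be sketched in a few lines of Python. This is only a toy illustration: the mini-lexicon and function names are invented for the example, and real engines like eSpeak NG use full per-language rule sets rather than a word dictionary.

```python
# Toy illustration of the two-stage TTS pipeline described above.
# Stage 1 (phonemization) is rule/dictionary driven and involves no audio;
# stage 2 (synthesis) is where a trained model turns phonemes into sound.
# The tiny lexicon below is invented for this example only.

TOY_LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

def phonemize(text: str) -> list[str]:
    """Stage 1: map written words to phoneme symbols (no audio involved)."""
    phonemes = []
    for word in text.lower().split():
        # Fall back to spelling out letters for unknown words.
        phonemes.extend(TOY_LEXICON.get(word, list(word)))
    return phonemes

def synthesize(phonemes: list[str]) -> bytes:
    """Stage 2 placeholder: a real engine feeds the phonemes to a trained
    model (Piper, OptiSpeech, ...) that produces an audio waveform."""
    return " ".join(phonemes).encode("utf-8")  # stand-in for audio samples

print(phonemize("hello world"))
```

The point is that improving `TOY_LEXICON` (or, in reality, the eSpeak rule files) improves pronunciation without touching the trained model at all, which is exactly why the phonemizer can be tweaked before training begins.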
Mikołaj Hołysz (in reply to Zach Bennoui):
This confirms my hypothesis that the primary reason open source neural TTS is so bad is the lack of good datasets.
For some reason, there are many companies willing to open-source their LLMs, even though they're trained on books3 and other content scraped from the internet, but that isn't happening for TTS.
Peter Vágner (in reply to Cleverson):
I am training on my laptop, although it takes much more time than doing it on a high-performance GPU better suited for the task. Other people, including @Zach Bennoui and @Tamas G, are training in the cloud as described here: github.com/ZachB100/Piper-Trai…