So I've seen this talked about here a bit, but I wanted to give more context on Kokoro TTS. This model was open-sourced back on December 25 and was trained almost entirely on synthetic data taken from ElevenLabs and OpenAI. Legality aside, the quality speaks for itself. This is an 82 million parameter model, which is very small by today's standards, but that means it's incredibly fast even on CPU.

The main dev responsible for training seems to know much more than the average open source enthusiast about how to make high-quality TTS, and I think the results speak for themselves. The model is still quite young and under very active development; more data is currently being collected, and a new version will likely be trained and released in the coming months. Their Discord is quite active, and I'm over there as well if you'd like to join. I think this has the potential to be a great option for blind screen reader users who may not be able to afford something like Vocalizer on Windows, but we're not quite there just yet in terms of performance.

Here is a demo of one of the voices reading about Android.

Link to model card on Huggingface: huggingface.co/hexgrad/Kokoro-…
Link to Discord: discord.gg/QuGxSWBfQy


in reply to Zach Bennoui

That voice somehow oddly reminds me of Vocalizer Zoe. Hopefully some day we'll be able to fine-tune our own model from theirs with our voice training data; I'll probably jump more onboard with it once we get a notebook / more tools and instructions on doing so. But I'm glad to know others are making them aware of this potential, as it would be a shame to let an engine like that go to waste.
in reply to Zach Bennoui

Ooh, interesting, I hadn't heard of StyleTTS. Looks like they do have a way to train through notebooks (github.com/yl4579/StyleTTS2/di…), but does that mean it might be easier to get an NVDA driver for that specifically rather than for this model subset? Not sure.
in reply to Zach Bennoui

Would be interesting if you tried a sample inference in chunks that way for a simple dialog flow, but you would need to split the text manually, the same way the screen reader indexes it. For NVDA I've noticed this may be 2-4 chunks per flow: dialog name and dialog role (1), body text (2), focused item (3), additional state of that first-focused item such as collapsed (4). So you'd need to run inference on a passage with smaller splits like that to truly know, then merge all the wavs into one single file xD
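A minimal sketch of that chunk-then-merge idea, using Python's standard `wave` module for the merge. The `synth_chunk()` function here is a made-up placeholder for whatever real TTS call you'd use (piper, Kokoro, etc.), and the example dialog texts and the 50 ms-per-character silence are invented for illustration:

```python
import wave

SAMPLE_RATE = 22050

def synth_chunk(text: str) -> bytes:
    """Placeholder for a real TTS call. Here it just returns silence,
    roughly 50 ms per character, as 16-bit mono PCM frames."""
    n_samples = int(0.05 * SAMPLE_RATE) * len(text)
    return b"\x00\x00" * n_samples

def synth_dialog(chunks, out_path):
    """Run inference on each chunk separately, as the screen reader would
    speak them, then append all frames into one single wav file."""
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)      # mono
        out.setsampwidth(2)      # 16-bit samples
        out.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            out.writeframes(synth_chunk(chunk))

# The 2-4 chunks NVDA might speak for a dialog flow, in order:
chunks = [
    "Save changes dialog",                  # 1: dialog name and dialog role
    "Your document has unsaved changes.",   # 2: body text
    "Save button",                          # 3: focused item
    "unavailable",                          # 4: extra state of the focused item
]
synth_dialog(chunks, "dialog_flow.wav")
```

Since all chunks share the same sample rate and format, concatenating the raw frames is enough; mismatched rates would need resampling first.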
in reply to Peter Vágner

I would love to know how to train one of these and make a cool-sounding Bulgarian voice, if it's really lighter than the heavy stuff I assume we have at our disposal as far as voices go. Even my AMD processors don't take kindly to however these new neural voices are made. Now that'll be something to experiment with. :D
in reply to Winter blue tardis🇧🇬🇭🇺

@Winter blue tardis🇧🇬🇭🇺 I hope you don't mind me being so curious. I am learning slowly and working on these things even slower, but I have already helped to train a Slovak piper voice. The result was not that awful, however it was not great either. So for a few months now I have been working on improving eSpeak phonemization rules for the Slovak language, trying to make sure all the language-specific features are respected as much as possible even before training. I know at least piper and optispeech are using eSpeak for phonemization under the hood. These days another piper training run of a Slovak voice with these improvements is running over here, and it turns out it's sounding much better in terms of pronunciation. However, only time will tell if people will like it. All this work is based on previous work my friends and I did while making Slovak voices for RHVoice TTS, so we have text prompts and high quality voice recordings for Slovak.

And now those questions:
Apart from the robotic-sounding voice, how do you like the Bulgarian eSpeak pronunciation?
Is it similar in complexity to Russian? I can't speak or understand Russian, however I do know the Russian eSpeak rules include a huge list of exceptions, and Russian-speaking people still don't like it that much.
Do you have high quality Bulgarian recordings of a single speaker, or do you know of a public dataset that may include such recordings?

in reply to Winter blue tardis🇧🇬🇭🇺

@Winter blue tardis🇧🇬🇭🇺 I find it much more work to record high quality sound recordings of your voice. For Slovak, my friends managed to record more than 3000 sentences or short phrases. Nabu Casa have published some of the audio data they used to train their piper voices, and I think there are at least 1000 recordings of a single speaker per language dataset. If you are considering doing this work, perhaps try to research how to train on a different language / single-speaker audio dataset so you get familiar with the process; then you can evaluate whether you can do it on your own or with the help of friends.
Perhaps I am misremembering, but I guess you might also be able to understand Hungarian, and there are some recordings for Hungarian you might be able to experiment with if you are not going to learn with English data.
in reply to Peter Vágner

Oh, it sure sounds like a lot of work. Surely, if I had no job, I could do it; it seems time-consuming. And I understand basic phrases; I am learning, but slowly. I took Spanish as a second language in university, and have not graduated due to difficulties with the teachers and a lot of fighting for accessibility. It sucks. And the fact that Spanish isn't coming to me intuitively like, say, English is also not nice. But I can speak both English and Bulgarian quite fluently.
in reply to Winter blue tardis🇧🇬🇭🇺

@Winter blue tardis🇧🇬🇭🇺 My first attempt at a very small TTS-related contribution was almost 15 years ago. At that time Jonathan Duddington, the original eSpeak author, created a Slovak voice for eSpeak in reply to my request. I then tried to contribute some dictionary rules, so the Slovak voice for eSpeak started to be useful. That made me happy, and for years I was just using eSpeak on all my devices. Then, sometime at the end of 2022, I found out a group of talented people had created great Polish voices for the RHVoice TTS. I was wondering how those talented Polish guys were able to accomplish that; I started looking around and asked a lot of friends, but was still unable to get something useful done. Sometime in early 2023 I got in contact with @Zvonimir Stanecic, who had started working on a Czech voice for RHVoice. Under his stewardship I learned how to train RHVoice and how to write language rules for it. My friend Ondro shared our vision of a better-sounding Slovak voice, and thanks to his talent as a sound engineer and radio speaker here in Slovakia, he managed to prepare very high quality audio recordings we are now using in all these experiments. It took us some three or four weeks of very intensive daily work during April and May of 2023 to prepare a first beta version of the Slovak voice. When it became evident we hadn't wasted our time and our prototype voice started sounding useful, more close friends joined us by providing their feedback. Sometime in July 2023, again with @Zvonimir Stanecic leading the team, we were able to release the first Slovak male voice for #RHVoice. In 2024 two more friends joined the team and managed to record enough high quality audio to train a female voice, so now we have two Slovak voices for RHVoice. During 2024, besides training the female voice, we were also slowly improving the grapheme-to-phoneme translation, improving Slovak language support for RHVoice.
At that time we realized there are other, even better sounding TTS engines out there and wanted to use our recordings to train those. And now the circle is closing: we found out we need to improve the eSpeak phonemizer with the experience we earned while improving RHVoice's Slovak language support. It took us some six months of interrupted work to port the new RHVoice-specific language support back to eSpeak. We haven't even managed to make these available to the public yet. We are now experimenting with training piper and trying to improve RHVoice even further.
So while I can't really understand or speak other languages, I might be able to try answering some of your questions, either alone or in cooperation with the other guys helping within this team, and if you are passionate enough I think it's very likely you will be able to achieve great results.
Again, let me repeat: @Zvonimir Stanecic is the number one language expert, leading teams working on Czech, Hungarian, Croatian, Serbian, Slovak and other language support for RHVoice. @Cleverson, I think you might find this story of mine and my friends inspiring too.
in reply to Cleverson

@Cleverson Hopefully I am not being rude by asking this. Feel free to ignore it if you think it's not appropriate. Since you were involved with the Brazilian Portuguese voice, do you by any chance have access to the prompts and recordings that were used to train the Brazilian Portuguese voice for RHVoice? If you were allowed to reuse these, you might be able to use that data to train other engines such as piper.
in reply to Cleverson

@Cleverson I am not sure I explained this very well, so I'll try again in different words.
Each engine I've worked with so far performs at least two stages on the text it's asked to speak.
First, it transforms all the written letters into its internal representation of individual sounds, aka phonemes.
Within this part, none of the audio data is involved at all, and it does not matter whether we have formant synthesis like eSpeak or HTK-based synthesis like RHVoice; we have just kind of disassembled the text phrases into sounds and written that down as a code.
The engine then uses this data to produce speech according to the trained model.
So when I say we need eSpeak while training piper or optispeech, I mean we will be using its phonemizer regardless of the audio data we use for training.
Real linguists are able to apply knowledge they have acquired from the phonology and morphology of the language. It's predictable, or at least widely known, so even though we are not linguists, if we are motivated enough we can gradually improve this part and keep tweaking the phonemizer until we like the pronunciation.
So the engine won't be learning this part while training.
Programming the actual TTS signal processing is a much more involved task; I think we can't do it on our own, so we defer to the model. It trains itself to inherit characteristics like overall sound, intonation, inflection and loads of other properties from the audio recordings we use for training our chosen engine.
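The first stage can be sketched in a few lines of Python. Everything below is a toy illustration, not real eSpeak data: the rule table and the `phonemize()` helper are invented to show that grapheme-to-phoneme conversion is pure text rules, with no audio involved, while stage two (phonemes to sound) is what the trained model handles:

```python
# Stage 1 toy sketch: a tiny grapheme-to-phoneme rule table, longest match
# first. A real phonemizer like eSpeak's has per-language rule files plus
# exception dictionaries, but the principle is the same: text in, phonemes out.
G2P_RULES = {
    "ch": "x",   # hypothetical digraph rule: "ch" maps to one phoneme
    "a": "a",
    "c": "ts",
    "h": "h",
    "o": "o",
}

def phonemize(word: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(word):
        # try the longest rule first so "ch" wins over "c" + "h"
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in G2P_RULES:
                phonemes.append(G2P_RULES[chunk])
                i += length
                break
        else:
            i += 1  # no rule for this letter; skip it

    # Stage 2 would happen here: a trained model (piper, RHVoice, ...)
    # consumes this phoneme sequence and produces the actual audio.
    return phonemes

print(phonemize("chao"))  # -> ['x', 'a', 'o']
```

Improving pronunciation means editing the rule table, which never requires retraining; retraining only changes how the resulting phonemes sound.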
in reply to Cleverson

@clv0@cwb.social eSpeak is being used internally. We need to have text prompts and matching audio recordings of the voice whose characteristics we wish to hear in the final result. I mentioned your RHVoice contribution because I assumed it might be easiest for you to get or ask for those. If you are thinking of other high quality recordings, they will be fine too, I assume.
in reply to Peter Vágner

@pvagner @clv0 Interesting that you helped the Slovak voice like that :) great dedication. Around 2009 or so we got a Hungarian voice in eSpeak after Jonathan used my recordings to construct the phoneme data, but I must say Hungarian RHVoice does have the better accent, although eSpeak is still not unpleasant for longer passages like Vocalizer can be, so still a win. Since some of the RHVoice voices are taken from Piper and other open-source data, voice quality can be a bit inconsistent, but the phoneme data is a bit better.
in reply to Cleverson

@clv0 @pvagner @tardis And a small note: I am also still learning, and I will have knowledge to pass on very soon, but I will need to systematize it. It is also regarding our RHVoice. I am learning the labelling.xml thing, the low-level stuff which flags the properties of the trained voice, not just the foma code. I am happy that I can work with other people when it comes to maintaining and developing new languages.
in reply to Cleverson

@Cleverson If you like the Brazilian Portuguese eSpeak pronunciation and can either record or otherwise source good Brazilian Portuguese text prompts and corresponding audio recordings, I can try to help you do the same thing for Brazilian Portuguese that I am doing for Slovak: training either piper or optispeech at the moment, and perhaps other engines in the future.
I am training on my laptop, although it takes much more time than doing it on a high performance GPU better suited for the task. Other people, including @Zach Bennoui and @Tamas G, are training in the cloud as described here: github.com/ZachB100/Piper-Trai…


in reply to Peter Vágner

@pvagner @Tamasg @clv0 Thanks for mentioning my training guide here. It's a little bit out of date, but I'm more than willing to help with any questions you guys may have. I'm very passionate about this stuff and have been heavily invested in open source TTS over the past few years. Unlike you, I have very little experience with some of the older engines such as RHVoice, but I would love to eventually learn enough to train a better quality US English voice for them that's a bit more expressive than what they currently offer.