So I've seen this talked about here a bit, but I wanted to give more context on Kokoro TTS. This model was open-sourced back on December 25 and was trained almost entirely on synthetic data generated by ElevenLabs and OpenAI voices. Legality aside, the quality speaks for itself. This is an 82-million-parameter model, which is very small by today's standards, but that means it's incredibly fast, even on CPU.
The main dev responsible for training seems to know much more than the average open source enthusiast about how to make high-quality TTS, and the results show it. The model is under very active development and still quite young: more data is currently being collected, and a new version will likely be trained and released in the coming months. Their Discord is quite active, and I'm over there as well if you'd like to join. I think this has the potential to be a great option for blind screen reader users who may not be able to afford something like Vocalizer on Windows, but we're not quite there just yet in terms of performance.
Here is a demo of one of the voices reading about Android.
Link to model card on Huggingface: huggingface.co/hexgrad/Kokoro-…
Link to Discord: discord.gg/QuGxSWBfQy
Peter Vágner (in reply to Tamas G):
Do I understand correctly that Kokoro is an adaptation of StyleTTS2, specifically for English?
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
@Winter blue tardis🇧🇬🇭🇺 I hope you don't mind me being so curious. I am learning slowly and working on these things even more slowly, but I have already helped to train a Slovak Piper voice. The result was not that awful, however it was not great either. So for a few months now I have been working on improving the eSpeak phonemization rules for the Slovak language, trying to make sure all the language-specific features are respected as much as possible even before training. I know at least Piper and OptiSpeech use eSpeak for phonemization under the hood. These days another Piper training run of a Slovak voice with these improvements is running over here, and it turns out it sounds much better in terms of pronunciation. Only time will tell whether people will like it, though. All this work is based on previous work my friends and I did while making Slovak voices for RHVoice TTS, so we have text prompts and high-quality voice recordings for Slovak.
And now those questions:
Apart from the robotic-sounding voice, how do you like the Bulgarian eSpeak pronunciation?
Is it similar in complexity to Russian? I can't speak or understand Russian, but I do know the Russian eSpeak rules include a huge list of exceptions, and Russian-speaking people still don't like it that much.
Do you have high-quality Bulgarian recordings of a single speaker, or do you know of a public dataset that may include such recordings?
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
Perhaps I am misremembering, but I guess you might also be able to understand Hungarian, and there are some recordings for Hungarian you might be able to experiment with if you are not going to train with English data.
Peter Vágner (in reply to Winter blue tardis🇧🇬🇭🇺):
So while I can't really understand or speak other languages, I might be able to try answering some of your questions, either alone or in cooperation with the other folks helping within this team, and if you are passionate enough I think it's very likely you will be able to achieve a great result.
Again, let me repeat: @Zvonimir Stanecic is the number one language expert leading the teams working on Czech, Hungarian, Croatian, Serbian, Slovak, and other language support for RHVoice. I think @Cleverson you might find this story of mine and my friends inspiring too.
Peter Vágner (in reply to Cleverson):
Each engine I've worked with so far does at least two stages with the text it's asked to speak.
First, it transforms all the written letters into its internal representation of individual sounds, aka phonemes.
Within this part, none of the audio data is involved at all, and it does not matter whether we have formant synthesis like eSpeak or HTK-based synthesis like RHVoice; we have simply disassembled the text phrases into sounds and written that down as a code.
The engine then uses this data to produce speech according to the trained model.
So when I say we need eSpeak while training Piper or OptiSpeech, I mean we will be using its phonemizer regardless of the audio data we use for training.
Real linguists are able to apply knowledge they have acquired from the phonology and morphology of the language. It's predictable, or at least widely known, so even though we are not linguists, if we are motivated enough we can gradually improve this part and continue tweaking the phonemizer until we like the pronunciation.
So the engine won't be learning this part during training.
Programming the actual TTS signal processing is a much more involved task. I think we can't do that on our own, so we defer to the model: it trains itself to inherit characteristics like sound, intonation, inflection, and loads of other properties from the audio recordings we use for training our chosen engine.
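The two-stage split described above can be sketched in a few lines of Python. This is only a toy illustration: the mini-lexicon and function names are invented for the example, and real engines like eSpeak NG use full per-language rule sets rather than a word dictionary.

```python
# Toy illustration of the two-stage TTS pipeline described above.
# Stage 1 (phonemization) is rule/dictionary driven and involves no audio;
# stage 2 (synthesis) is where a trained model turns phonemes into sound.
# The tiny lexicon below is invented for this example only.

TOY_LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

def phonemize(text: str) -> list[str]:
    """Stage 1: map written words to phoneme symbols (no audio involved)."""
    phonemes = []
    for word in text.lower().split():
        # Fall back to spelling out letters for unknown words.
        phonemes.extend(TOY_LEXICON.get(word, list(word)))
    return phonemes

def synthesize(phonemes: list[str]) -> bytes:
    """Stage 2 placeholder: a real engine feeds the phonemes to a trained
    model (Piper, OptiSpeech, ...) that produces an audio waveform."""
    return " ".join(phonemes).encode("utf-8")  # stand-in for audio samples

print(phonemize("hello world"))
```

The point is that improving `TOY_LEXICON` (or, in reality, the eSpeak rule files) improves pronunciation without touching the trained model at all, which is exactly why the phonemizer can be tweaked before training begins.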
Mikołaj Hołysz (in reply to Zach Bennoui):
This confirms my hypothesis that the primary reason open source neural TTS is so bad is the lack of good datasets.
For some reason, there are many companies willing to open-source their LLMs, even though they're trained on books3 and other content scraped from the internet, but that isn't happening for TTS.
Peter Vágner (in reply to Cleverson):
I am training on my laptop, although it takes much more time than doing it on a high-performance GPU better suited for the task. Other people, including @Zach Bennoui and @Tamas G, are training in the cloud as described here: github.com/ZachB100/Piper-Trai…