F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching! The quality is pretty impressive for open source, and it even supports MPS on Mac! I was able to get it going on my Mac with no problem. #TTS #ML #AI
github.com/SWivid/F5-TTS
@ZBennoui

in reply to Luis Carlos

@luiscarlosgonzalez
Too heavy for local use. Unless you have a GPU with at least 8GB of VRAM. Or, you know, a Mac.
in reply to Musharraf

@mush42 I will have a computer with that GPU later, so my expectations are fairly high
in reply to Chi Kim

what's this, more precisely? is this some kind of tts? what does it sound like? I tried to understand something from the github, but it's kinda going over my head
in reply to the esoteric programmer

@esoteric_programmer
It converts the following text:
Every time I see someone light up, um, because of something I’ve made, it’s like, wow, a little piece of my inner child gets healed, you know? And, um, when...snip

To the attached speech.


in reply to Musharraf

@esoteric_programmer
You can easily plug this into an open-source LLM and get something akin to NotebookLM.
Totally free and open-source, with very high quality.
in reply to Musharraf

so, this does text completion and then generates speech using something like tts? is that correct so far? or do you attach audio of something, the model transcribes it and gets its meaning in whatever way that's considered meaning anyway, then concatenates your prompt text to that? that could create so, so many deepfakes, it's not even funny, if what I'm imagining is actually what's happening
in reply to the esoteric programmer

@esoteric_programmer
Other than the text completion part, you are almost correct.
You give it some text and an audio sample, and it tries to replicate the given voice's characteristics.
Research is active in the areas of speaker verification and audio deep fake detection to combat misuse.
in reply to Musharraf

aha, interesting! could I make it generate, say, something like a podcast? would I have to generate each part of the dialog in turn, then splice replies into the result?
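
For what it's worth, here is a rough sketch of what that turn-by-turn approach could look like. It only covers the splicing side. `synthesize` is a hypothetical stand-in for whatever call actually produces audio from a line of text plus a reference clip, so it is not F5-TTS's real API, and the names `build_dialog`, `voices`, and `gap_samples` are made up for illustration.

```python
def build_dialog(script, voices, synthesize, gap_samples=0):
    """Generate each turn separately and join the results into one track.

    script      -- list of (speaker, line) pairs, in speaking order
    voices      -- dict mapping each speaker to a reference audio clip
    synthesize  -- hypothetical callable (text, reference) -> list of samples
    gap_samples -- number of silent samples to insert between turns
    """
    track = []
    for speaker, line in script:
        samples = synthesize(line, voices[speaker])
        # Add a short silence before every turn except the first.
        if track and gap_samples:
            track.extend([0.0] * gap_samples)
        track.extend(samples)
    return track
```

The idea is just that each reply gets generated with its own speaker's reference clip, then everything is concatenated in order with an optional pause in between.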
in reply to Musharraf

I don't know about 11 labs and such, I don't use those, don't intend to either, but this seems like it'd be explicitly used for such a thing, by a lot of people, much more easily than with 11 labs. Yeah, I could just be imagining this wrong and blowing it out of proportion in my mind, that's always a possibility
in reply to the esoteric programmer

The short answer to this is, anything can be misused. Someone could create fake clips, but someone could also create an audio book narrated by their favorite narrator for personal use, or someone could create an app that uses a loved one's lost voice if they only have a few audio clips as well.
in reply to Musharraf

@mush42 @esoteric_programmer While the model is really really good, I find it has problems when trying to convert text that's more than a few lines. It will start splicing parts of the audio prompt into the result or just go kinda insane. I've only tried the HF space so far, but want to try running it on my Mac tomorrow to see if I can get better results. If I could get this to read articles to me that would be great.
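
One workaround I might try for the long-text problem is splitting the article into short, sentence-aligned chunks and generating each one separately. This is only a sketch of the splitting step, not anything F5-TTS itself provides. The function name `split_into_chunks` and the 200-character default are my own guesses, and whether it actually improves the output is untested.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks that aim to stay under max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk
    rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go through the model on its own with the same audio prompt, and the resulting clips get joined in order.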