Trying to use Elevenlabs to read out a solar forecast for a new feature being added to @BlindHams, and... yeah, it does odd things when encountering numbers if you don't specifically lock it to English, which, BTW, I don't yet know how to do through the API.

The easy fix is to just spell all numbers out as if they are spoken.
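Roughly what I have in mind, as a quick sketch (this uses the num2words Python package and a regex pass; the sample forecast line is just something I made up, not the real feed):

```python
# Minimal sketch: spell out every number before handing the text to the TTS API.
# Assumes the num2words package is installed (pip install num2words).
import re
from num2words import num2words

def spell_out_numbers(text: str) -> str:
    """Replace each integer or decimal in the text with its spoken English form."""
    def replace(match: re.Match) -> str:
        token = match.group()
        value = float(token) if "." in token else int(token)
        return num2words(value, lang="en")
    return re.sub(r"\d+(?:\.\d+)?", replace, text)

# Example (made-up forecast line):
print(spell_out_numbers("Solar flux 142, A index 8, K index 3"))
# -> "Solar flux one hundred and forty-two, A index eight, K index three"
```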

Some generations are fine, others are completely broken, and you never know which you'll get.

Here is a pretty terrible example.

in reply to Steve

This is the thing: if they were even a little transparent about what exactly they're doing under the hood, I could come up with at least an educated guess as to why it struggles so much with numbers and other symbols, but I have no idea. I'm pretty sure the model they're using is some kind of LLM architecture, similar to many recently released open-source TTS models, and those are notorious for struggling with numbers and other symbols unless text normalization is done first. I'll be honest, Eleven is actually behind the competition from the likes of Amazon and Microsoft, and many of the open-source options not only sound just as good, but are much more customizable as well.
in reply to Borris

@sclower Well, you'll certainly have a lot of options to pick from. Something that could be fun and potentially useful is to try fine-tuning a model on the type of content you're going to be generating; performance will likely improve drastically. These LLM-based models are much easier to fine-tune than traditional ML TTS, and I was able to train an Orpheus TTS model on my own voice in about ten minutes.