🇨🇦Samuel Proulx🇨🇦

3 weeks ago

🇨🇦Samuel Proulx🇨🇦
3 weeks ago

The State of Modern AI Text To Speech Systems for Screen Reader Users: The past year has seen an explosion in new text to speech engines based on neural networks, large language models, and machine learning. But has any of this advancement offered anything to those using screen readers? stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html#ai#tts#llm#accessibility#a11y#screenreaders

Sam's Stuff - The State of Modern AI Text To Speech Systems for Screen Reader Users

^{stuff.interfree.ca}

#a11y #Accessibility #AI #screenreaders #llm #tts

reshared this

in reply to 🇨🇦Samuel Proulx🇨🇦

Landon

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Oh god, what if Chatgpt gets intergrateded into nvda?

in reply to Landon

🇨🇦Samuel Proulx🇨🇦

in reply to Landon 3 weeks ago

@Landon205 There's already addons that do that.

@Landon

in reply to 🇨🇦Samuel Proulx🇨🇦

PepperTheVixen ΘΔ

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

I've started using eSpeak-ng. It's grating, but I can crank the speed up way higher than any other TTS I've ever used, especially the fancy AI shit that simulates breath draws and lip movement

in reply to PepperTheVixen ΘΔ

🇨🇦Samuel Proulx🇨🇦

in reply to PepperTheVixen ΘΔ 3 weeks ago

@PepperTheVixen The reason it's grating is because unlike Eloquence and dectalk, Espeak only uses formant synthesis for the vowel sounds. For consonants and plosives, it instead uses concatenative recordings based on human speech. That's why even when you switch to a voice that sounds less sharp, the "t", "b", "p", and other sounds are still too sharp. This seems to be the primary cause of the fatigue most people experience while using ESpeak.

@PepperTheVixen ΘΔ

in reply to PepperTheVixen ΘΔ

Devin Prater :blind:

in reply to PepperTheVixen ΘΔ 3 weeks ago

Lol just imagining an AI voice with lip smacking noises.

in reply to Devin Prater :blind:

🇨🇦Samuel Proulx🇨🇦

in reply to Devin Prater :blind: 3 weeks ago

@pixelate@PepperTheVixen If you give chatterbox-tts an ASMR recording to clone, you can absolutely get it to make lip smacking noises.

@PepperTheVixen ΘΔ @Devin Prater :blind:

in reply to 🇨🇦Samuel Proulx🇨🇦

Devin Prater :blind:

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Oh my goodness. Or even better, an AI voice chewing gum.

in reply to Devin Prater :blind:

🇨🇦Samuel Proulx🇨🇦

in reply to Devin Prater :blind: 3 weeks ago

@pixelate@PepperTheVixen If you have a sample of someone talking while chewing gum, you can absolutely make that happen.

@PepperTheVixen ΘΔ @Devin Prater :blind:

in reply to 🇨🇦Samuel Proulx🇨🇦

Amir

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

What an interesting read! Needless to say, I read it with Eloquence - LOL!

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to Amir

Sean Randall

in reply to Amir 3 weeks ago

It's crazy that everyone is layering it in wrappers nowadays.
Do you know if codefactory are doing the same with their new android build?

in reply to Sean Randall

Andre Louis

in reply to Sean Randall 3 weeks ago

I sincerely hope someone will do the same for Orpheus. I'd even pay for it.

in reply to Andre Louis

James Scholes

in reply to Andre Louis 3 weeks ago

There is a 32-bit compatibility layer in the works for NVDA itself (although it currently only references SAPI4). But with any luck the need for every add-on to implement its own will go away.

github.com/nvaccess/nvda/pull/…

@cachondo @amir @fastfinge

Support for SAPI4 via a 32 bit shim runtime by michaelDCurran · Pull Request #19412 · nvaccess/nvda

Link to issue number: Summary of the issue: 64 bit NVDA can no longer support SAPI4. Description of user facing changes: SAPI4 is again availble on 64 bit NVDA. Description of developer facing cha...

^GitHub

@Sean Randall @🇨🇦Samuel Proulx🇨🇦 @Amir

in reply to James Scholes

Sean Randall

in reply to James Scholes 3 weeks ago

It does seem incredible to cut every 32 bit thing out so suddenly.

in reply to Sean Randall

🇨🇦Samuel Proulx🇨🇦

in reply to Sean Randall 3 weeks ago

@cachondo@jscholes@FreakyFwoof@amir They don't have much choice. A lot of the libraries NVDA depends on are stopping 32-bit support this year.

@James Scholes @Sean Randall @Andre Louis @Amir

in reply to 🇨🇦Samuel Proulx🇨🇦

Sean Randall

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

I guess if this had happened a decade ago it'd have excited me. I'm obviously getting too old!

in reply to James Scholes

🇨🇦Samuel Proulx🇨🇦

in reply to James Scholes 3 weeks ago

@jscholes@cachondo@FreakyFwoof@amir My understanding is that when this comes to addons, it's going to require some kind of secure addons API/layer. And it won't be ready for 2026.1, or maybe not even 2026.2.

@James Scholes @Sean Randall @Andre Louis @Amir

in reply to 🇨🇦Samuel Proulx🇨🇦

James Scholes

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Where are you getting the first part of that understanding from? I.e. the dependence on the secure add-on runtime. @cachondo @FreakyFwoof @amir

@Sean Randall @Andre Louis @Amir

in reply to James Scholes

🇨🇦Samuel Proulx🇨🇦

in reply to James Scholes 3 weeks ago

@jscholes@cachondo@FreakyFwoof@amir It was mentioned in the roadmap NVDA released a while back.

@James Scholes @Sean Randall @Andre Louis @Amir

in reply to 🇨🇦Samuel Proulx🇨🇦

James Scholes

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

I see the "Secure add-on runtime" on the roadmap, with the note that "The first version of the runtime will provide support for speech synthesis and braille devices."

I don't see any implication that any 32-bit compatibility layer will only work for secure add-ons, which is hopefully a bit of a leap.

Still, the fact that people don't know what will or won't be happening, or whether their preferred synthesiser(s) will work or not, continues to be a big part of the problem. @cachondo @FreakyFwoof @amir

@Sean Randall @Andre Louis @Amir

in reply to James Scholes

🇨🇦Samuel Proulx🇨🇦

in reply to James Scholes 3 weeks ago

@jscholes@cachondo@FreakyFwoof@amir That's my assumption because the only things that really need a 32-bit compatibility layer are speech synthesizers and braille devices.

@James Scholes @Sean Randall @Andre Louis @Amir

in reply to 🇨🇦Samuel Proulx🇨🇦

Cleverson

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Regarding ESpeak-ng, AFAIC, the main complaint from users is its base tone, which cannot be solved by simply making new variants. In this regard, how about improving its MBrola voices?

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to 🇨🇦Samuel Proulx🇨🇦

Cleverson

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

And what about recording new voices for RHVoice?

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to Cleverson

🇨🇦Samuel Proulx🇨🇦

in reply to Cleverson 3 weeks ago

@clv1@jscholes@cachondo@FreakyFwoof@amir The issue is that both of these are effectively concatenative or parametric, rather than formant, systems. So they will never be as intelligible as eloquence.

@James Scholes @Sean Randall @Andre Louis @Amir @Cleverson

in reply to Andre Louis

🇨🇦Samuel Proulx🇨🇦

in reply to Andre Louis 3 weeks ago

@FreakyFwoof@cachondo@amir You should be able to get either Gemini or Codex to help you, depending on what AI you have access to. The workflow would be:
1. download gemini-cli or codex-cli, and get them installed and configured.
2. clone all of the sourcecode from github.com/fastfinge/eloquence_64/3. Delete the tts.txt and tts.pdf files, so you don't confuse it with incorrect documentation.
4. Find any API documentation for orphius that's available, and add it into the folder.
4. Run codex-cli or gemini-cli, and tell it something like: "Using the information about how to develop NVDA addons you can find in agents.md, and the information about the Orphius API I've provided in the file Orphius-documentation-filename.txt, I would like you to modify the code in this folder to work with Orpheus instead of eloquence."

It will go away for five or ten minutes, ask you for permission to read and write the files it's interested in, and then give you something that mostly works. Now, build the addon, run it, and tell it about the errors and problems you have and ask it to fix them. In the case of errors, include the error right from the NVDA log, and for bugs and problems, tell it exactly what it's doing wrong, and exactly what you want it to do instead. Keep doing this until you wind up with a working addon.

Think of AI as a particularly stupid programmer, and you're the manager in charge of the project. You should be able to get this done without paying anyone.

GitHub - fastfinge/eloquence_64: Eloquence synthesizer NVDA add-on compatible with 64-bit versions of NVDA

Eloquence synthesizer NVDA add-on compatible with 64-bit versions of NVDA - fastfinge/eloquence_64

^GitHub

@Sean Randall @Andre Louis @Amir

johann reshared this.

in reply to 🇨🇦Samuel Proulx🇨🇦

Andre Louis

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Well there's already a 32-bit addon for Orpheus floating about. I'd still rather pay someone competent to do it, even if they use AI. Proper programming terms would help narrow down the broken bits. I'm just an audio guy.

in reply to Andre Louis

🇨🇦Samuel Proulx🇨🇦

in reply to Andre Louis 3 weeks ago

@FreakyFwoof@cachondo@amir Yeah, you can get AI to modify the 32-bit addon for you. That's how I got the first two eloquence prototypes; it helped me understand the problem and what approaches would work and what wouldn't. If you give it the 32-bit orphius addon, and the 64-bit eloquence addon, it should be able to understand the working approach to make an addon 64-bit, and make the modifications itself. The reason to give it the 64-bit eloquence addon as an example is so it doesn't decide to go down the GRPC route and include protobuf and a bunch of other nonsense.

@Sean Randall @Andre Louis @Amir

in reply to Andre Louis

Hamish

in reply to Andre Louis 3 weeks ago

Oh happy days 😊 that was the voice that used to come with the Hal ScreenReader isn't it? That was my first ScreenReader after my accident back in 1996 and I seem to remember the plug-in synth was called something like Apollo two or thereabouts such happy memories 🙂 but not really I used to sit up till about 4 am banging my head against the brick wall trying to figure it out but hey ho

in reply to Andre Louis

Luis Carlos

in reply to Andre Louis 3 weeks ago

And also for Kokoro and other speech synths.

in reply to Luis Carlos

🇨🇦Samuel Proulx🇨🇦

in reply to Luis Carlos 3 weeks ago

@luiscarlosgonzalez@cachondo@FreakyFwoof@amir I didn't try Kokoro, because it cannot achieve a real time factor of 1 on CPU. By that I mean, to be fit for consideration with a screen reader, a text to speech voice must be able to generate one second of speech in one second or faster. In general, Kokoro takes two seconds to generate one second of speech. So it's not suitable.

@Sean Randall @Luis Carlos @Andre Louis @Amir

in reply to Andre Louis

Luis Carlos

in reply to Andre Louis 3 weeks ago

What even about Tortoise or Cookie or that synth I don't know of

in reply to Luis Carlos

🇨🇦Samuel Proulx🇨🇦

in reply to Luis Carlos 3 weeks ago

@luiscarlosgonzalez@cachondo@FreakyFwoof@amir It has the same problem with speed.

@Sean Randall @Luis Carlos @Andre Louis @Amir

in reply to Sean Randall

🇨🇦Samuel Proulx🇨🇦

in reply to Sean Randall 3 weeks ago

@cachondo@amir I've heard from a second hand source that they are, yes. But I haven't verified that.

@Sean Randall @Amir

in reply to 🇨🇦Samuel Proulx🇨🇦

Andre Louis

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Supertonic works way better on my 2012 machine than Kitn. That one just cuts off things all the time and just doesn't sound good at all. I'm pretty sure it's not supposed to do that. Very weird.

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to 🇨🇦Samuel Proulx🇨🇦

Chris Smart

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Malware Bytes advises me not to visit this page, and that's after I paste the URL in my browser because Tweesecake doesn't recognize it as a URL. :)

in reply to Chris Smart

🇨🇦Samuel Proulx🇨🇦

in reply to Chris Smart 3 weeks ago

@VE3RWJ Shrug. Nobody else has reported that issue. Probably a false positive from malwarebites.

@Chris Smart

in reply to 🇨🇦Samuel Proulx🇨🇦

Zach Bennoui

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Really interesting article. I'm particularly passionate about this subject, I've been fascinated with TTS for a number of years. I've trained many voices, both for Piper and some of the newer LLM based systems, and while I can't speak to the speed issue, training data is extremely important.

What you feed into these models has a big impact on the voice's performance overall. If you give it stuff scrape from the web, random audiobooks that weren't optimized for TTS, things like that, you're not going to get good results for the type of work screen reader users do every day. This applies to all of these systems, not even just neural networks. The latency / responsiveness issue is something we'll have to solve at some point, because I don't think using TTS systems last updated in 2003 is going to work out in the longterm, as much as I love Eloquence.

In my ideal world, we would have either a machine learning based or formant system that is easy to train / maintain. Big companies have lost interest in on device TTS, not even just for screen reader users. Many of the solutions being put out now are cloud based, and while developers are still creating on device models, as said in the article, they're not optimized for our needs and may never be. I think we have to take matters into our own hands and figure this out, but I believe with enough people we can make it happen.

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to Zach Bennoui

🇨🇦Samuel Proulx🇨🇦

in reply to Zach Bennoui 3 weeks ago

@ZBennoui Agreed. I think blast bay is close to the right track. If only it was open and the issues pronouncing words were fixed. The speed and sound of the voices are top notch.

@Zach Bennoui

in reply to Zach Bennoui

🇨🇦Samuel Proulx🇨🇦

in reply to Zach Bennoui 3 weeks ago

@ZBennoui We need a good formant system. Machine learning is useful for setting the model parameters. But I think the word to phoneme rules can’t be a neural network, because they have to be reproducible and modifiable. Even here though, machine learning could help though. I’d love a system where a user could submit a recording of a word, and the system could create the phonetic representation.

@Zach Bennoui

in reply to 🇨🇦Samuel Proulx🇨🇦

Zach Bennoui

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

Yeah I completely agree, I happen to know Philip and have been talking with him extensively about his experiments with TTS. I can't go into a ton of detail, but I'll say what he said publicly. The system he's using is a hybrid approach of neural networks and formant synthesis, where he trains a model to output formant frequencies based on the audio data he feeds into it. I won't pretend to understand all the details, this is way above my pay grade, but as far as I understand this has never been done before by another developer.

This entry was edited (3 weeks ago)

in reply to Zach Bennoui

🇨🇦Samuel Proulx🇨🇦

in reply to Zach Bennoui 3 weeks ago

@ZBennoui Yup. I just wish he wasn’t also trying to train his own phonemizer, because I really believe that has to be reproducible and modifiable for users. I’ve swapped multiple emails with him about an NVDA addon. But he’s pretty set on sapi for now until things stabilize both on the NVDA side and on his side.

@Zach Bennoui

in reply to Zach Bennoui

Cleverson

in reply to Zach Bennoui 3 weeks ago

@amir I too think we can make it happen by taking matters into our hands. I don't know how to code, but I'm at disposal to work on Portuguese language support, e.g. improving pronounciation rules, when time comes.

@Amir

🇨🇦Samuel Proulx🇨🇦 likes this.

in reply to 🇨🇦Samuel Proulx🇨🇦

Chi Kim

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

@FreakyFwoof The major problem with incorporating AI TTS into screen readers is latency. Maybe you can use it for say all, but it is not suitable for navigation. There are a couple of tiny TTS with low latency like Piper TTS, but the quality is not the best. Also multi lingual support and pronunciation for many uncommon words are issues for AI tts.

@Andre Louis

in reply to Chi Kim

🇨🇦Samuel Proulx🇨🇦

in reply to Chi Kim 3 weeks ago

@chikim@FreakyFwoof That's one of the problems, yes.

@Chi Kim @Andre Louis

in reply to 🇨🇦Samuel Proulx🇨🇦

Jayson Smith

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

I assume you didn't mention the modern efforts with regard to DECtalk due to the legal situation with that source code being about as clear as mud?

in reply to Jayson Smith

🇨🇦Samuel Proulx🇨🇦

in reply to Jayson Smith 3 weeks ago

@jaybird110127 Yup. That and the fact that the sourcecode isn't getting updated; just getting it to keep compiling is a huge effort. There is a 64-bit build, but it doesn't actually work. I consider dectalk pretty much dead, even though the source is available.

@Jayson Smith

in reply to 🇨🇦Samuel Proulx🇨🇦

Paul L

in reply to 🇨🇦Samuel Proulx🇨🇦 3 weeks ago

isn't it possible to "pregenerate" the speech with all the necessary IDs so that you can navigate and interrupt at will?
Just as one generates SSML from rich text (including maths formulas) before generating speech.

It would even be better to catch intonations, breaths and others, unchanged instead of letting the TTS generating a "pleasant full phrase" (a wrong expectation).

I find your post intriguingly close to the emerging reaction against the Ai-generated #mundaneslop .

#mundaneslop

in reply to Paul L

🇨🇦Samuel Proulx🇨🇦

in reply to Paul L 3 weeks ago

@polx Maybe, but probably not. Doing that would result in a lot of wasted resources generating text I'm never going to listen to. Think about the average user interface: dozens of menus, and toolbars, and ads, and comments, and so on. Plus, the text changes constantly, on even simple websites. That's not even taking into account websites that just scroll constantly. It might be possible to create some kind of algorithm to predict the most likely text I'll want next, but now we've just added another AI on top of the first AI.

I think a better solution might be to make the text to speech system run on different hardware from the computer itself. This is, in fact, how text to speech was done in the past, before computers had multi-channel soundcards. This has a few advantages. First, even if the computer itself is busy, the speech never crashes or falls behind. Second, if the computer crashes, it could be possible to actually read out the last error encountered. Third, specialized devices could be perhaps more power and CPU efficient.

The reason text to speech systems became software, instead of hardware, is largely because of cost. It's much cheaper to just download and install a program than it is to purchase another device. Also, it means you don't have to carry around another dongle and plug it into the computer.

@Paul L

in reply to 🇨🇦Samuel Proulx🇨🇦

Matthew

in reply to 🇨🇦Samuel Proulx🇨🇦 2 weeks ago

I have been developing a neural TTS system, focused on screen reading for many months, which offers instant responsiveness, but maintains good synthesis quality at the same time. And, BTW, it is not recommended at all to use espeak as a phonemizer backend as breaks the text embeddings during model training, especially if we use linguistic information. And, please consider to avoid overeading NVDA's python environment in your add-ons.

in reply to Matthew

🇨🇦Samuel Proulx🇨🇦

in reply to Matthew 2 weeks ago

@rmcpantoja Yes, the only way to avoid messing with the NVDA Python environment would be to do an IPC server. But at that point, you're really just rewriting SAPI and it seems pointless.

@Matthew

in reply to Matthew

🇨🇦Samuel Proulx🇨🇦

in reply to Matthew 2 weeks ago

@rmcpantoja The issue with not using Espeak is that it makes it impossible to have user dictionaries. When we use a neural network, linguistic rules are no longer deterministic. So it might say a word correctly with one voice, at one time, but not with another voice, or at another time. This makes it impossible for us to correct mispronounced words in a reliable way.

@Matthew

in reply to 🇨🇦Samuel Proulx🇨🇦

Matthew

in reply to 🇨🇦Samuel Proulx🇨🇦 2 weeks ago

My results shows that a dedicated IPC server performs faster, E.G the synthDriver is ready for use in just 5 seconds; response is good even in longer sentences, but this can be attributed to the 4.2m model I'm using. And when I ran the model through a streaming vocoder, response is surprisingly realtime, suitable for screen reader. As for voice rate, I'm using a modification of the good "audiostretchy" pip package. I can't give more details ATM, but I hope this helps in your research

in reply to Matthew

🇨🇦Samuel Proulx🇨🇦

in reply to Matthew 2 weeks ago

@rmcpantoja Also, I'd love to hear if and when you release anything!

@Matthew

in reply to 🇨🇦Samuel Proulx🇨🇦

shortwavesurfer2009

in reply to 🇨🇦Samuel Proulx🇨🇦 5 days ago

github.com/RHVoice/RHVoice

GitHub - RHVoice/RHVoice: a free and open source speech synthesizer for Russian and other languages

a free and open source speech synthesizer for Russian and other languages - RHVoice/RHVoice

^GitHub

in reply to shortwavesurfer2009

🇨🇦Samuel Proulx🇨🇦

in reply to shortwavesurfer2009 5 days ago

@shortwavesurfer2009 The problem with RHVoice is that it's concatinative (with some hidden markov model tricks for phoneme selection). So it will never be as easy to understand at extremely high rates of speed as eloquence.

@shortwavesurfer2009

⇧

🇨🇦Samuel Proulx🇨🇦 3 weeks ago • •

🇨🇦Samuel Proulx🇨🇦
3 weeks ago