The State of Modern AI Text To Speech Systems for Screen Reader Users: The past year has seen an explosion in new text to speech engines based on neural networks, large language models, and machine learning. But has any of this advancement offered anything to those using screen readers? stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html #ai #tts #llm #accessibility #a11y #screenreaders
Sean Randall
in reply to Amir • • •Do you know if CodeFactory are doing the same with their new Android build?
James Scholes
in reply to Andre Louis • • •
in reply to Andre Louis • • •There is a 32-bit compatibility layer in the works for NVDA itself (although it currently only references SAPI4). But with any luck the need for every add-on to implement its own will go away.
github.com/nvaccess/nvda/pull/…
@cachondo @amir @fastfinge
Support for SAPI4 via a 32 bit shim runtime by michaelDCurran · Pull Request #19412 · nvaccess/nvda
James Scholes
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •I see the "Secure add-on runtime" on the roadmap, with the note that "The first version of the runtime will provide support for speech synthesis and braille devices."
I don't see any implication that the 32-bit compatibility layer will only work for secure add-ons; hopefully that worry is a bit of a leap.
Still, the fact that people don't know what will or won't be happening, or whether their preferred synthesiser(s) will work or not, continues to be a big part of the problem. @cachondo @FreakyFwoof @amir
🇨🇦Samuel Proulx🇨🇦
in reply to Andre Louis • • •@FreakyFwoof @cachondo @amir You should be able to get either Gemini or Codex to help you, depending on what AI you have access to. The workflow would be:
1. Download gemini-cli or codex-cli, and get them installed and configured.
2. Clone all of the source code from github.com/fastfinge/eloquence_64/
3. Delete the tts.txt and tts.pdf files, so you don't confuse it with incorrect documentation.
4. Find any API documentation for Orpheus that's available, and add it into the folder.
5. Run codex-cli or gemini-cli, and tell it something like: "Using the information about how to develop NVDA add-ons you can find in agents.md, and the information about the Orpheus API I've provided in the file Orpheus-documentation-filename.txt, I would like you to modify the code in this folder to work with Orpheus instead of Eloquence."
It will go away for five or ten minutes, ask you for permission to read and write the files it's interested in, and then give you something that mostly works. Now, build the add-on, run it, tell it about the errors and problems you have, and ask it to fix them. In the case of errors, include the error right from the NVDA log; for bugs and problems, tell it exactly what it's doing wrong and exactly what you want it to do instead. Keep doing this until you wind up with a working add-on.
Think of the AI as a particularly stupid programmer, with you as the manager in charge of the project. You should be able to get this done without paying anyone. For a sense of the target, a rough sketch of the kind of synth driver skeleton involved follows after the repository link.
GitHub - fastfinge/eloquence_64: Eloquence synthesizer NVDA add-on compatible with 64-bit versions of NVDA
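As a rough sketch of what the AI would be producing, here is the general shape of an NVDA synth driver. The SynthDriver subclass, check(), speak(), cancel(), terminate(), and the rate property follow NVDA's standard synthDriverHandler pattern; the orpheus.dll file and every orpheus_* call are placeholders invented purely for illustration, since the real entry points would have to come from whatever Orpheus documentation you gather in step 4.

```python
# Minimal sketch of an NVDA synth driver, assuming a hypothetical orpheus.dll.
# The class layout is the standard shape NVDA expects from a module placed in
# synthDrivers/; every orpheus_* call below is a placeholder, not a real API.
import ctypes
import os

from synthDriverHandler import SynthDriver


class SynthDriver(SynthDriver):
    # Internal name and the label NVDA shows in the synthesizer list.
    name = "orpheus"
    description = "Orpheus (sketch)"
    supportedSettings = (SynthDriver.RateSetting(),)

    @classmethod
    def _dllPath(cls):
        return os.path.join(os.path.dirname(__file__), "orpheus.dll")

    @classmethod
    def check(cls):
        # Only offer this driver when the (hypothetical) engine DLL is present.
        return os.path.isfile(cls._dllPath())

    def __init__(self):
        super().__init__()
        # Placeholder initialisation; the real exports, arguments, and audio
        # handling would have to come from the engine's own documentation.
        self._dll = ctypes.windll.LoadLibrary(self._dllPath())
        self._dll.orpheus_initialize()
        self._rate = 50

    def speak(self, speechSequence):
        # Flatten the speech sequence to plain text. A complete driver would
        # also handle IndexCommand items and notify synthDoneSpeaking so NVDA
        # can track how far speech has progressed.
        text = "".join(item for item in speechSequence if isinstance(item, str))
        self._dll.orpheus_speak(text.encode("utf-8"))

    def cancel(self):
        # Called constantly as the user navigates; must stop speech quickly.
        self._dll.orpheus_stop()

    def terminate(self):
        self._dll.orpheus_shutdown()

    def _get_rate(self):
        return self._rate

    def _set_rate(self, value):
        self._rate = value
        self._dll.orpheus_set_rate(value)
```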
Zach Bennoui
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •Really interesting article. I'm particularly passionate about this subject, I've been fascinated with TTS for a number of years. I've trained many voices, both for Piper and some of the newer LLM based systems, and while I can't speak to the speed issue, training data is extremely important.
What you feed into these models has a big impact on the voice's performance overall. If you give it stuff scrape from the web, random audiobooks that weren't optimized for TTS, things like that, you're not going to get good results for the type of work screen reader users do every day. This applies to all of these systems, not even just neural networks. The latency / responsiveness issue is something we'll have to solve at some point, because I don't think using TTS systems last updated in 2003 is going to work out in the longterm, as much as I love Eloquence.
In my ideal world, we would have either a machine learning based or formant system that is easy to train / maintain. Big companies have lost interest in on device TTS, not even just for screen reader users. Many of the solutions being put out now are cloud based, and while developers are still creating on device models, as said in the article, they're not optimized for our needs and may never be. I think we have to take matters into our own hands and figure this out, but I believe with enough people we can make it happen.
Paul L
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •
in reply to 🇨🇦Samuel Proulx🇨🇦 • • •isn't it possible to "pregenerate" the speech with all the necessary IDs so that you can navigate and interrupt at will?
Just as one generates SSML from rich text (including maths formulas) before generating speech.
It would even be better to catch intonations, breaths and others, unchanged instead of letting the TTS generating a "pleasant full phrase" (a wrong expectation).
I find your post intriguingly close to the emerging reaction against the AI-generated #mundaneslop.
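A rough illustration of that pregeneration idea: batch the chunks of text a screen reader is about to queue into one SSML document, dropping a named <mark> in front of each chunk, so an engine that honours SSML marks can report positions back and the client can interrupt or jump between chunks. The chunk list and the downstream engine here are assumptions for illustration only.

```python
# Sketch: wrap UI text chunks in SSML with named marks so playback position
# can be mapped back to chunks for interruption and navigation.
from xml.sax.saxutils import escape


def chunks_to_ssml(chunks):
    """Build one SSML document with a <mark> before each text chunk."""
    parts = ["<speak>"]
    for i, text in enumerate(chunks):
        parts.append(f'<mark name="chunk-{i}"/>')
        parts.append(escape(text))
    parts.append("</speak>")
    return "".join(parts)


# Example: three items a screen reader might queue while walking a dialog.
ssml = chunks_to_ssml(["OK button", "Cancel button", "File name: report.txt"])
print(ssml)
```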
🇨🇦Samuel Proulx🇨🇦
in reply to Paul L • • •@polx Maybe, but probably not. Doing that would result in a lot of wasted resources generating text I'm never going to listen to. Think about the average user interface: dozens of menus, and toolbars, and ads, and comments, and so on. Plus, the text changes constantly, on even simple websites. That's not even taking into account websites that just scroll constantly. It might be possible to create some kind of algorithm to predict the most likely text I'll want next, but now we've just added another AI on top of the first AI.
I think a better solution might be to make the text to speech system run on different hardware from the computer itself. This is, in fact, how text to speech was done in the past, before computers had multi-channel soundcards. This has a few advantages. First, even if the computer itself is busy, the speech never crashes or falls behind. Second, if the computer crashes, it could still be possible to read out the last error encountered. Third, specialized devices could perhaps be more power- and CPU-efficient.
The reason text to speech systems became software, instead of hardware, is largely because of cost. It's much cheaper to just download and install a program than it is to purchase another device. Also, it means you don't have to carry around another dongle and plug it into the computer.
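As a sketch of that off-board approach, the host machine never does any synthesis; it just pushes text down a link to a dedicated speech box, so speech keeps flowing even when the main CPU is pegged. The pyserial transport, the COM3 port name, and the newline-terminated protocol below are all stand-ins for whatever a real device would actually define.

```python
# Sketch: send text to a hypothetical serial-attached speech device,
# roughly the way classic external hardware synthesizers were driven.
import serial  # pyserial


def speak_on_external_box(text, port="COM3", baud=9600):
    """Send one line of text to a (hypothetical) serial speech device."""
    with serial.Serial(port, baud, timeout=1) as link:
        # The device is assumed to speak each newline-terminated line it receives.
        link.write(text.encode("utf-8") + b"\n")


speak_on_external_box("Error: the application has stopped responding.")
```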