ok. ok so whoever created the @GrapheneOS tts engine....... you were optimising for maps and sighted user tts, weren't you? because this is soooooo sloooow. so much latency
in reply to solo

in reply to Ra (Freyja) (it/its)

@solonovamax I got harassed for being plural and a :neobot_floof: (not a human), with their band of Nazis from places like poa.st harassing me on their behalf after I defederated them (had to shoot down like 10 instances + a mastosoc army) :> I spoke briefly in their chatrooms, and they got pissy I used plural pronouns and said I'm not a human.
@solo
This entry was edited (Tuesday, May 26, 2026, 4:24 AM)
in reply to solo

from matrix

(them) That's very low quality though, and our feedback from blind users indicated that old TTS methods like that aren't usable (such as RHVoice).

See social.highenergymagic.net/@Gr…

It will be improved, but there's a starting point for everything. We didn't want to delay it too much, but the current target was similar latency to Google's Speech Recognition & Synthesis, but perhaps it isn't as fast at higher speeds?

Let's focus on actionable goals! Where could the latency be improved? Is it perhaps too slow at very high speeds?

(me) maybe a setting could be offered so that users can choose which they want?

(them) Maybe! Configuring the amount of "effort" to make it faster was something we wanted but it wasn't working out initially and we didn't want to delay the release for that.

But what are the exact latency issues? We need clear targets.

for example, currently the TTFA (time-to-first-audio) is ~150 ms on a Pixel 8a.

This entry was edited (Tuesday, May 26, 2026, 5:21 AM)
in reply to Ra (Freyja) (it/its)
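For anyone who wants to report numbers against the TTFA figure quoted above, here is a minimal sketch of how time-to-first-audio could be measured. It assumes the engine can be wrapped as a function that yields audio chunks as they are produced; `fake_engine` and its 50 ms delay are hypothetical stand-ins, not the GrapheneOS engine or any real API.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return seconds from the request until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:  # skip empty chunks; only audible data counts
            return time.perf_counter() - start
    raise RuntimeError("engine produced no audio")


def fake_engine(text: str) -> Iterator[bytes]:
    """Hypothetical streaming engine used only to exercise the timer."""
    time.sleep(0.05)          # pretend the first chunk takes about 50 ms
    yield b"\x00\x00" * 160   # one chunk of silent 16-bit PCM
    yield b"\x00\x00" * 160


ttfa = time_to_first_audio(fake_engine, "Settings, button")
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

Measuring this way only covers the engine itself. Any latency added by the screen reader or the Android accessibility stack would sit on top of it and needs to be measured separately.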

No, performance was a major focus for it. It already has much lower latency than the existing options we were comparing against and there's huge room for improvement. It's the first public experimental release of the software and it has quirks to resolve including treating newlines as significant where reading newlines takes time and creates a barely audible sound. It doesn't use any form of hardware acceleration yet either. You're expecting too much from the first experimental release.
in reply to Ra (Freyja) (it/its)

It's a different type of software and is meant to become a competitive alternative to Google Speech Recognition & Synthesis. It has a lot of room for improvement including simply not reading newlines which can waste a lot of time if the text being passed to it doesn't have the whitespace stripped down.

It doesn't use hardware acceleration yet and was trained on an AMD RX 6600 prior to obtaining an RTX 5090. It can't be expected to keep up with Google's yet but it will get better.
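The newline quirk described above can be worked around on the caller's side until the engine handles it. A minimal sketch, assuming the text is available as a plain string before it is handed to the engine (`normalize_for_speech` is a hypothetical helper name, not part of any GrapheneOS or Android API):

```python
import re

# One or more whitespace characters of any kind, including newlines and tabs.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_for_speech(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(normalize_for_speech("Wi-Fi\n\n\nOn\t \n"))  # prints "Wi-Fi On"
```

This throws away paragraph breaks entirely, which is fine for short UI labels but may not be what you want for long-form reading, where a break can reasonably map to a pause.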

in reply to GrapheneOS

the problem is not how it was trained, the problem is that it's a neural tts voice. that is always going to have very high latency, and you're also fighting with the android accessibility latency. please. contract with someone to develop something like dectalk or espeak. no training data, just raw waveform synthesis
in reply to GrapheneOS

It's already good enough for a lot of use cases and blind users can now use a fresh install of GrapheneOS without help. It's good enough to get through the setup wizard to install Google Speech Recognition & Synthesis or eSpeak NG.

It matters a lot which device you're testing on since it might be twice as fast on a Pixel 10 as on a Pixel 6. It solely runs on the CPU right now and Pixels don't have a great CPU. It will benefit a lot from hardware acceleration, especially on Pixels.

in reply to GrapheneOS

a tts engine should not require hardware acceleration. it is a bloody waveform generator, but you're just assuming tts engines have to be neural-based. they don't. look at DecTalk, look at IBM TTS / Eloquence, look at something like that
in reply to Ra (Freyja) (it/its)

Our goal isn't implementing one of those. We want to have a natural voice capable of reading a lot more text well. It's entirely possible to make it perform very well. Tensor and Snapdragon both dedicate a lot of hardware to neural network acceleration. Pixels don't have competitive CPU performance and haven't focused on it. A major part of why recent Pixels don't have a better GPU is they dedicated a lot of space to the huge TPU and GrapheneOS doesn't yet use it beyond image processing.
in reply to GrapheneOS

then you're optimising for sighted users. ebook readers, map apps, that kind of thing. completely incorrect tts framework for this. a tts engine should not use the GPU, TPU, or anything other than the CPU. if the Apple IIe can do it with an Echo card in 1985, yall can do it without leaning on weird neural tts shit
in reply to MaddieM4

@MaddieM4 Many blind users have told us they use Google's Speech Recognition & Synthesis. Why isn't it possible to provide a competitive open source implementation of what they provide? They're incredibly understaffed and make lots of bad decisions. We can use more bleeding edge technology than they can if we put resources into it.

Now a blind user can start with a fresh GrapheneOS install and set it up themselves without needing help from anyone including installing their preferred TTS.

in reply to GrapheneOS

@MaddieM4 because google can run at 600+% speech rate, and yours can't. neural tts is always going to fucking suck anyway, why did yall go for that and not just do your own clone of espeak?
in reply to Ra (Freyja) (it/its)

@MaddieM4 Many blind users told us they use Google's Speech Recognition & Synthesis and are happy with it. We're entirely capable of making ours perform as well as that. The first experimental release of our app isn't going to be competitive with theirs yet. It has a lot of straightforward bugs and performance issues to resolve.

We built our own network-based location too and we keep improving it. It went from barely working to extremely competitive. Our latest release improved it more.

in reply to GrapheneOS

very lengthy persuasive argument to reframe this as an engineering tradeoff, not hating or being negative at all, i love you, no response needed

in reply to GrapheneOS

It's not even the neural network where most of the CPU time is being spent right now. We used an AndroidX API for implementing speed and pitch configuration using the Java variant of the Sonic library. It also has a C variant which we could use directly. We could also likely find a more optimized library instead. Once that's much faster, neural network processing should become where most time is spent. We can use hardware acceleration available on every current/future device.
This entry was edited (Wednesday, May 27, 2026, 2:46 PM)
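The claim above, that most CPU time goes to the speed and pitch step rather than the neural network, is the kind of thing a per-stage timer can confirm or refute. A minimal sketch follows; `run_model` and `adjust_speed_and_pitch` are hypothetical stand-ins whose sleeps only simulate relative cost, and nothing here reflects the real app's code or the Sonic API.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}


# Hypothetical stand-ins for the two stages discussed in the thread.
def run_model(text):
    time.sleep(0.005)
    return [0.0] * 1600


def adjust_speed_and_pitch(samples, speed):
    time.sleep(0.03)
    return samples[: int(len(samples) / speed)]  # placeholder, not real time-stretching


timer = StageTimer()
for utterance in ["Back, button", "Wi-Fi, switch, on", "Next, button"]:
    with timer.stage("model"):
        samples = run_model(utterance)
    with timer.stage("speed_pitch"):
        adjust_speed_and_pitch(samples, speed=2.5)

for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%} of measured time")
```

Running the same breakdown at several speech rates would show whether the bottleneck shifts as the rate goes up, which is the case the blind users in this thread care about.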
in reply to GrapheneOS

@hipsterelectron A large portion of every Tensor and Snapdragon SoC is dedicated to accelerating neural networks. It's a major factor in why Pixels don't have a much more powerful GPU because they dedicated die space to the TPU instead. GrapheneOS currently only uses the TPU as part of certain image/video processing functionality. HDR+ is mostly implemented in a more direct way but HDRnet processing for videos uses it including for real time processing of the video shown as a camera preview.
in reply to GrapheneOS

@hipsterelectron We aren't only implementing text-to-speech for English (US). We're implementing both text-to-speech and speech-to-text for at least around 10 languages. That means we need to build a reusable framework which we can highly optimize to use for all of it. We aren't going to be manually tuning it but rather choosing open data for training it. If there are weaknesses with the generated model then we can address that by adding more training data and more intensive training.
in reply to GrapheneOS

disabled users are often a very specific kind of power user, with specific power user needs. You have disabled people telling you this, and that a neural TTS will never operate at acceptable speeds even if you were able to leverage the TPU for latency.

I know y'all are very proud of the thing you made. But was it the correct cool thing to make? Survey says no. "Well Google solves this problem with the TPU and plenty of people cope with what they're given" is not a very good product design rationale.

in reply to MaddieM4

@MaddieM4 Our TTS and screen reader don't need to be the best options for blind users yet but rather good enough to set up the OS themselves including obtaining their preferred software. Prior to this OS release, it wasn't possible for a blind user to set up GrapheneOS themselves after installing it. It's now very straightforward.

Google doesn't release the TalkBack source code properly and some functionality depends on Play services. It's fine because it's good enough to obtain any app.

in reply to Ra (Freyja) (it/its)

@freya@highenergymagic.net @MaddieM4 It needs to be good enough to easily set up GrapheneOS the way people want it including installing their TTS and screen reader of choice. It doesn't need to meet the extremely high bar of being what someone who entirely relies on TTS chooses to use as a TTS engine to achieve the initial goals for it. Our Camera app is nowhere near as fancy as Pixel Camera but it works and people can install Pixel Camera if they want it. We don't expect our Camera app to satisfy a photographer.
This entry was edited (Tuesday, May 26, 2026, 5:03 AM)
in reply to GrapheneOS

@MaddieM4 it needs to be good enough to be usable day-to-day, and this isn't. shouldn't you want the default stuff built-in, and thus subject to GrapheneOS code quality and security verification, to be as good as possible? you wouldn't say that your web browser is good enough to go get chrome from the play store, would you? so why drop the ball on this
in reply to Ra (Freyja) (it/its)

@MaddieM4 Most of the apps currently included with the OS are quite bad and need massive improvements. Vanadium has great security and basic Chromium functionality but is missing a lot of what people want and therefore a lot of people don't use it yet. We've been slowly improving it but soon we're going to have the resources needed to drastically improve it and the other apps. TTS had to start somewhere and this is the first experimental release which has a lot of issues left to resolve.
in reply to GrapheneOS

@MaddieM4 your tts needs to be as good as your verified boot, as your kernel security modifications. tts is not a feature, it is *structural*. it either gets *perfect*, which yall could have done by cloning espeak's method rather than doing vague non-parametric neural bullshit that doesn't scale in speed properly, it needs to be *really, really good*, or you go back to the drawing board. please go back to the drawing board on this one.
in reply to Ra (Freyja) (it/its)

@MaddieM4 TTS being very good means it needs to have both high performance and high quality instead of only one or the other. Apple and Google are both aiming to do both and we should be too.

The latency should already be quite good but it doesn't yet work well at high speeds, which isn't something we've been focusing on yet. There's a lot of room for improvement in multiple ways. It should be possible to make it an order of magnitude faster, and various quirks also need to be fixed.
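To turn "works at high speeds" into a concrete target, it helps to write down the throughput budget. The sketch below assumes the speed-up is applied after synthesis, as described elsewhere in this thread, so playing at N times normal speed consumes N seconds of normal-speed audio per second of playback. The function names and the 0.25 s measurement are hypothetical, and the cost of the speed adjustment itself is ignored.

```python
def max_real_time_factor(speech_rate: float) -> float:
    """Largest real-time factor that still keeps playback fed.

    Real-time factor = seconds of compute per second of normal-speed audio.
    speech_rate is a multiplier, so 6.0 means 600%.
    """
    if speech_rate <= 0:
        raise ValueError("speech_rate must be positive")
    return 1.0 / speech_rate


def keeps_up(compute_seconds: float, audio_seconds: float, speech_rate: float) -> bool:
    """True if the measured synthesis speed fits the budget for this rate."""
    real_time_factor = compute_seconds / audio_seconds
    return real_time_factor <= max_real_time_factor(speech_rate)


# Hypothetical measurement: 0.25 s of compute per 1 s of normal-speed audio.
for rate in (1.0, 2.5, 6.0):
    budget = max_real_time_factor(rate)
    ok = keeps_up(0.25, 1.0, rate)
    print(f"{rate:.1f}x: budget {budget:.3f}, keeps up: {ok}")
```

At 6x the budget is 1/6, so each second of normal-speed audio has to be produced in roughly 167 ms. This is a throughput bound only; it says nothing about how quickly the first sound starts after a key press or swipe, which is a separate requirement.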

in reply to GrapheneOS

The media in this post is not displayed to visitors. To view it, please go to the original post.

@MaddieM4 it will not, and cannot, work at high speed because of how you have created it. it cannot do it. do you understand? neural tts does not scale to the speech rates I require. For context, *this* is what it needs to be able to sound like:
in reply to GrapheneOS

@MaddieM4 There are already significant improvements to the quality and performance prepared for the next release of the app. We don't have the massive architectural and acceleration changes which are possible implemented yet but we can do that. We'll also be able to reuse most of the work for other languages beyond English. English (US) is only the starting point. It needs to be a scalable approach where a small team can make software covering a lot of languages rather than only English.
in reply to GrapheneOS

@MaddieM4 yeah, and you're not going to get that with neural models, plus the fact it's generating samples from a model places a hard limit on how fast it can go, and how low-latency it can get. needing a powerful processor and/or TPU unfairly hurts people who can only afford cheaper devices
in reply to Ra (Freyja) (it/its)

@MaddieM4 We are going to get drastically better performance. It's going to get a lot faster from making basic optimizations and architectural changes alone. It's going to get drastically faster from using the hardware acceleration that's available on both current and future devices. It can also have much better trained models with much more data and much more compute put into them which will enable using ones which can run faster. You're trying the first experimental public release.
in reply to GrapheneOS

@MaddieM4 no, you are failing to understand: it is not going to get screenreader-fast using a neural model. you cannot do it. you do not have the skills, the testers, the budget, the data, the hardware. You do not have it. You are not capable of it. admit when you are not capable of it, give up, cleanroom reverse engineer espeak (or, you know, shush and just ship GPL3 code and deal with it). I am telling you, as a blind power user, you are doing, the wrong, thing
in reply to GrapheneOS

You are putting AI in the audio equivalent to a text viewer, someone is complaining it's slow, and you're saying you're gonna make the AI better. Two engineers 40 years ago implemented a better solution to the problem you're solving on a computer with less than 64,000 bytes of available memory and your method is fundamentally broken because it is slower on a computer that is 1,000,000 times faster. What is so bad about not taking the soykaf route? Why are you so attached to this technique that doesn't match the needs of power users, the type you're trying to court with your system?
in reply to trinity

@3 We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.

It isn't optimized yet. It spends most time in the library for adjusting speed and pitch rather than running a neural network.

in reply to GrapheneOS

@3 For neural networks, every device supported by GrapheneOS (Tensor) and devices we'll be supporting in the future (Snapdragon) has a large portion of the SoC dedicated to accelerating neural networks. The reason Pixels don't have a much more powerful GPU is because they dedicated the die space to the TPU. Our implementation isn't using the hardware functionality for accelerating it yet. We're doing the equivalent of rendering 3D graphics on the CPU. It will get drastically faster.
in reply to GrapheneOS

@3 The approach we're using will provide a much more natural sounding voice and much higher accuracy. We don't need to use an approach from 40 years ago with much lower quality output to provide the desired latency and throughput. It hasn't been optimized yet. We need to use the hardware acceleration a large portion of each smartphone SoC is dedicated to providing to make it very fast and power efficient. It's little different from a video game needing a GPU to run at 120 FPS instead of 3 FPS.
in reply to GrapheneOS

@3 We're currently doing the equivalent of making a video game with graphics matching games released a few years ago by entirely using the CPU to render it. We need to make a lot of basic improvements and optimizations to improve the quality, performance and power efficiency prior to adding hardware acceleration. Adding hardware acceleration will move it into a completely different performance/efficiency category. We've said it's an experimental early version and it's not very optimized yet.
in reply to GrapheneOS

@3 We want high quality text-to-speech output and we also want speech-to-text. We want it for more than English (US). Once we optimize it and implement hardware acceleration, it's going to be more than fast enough to provide the level of performance that's being requested and high power efficiency. We don't need to use an approach for drastically less powerful computers. We need basic optimizations and need to use the dedicated processor on Tensor/Snapdragon for accelerating that part of it.
in reply to GrapheneOS

@3 Most of the CPU time currently isn't spent doing neural network processing. The assumption that it's not as fast as wanted because of neural network processing is wrong. It can be made drastically faster without hardware acceleration and then using hardware acceleration will put it into a completely different performance category.

If there wasn't a GPU, the AOSP GUI would be very slow and janky due to always being rendered in 3D with Vulkan. It's very sensible and results in a better GUI.

in reply to GrapheneOS

@MaddieM4 was it tested at high speech rates? no. was it filtered to remove the weird echo it has? no. was it filtered to avoid weird inflection and pitch issues that, surprise fucking surprise, would not have happened if you used algorithmic generation? NO! yall tested it at the default speech rate, said "yep it speaks", and shipped it! Get rid of the neural model! nobody wants it. I know it's how you do tts in the 2020s, because sighted people want to match the style of google tts. no! bad! hands off the neural models!
in reply to Ra (Freyja) (it/its)

@MaddieM4 It's the first experimental public release. We know it has assorted artifacts, performance issues, crashes and other quirks. It reached the point where it's useful and ready to make available for early public testing and that's what's happening right now. It's going to have massive improvements made to it from this starting point. It's the proof of concept phase and a lot more resources are going to go into it. We want to do a lot more than English (US) TTS in this area.
in reply to GrapheneOS

@MaddieM4 We plan to implement support for various other languages and speech-to-text support too. We're building it in a way where we can do this with most of the work shared between them by relying on open data projects for training speech models in various languages. If we were only making an English (US) TTS for use with a screen reader then it could be approached differently. We're going to be taking on multiple more ambitious projects than usual in the near future, not only this.
in reply to MaddieM4

@MaddieM4 exactly. blind users need a tiny, minimalist, fast tts engine.... hell, fuck, I can't even crank up the speech rate on the GrapheneOS tts past like 250% or it starts to glitch, and no, you can't fix that, you're using an under-trained model based on real human voice data, that's never going to work at high speeds, come on, yall can do better. I know yall can do better
in reply to Ra (Freyja) (it/its)

does espeak mean emacspeak here? is there an open source project or research paper you consider to be best in class that could be used as a reference for this approach?

i'm responding without tagging grapheneos bc your argument here aligns with my understanding of how to build expert tooling and i'm interested in helping to make this case / becoming more familiar with this technology. i'm about to reply to them separately too along these lines

in reply to Ra (Freyja) (it/its)

@Ra (Freyja) (it/its) I know you are trying to make the most of this for you and other power users, however I just wish to add that not all blind users are so hooked on Eloquence TTS, and only a few of them can use their TTS at such high speech rates. As I understand it, with this initial version of Graphene Speech services we can at least get through the initial setup of @GrapheneOS independently and then switch to our TTS synthesizer of choice. For example, I am very unlikely to stop using #RHVoice and you are very unlikely to stop using #Eloquence no matter what the GrapheneOS folks implement.
Having really good modern open-source natural sounding TTS for reading books is not a waste as you are pointing out especially if they are planning to make the other direction part of the app i.e. speech to text as well.

We have the same thing with #NVDA on Windows. By default it uses the natural-sounding voice built into Windows, which has the same flaws you are pointing out here, and it also includes espeak-ng as an open-source alternative capable of running at very, very high speech rates. And you are still adding Eloquence to your setup even though you need to jump through extra hoops to make it run on modern hardware and software configurations.

It's a very good thing you are advocating for all the blind users here but still if I were you I would try to calm down a bit.

in reply to Peter Vágner

@pvagner We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.

It isn't optimized yet. It spends most time in the library for adjusting speed and pitch, not neural processing.

in reply to GrapheneOS

@pvagner For neural networks, every device supported by GrapheneOS (Tensor) and devices we'll be supporting in the future (Snapdragon) has a large portion of the SoC dedicated to accelerating neural networks. The reason Pixels don't have a much more powerful GPU is because they dedicated the die space to the TPU. Our implementation isn't using the hardware functionality for accelerating it yet. We're doing the equivalent of rendering 3D graphics on the CPU. It will get drastically faster.