ok. ok so whoever created the @GrapheneOS tts engine....... you were optimising for maps and sighted user tts, weren't you? because this is soooooo sloooow. so much latency
@solonovamax yeah I just..... they told me that I'm expecting too much from an initial release. like no, I'm expecting the same latency profile as a P3 running Windows XP
@solonovamax I got harassed for being plural and a (not a human), with their band of Nazis from places like poa.st harassing me on their behalf after I defederated them (had to shoot down like 10 instances + a mastosoc army) :> I spoke briefly in their chatrooms, and they got pissy I used plural pronouns and said I'm not a human.
@solonovamax they tend to stalk anyone that pings them unless they and their Nazi band is defederated, the guy that runs the GrapheneOS account is SO chronically online he puts 99.9% of fedi to shame, esp. if you like link a fedi post or such
@tranquillity @solonovamax oh I have too many plural and therian friends to not be a petty bitch about this. Guess the choice between GrapheneOS, eOS, Sailfish and LineageOS just got a little simpler.
It will be improved, but there's a starting point for everything. We didn't want to delay it too much, but the current target was similar latency to Google's Speech Recognition & Synthesis, but perhaps it isn't as fast at higher speeds?
Let's focus on actionable goals! Where could the latency be improved? Is it perhaps too slow at very high speeds?
(me) maybe a setting could be offered so that users can choose which they want?
(them) Maybe! Configuring the amount of "effort" to make it faster was something we wanted but it wasn't working out initially and we didn't want to delay the release for that.
But what are the exact latency issues? We need clear targets.
for example, currently the TTFA (time-to-first-audio) is ~150 ms on a Pixel 8a.
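A rough sketch of how a time-to-first-audio figure like this could be measured for any engine that streams audio in chunks. `fake_engine` is a hypothetical stand-in, not the GrapheneOS engine or any real API, and the timing covers synthesis only, not the Android audio output path or accessibility stack.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_engine(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical streaming engine: yields one PCM chunk per word."""
    for _ in text.split():
        time.sleep(chunk_delay_s)   # pretend synthesis work for this chunk
        yield b"\x00\x00" * 160     # 160 silent 16-bit samples


def measure_ttfa(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> Tuple[float, float]:
    """Return (time-to-first-audio, total synthesis time) in seconds."""
    start = time.perf_counter()
    first = None
    for chunk in synthesize(text):
        if first is None and chunk:
            # First non-empty chunk: this is the TTFA point.
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("engine produced no audio")
    return first, total


if __name__ == "__main__":
    ttfa, total = measure_ttfa(fake_engine, "hello there world")
    print(f"TTFA {ttfa * 1000:.1f} ms, total {total * 1000:.1f} ms")
```

For screen reader use the number that matters is the delay from keypress or gesture to sound, so a measurement like this is only a lower bound on what the user hears.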
No, performance was a major focus for it. It already has much lower latency than the existing options we were comparing against and there's huge room for improvement. It's the first public experimental release of the software and it has quirks to resolve including treating newlines as significant where reading newlines takes time and creates a barely audible sound. It doesn't use any form of hardware acceleration yet either. You're expecting too much from the first experimental release.
It's a different type of software and is meant to become a competitive alternative to Google Speech Recognition & Synthesis. It has a lot of room for improvement including simply not reading newlines which can waste a lot of time if the text being passed to it doesn't have the whitespace stripped down.
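A minimal sketch of the whitespace stripping mentioned here, done on the caller's side before text reaches the engine. The function name is made up for illustration, and it assumes newlines carry no meaning the listener needs (a real screen reader may want paragraph breaks kept as short pauses).

```python
import re

# Any run of spaces, tabs, or newlines.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_for_tts(text: str) -> str:
    """Collapse every whitespace run to one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


if __name__ == "__main__":
    print(repr(normalize_for_tts("Hello,\n\n  world!\t\n")))
```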
It doesn't use hardware acceleration yet and was trained on an AMD RX 6600 prior to obtaining an RTX 5090. It can't be expected to keep up with Google's yet but it will get better.
the problem is not how it was trained, the problem is that it's a neural tts voice. that is always going to have very high latency, and you're also fighting with the android accessibility latency. please. contract with someone to develop something like dectalk or espeak. no training data, just raw waveform synthesis
It's already good enough for a lot of use cases and blind users can now use a fresh install of GrapheneOS without help. It's good enough to get through the setup wizard to install Google Speech Recognition & Synthesis or eSpeak NG.
It matters a lot which device you're testing on since it might be twice as fast on a Pixel 10 as on a Pixel 6. It solely runs on the CPU right now and Pixels don't have a great CPU. It will benefit a lot from hardware acceleration, especially on Pixels.
a tts engine should not require hardware acceleration. it is a bloody waveform generator, but you're just assuming tts engines have to be neural-based. they don't. look at DecTalk, look at IBM TTS / Eloquence, look at something like that
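To make "waveform generator" concrete, here is a toy source-filter sketch: a pulse train pushed through a few two-pole resonators, loosely in the spirit of classic formant synthesizers. It is not how DECtalk, Eloquence, or eSpeak are actually implemented, and the formant values are illustrative assumptions, not measured data.

```python
import math
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16000  # Hz, assumed for this sketch


def resonator(signal: Sequence[float], freq_hz: float, bandwidth_hz: float,
              sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz: float, duration_s: float,
                  sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Crude voicing source: one unit pulse per pitch period."""
    period = max(1, round(sample_rate / f0_hz))
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synth_vowel(formants: Sequence[Tuple[float, float]],
                f0_hz: float = 120.0, duration_s: float = 0.2) -> List[float]:
    """Run the source through one resonator per formant, then peak-normalize."""
    signal = impulse_train(f0_hz, duration_s)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# (frequency, bandwidth) pairs in Hz for an "ah"-like vowel; illustrative only.
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

if __name__ == "__main__":
    samples = synth_vowel(AH)
    print(len(samples), "samples")
```

Each output sample costs a handful of multiplies and adds per formant, which is why this family of synthesizers ran on 8-bit hardware. The hard part is the rules that turn text into formant tracks, and that part is not shown here.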
Our goal isn't implementing one of those. We want to have a natural voice capable of reading a lot more text well. It's entirely possible to make it perform very well. Tensor and Snapdragon both dedicate a lot of hardware to neural network acceleration. Pixels don't have competitive CPU performance and haven't focused on it. A major part of why recent Pixels don't have a better GPU is they dedicated a lot of space to the huge TPU and GrapheneOS doesn't yet use it beyond image processing.
then you're optimising for sighted users. ebook readers, map apps, that kind of thing. completely incorrect tts framework for this. a tts engine should not use the GPU, TPU, or anything other than the CPU. if the Apple IIe can do it with an Echo card in 1985, yall can do it without leaning on weird neural tts shit
that doesn't seem to match what fully non-sighted users need, and those are the people who will have the most demanding needs and expectations, because their usability is fully contingent on TTS.
@MaddieM4 Many blind users have told us they use Google's Speech Recognition & Synthesis. Why isn't it possible to provide a competitive open source implementation of what they provide? They're incredibly understaffed and make lots of bad decisions. We can use more bleeding edge technology than they can if we put resources into it.
Now a blind user can start with a fresh GrapheneOS install and set it up themselves without needing help from anyone including installing their preferred TTS.
@MaddieM4 because google can run at 600+% speech rate, and yours can't. neural tts is always going to fucking suck anyway, why did yall go for that and not just do your own clone of espeak?
@MaddieM4 Many blind users told us they use Google's Speech Recognition & Synthesis and are happy with it. We're entirely capable of making ours perform as well as that. The first experimental release of our app isn't going to be competitive with theirs yet. It has a lot of straightforward bugs and performance issues to resolve.
We built our own network-based location too and we keep improving it. It went from barely working to extremely competitive. Our latest release improved it more.
very lengthy persuasive argument to reframe this as an engineering tradeoff, not hating or being negative at all, i love you, no response needed
[edit: after writing all this i realize this is a topic i'm interested in researching further myself. if i follow up on this myself i will come back to you later with a more specific design proposal. i strongly urge your team to read this feedback as a proposal for future work. love you]
hopefully this feedback can be understood to represent a benchmark that the grapheneOS team can compare their work against, so that the team can be confident their work is not merely commensurate with google's best, but actively better. i'm also curious whether waveform generation leads to more easily auditable code that can be tuned to specific use cases by developers, instead of being blocked upon a centralized training process; such a setup also seems like it could make fixing bugs more difficult.
i am unfortunately not willing to put in the effort to prototype or research this alternative waveform-based non-statistical approach myself, so i don't plan to harangue you any further on this subject [edit: i will harangue if i can bring you a more concrete proposal]. i am chiming in because i believe it is worth your time to consider the feedback from user freya in the context of a potential longer-term research program which would require significant investment but may demonstrate equally significant returns in the form of end user empowerment.
the analogy i would draw here would be IDE indexing which is able to perform specific classes of queries, but isn't user extensible and limits programming language support, vs regex search within a code directory, which has deterministic behavior and responds to user feedback. an attempt at direct waveform generation as proposed by user freya may not remotely sound lifelike, but a machine is not supposed to sound pretty.
google translate switched several years ago from using classical NLP with per-language heuristics to a zero-shot unsupervised approach, and it immediately fell from being an expert tool (much like a dictionary that supports grammatical queries) down to something that assumes the user doesn't really care about nuance (this was before the current manifestation which additionally rewrites output to match its training data). i used the original version of the tool while intensively studying multiple languages, and as a semi-fluent speaker i now prefer a dictionary like jisho.org over the new translation paradigm.
i don't believe you're incorrect to claim that the current approach is satisfactory for users (nor do i believe freya is contesting that). i also very much understand why the statistical approach is attractive, because waveform generation is an extremely specialized area of study which tends to be limited to proprietary systems. the proposal to heuristically generate speech waveforms represents a risky investment, and the success criteria would be incomparable to google's offering. i understand the fear of being designated as "archaic" in the press for appearing to avoid the "modern" statistical approach.
i would urge you to adopt a less defensive stance on this matter and instead to see this as an engineering tradeoff. i furthermore urge you to consider why deterministic behavior might be attractive to expert users (particularly programmers) over statistical output that can't be easily specialized to specific contexts. i believe freya is giving you advice because she sees grapheneos as genuinely interested in doing the right thing.
@hipsterelectron Regardless of the technical details of how we convert text to speech and speech to text, it needs to be implemented based on automatic training from open data. It isn't feasible for GrapheneOS to manually hard-wire and tune text-to-speech output for a bunch of languages. It would be entirely possible to output a faster and less natural sounding model not using a neural network. However, a neural network should be more than fast enough to achieve the requested latency/throughput.
It's not even the neural network where most of the CPU time is being spent right now. We used an AndroidX API for implementing speed and pitch configuration using the Java variant of the Sonic library. It also has a C variant which we could use directly. We could also likely find a more optimized library instead. Once that's much faster, neural network processing should become where most time is spent. We can use hardware acceleration available on every current/future device.
This entry was edited (Wednesday, May 27, 2026, 2:46 PM)
@hipsterelectron A large portion of every Tensor and Snapdragon SoC is dedicated to accelerating neural networks. It's a major factor in why Pixels don't have a much more powerful GPU because they dedicated die space to the TPU instead. GrapheneOS currently only uses the TPU as part of certain image/video processing functionality. HDR+ is mostly implemented in a more direct way but HDRnet processing for videos uses it including for real time processing of the video shown as a camera preview.
@hipsterelectron We aren't only implementing text-to-speech for English (US). We're implementing both text-to-speech and speech-to-text for at least around 10 languages. That means we need to build a reusable framework which we can highly optimize to use for all of it. We aren't going to be manually tuning it but rather choosing open data for training it. If there are weaknesses with the generated model then we can address that by adding more training data and more intensive training.
disabled users are often a very specific kind of power user, with specific power user needs. You have disabled people telling you this, and that a neural TTS will never operate at acceptable speeds even if you were able to leverage the TPU for latency.
I know y'all are very proud of the thing you made. But was it the correct cool thing to make? Survey says no. "Well Google solves this problem with the TPU and plenty of people cope with what they're given" is not a very good product design rationale.
@MaddieM4 Our TTS and screen reader don't need to be the best options for blind users yet but rather good enough to set up the OS themselves including obtaining their preferred software. Prior to this OS release, it wasn't possible for a blind user to set up GrapheneOS themselves after installing it. It's now very straightforward.
Google doesn't release the TalkBack source code properly and some functionality depends on Play services. It's fine because it's good enough to obtain any app.
@MaddieM4 ....................... "it's fine because it's good enough" no. that's not how this works. can I ask, do you have any blind developers who actually tested the fuck out of this?
@freya@highenergymagic.net @MaddieM4 It needs to be good enough to easily set up GrapheneOS the way people want it including installing their TTS and screen reader of choice. It doesn't need to meet the extremely high bar of being what someone who entirely relies on TTS chooses to use as a TTS engine to achieve the initial goals for it. Our Camera app is nowhere near as fancy as Pixel Camera but it works and people can install Pixel Camera if they want it. We don't expect our Camera app to satisfy a photographer.
@MaddieM4 it needs to be good enough to be usable day-to-day, and this isn't. shouldn't you want the default stuff built-in, and thus subject to GrapheneOS code quality and security verification, to be as good as possible? you wouldn't say that your web browser is good enough to go get chrome from the play store, would you? so why drop the ball on this
@MaddieM4 Most of the apps currently included with the OS are quite bad and need massive improvements. Vanadium has great security and basic Chromium functionality but is missing a lot of what people want and therefore a lot of people don't use it yet. We've been slowly improving it but soon we're going to have the resources needed to drastically improve it and the other apps. TTS had to start somewhere and this is the first experimental release which has a lot of issues left to resolve.
@MaddieM4 your tts needs to be as good as your verified boot, as your kernel security modifications. tts is not a feature, it is *structural*. it either gets *perfect*, which yall could have done by cloning espeak's method rather than doing vague non-parametric neural bullshit that doesn't scale in speed properly, or at least *really, really good*, or you go back to the drawing board. please go back to the drawing board on this one.
@MaddieM4 TTS being very good means it needs to have both high performance and high quality instead of only one or the other. Apple and Google are both aiming to do both and we should be too.
The latency should already be quite good but it doesn't yet work well at high speeds, which isn't something we've been focusing on yet. There's a lot of room for improvement in multiple ways. It should be possible to make it an order of magnitude faster and various quirks also need to be fixed.
@MaddieM4 it will not, and cannot, work at high speed because of how you have created it. it cannot do it. do you understand? neural tts does not scale to the speech rates I require. For context, *this* is what it needs to be able to sound like:
@MaddieM4 There are already significant improvements to the quality and performance prepared for the next release of the app. We don't have the massive architectural and acceleration changes which are possible implemented yet but we can do that. We'll also be able to reuse most of the work for other languages beyond English. English (US) is only the starting point. It needs to be a scalable approach where a small team can make software covering a lot of languages rather than only English.
@MaddieM4 yeah, and you're not going to get that with neural models, plus the fact it's generating samples from a model places a hard limit on how fast it can go, and how low-latency it can get. needing a powerful processor and/or TPU unfairly hurts people who can only afford cheaper devices
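The speed ceiling being argued about can be put as simple arithmetic. Assume (as a simplification) the engine synthesizes audio at normal rate and a later stage time-compresses it for playback. If producing 1 s of normal-rate audio costs `rtf` seconds of compute (the real-time factor), then at playback rate `rate` that audio is used up in `1 / rate` seconds, so the engine only keeps up when `rtf * rate <= 1`. The numbers below are hypothetical, not measurements of any engine.

```python
def max_sustainable_rate(rtf: float) -> float:
    """Highest playback rate (1.0 = 100%) a given real-time factor can feed."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def keeps_up(rtf: float, rate: float) -> bool:
    """True when synthesis finishes before sped-up playback runs dry."""
    return rtf * rate <= 1.0


if __name__ == "__main__":
    # Hypothetical real-time factors, paired with a 600% speech rate.
    for rtf in (0.05, 0.1, 0.25):
        print(f"rtf={rtf}: ok at 6x? {keeps_up(rtf, 6.0)}, "
              f"ceiling {max_sustainable_rate(rtf):.1f}x")
```

This covers throughput only. Time to first audio is a separate budget, and it does not shrink just because sustained throughput is high.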
@MaddieM4 We are going to get drastically better performance. It's going to get a lot faster from making basic optimizations and architectural changes alone. It's going to get drastically faster from using the hardware acceleration that's available on both current and future devices. It can also have much better trained models with much more data and much more compute put into them which will enable using ones which can run faster. You're trying the first experimental public release.
@MaddieM4 no, you are failing to understand: it is not going to get screenreader-fast using a neural model. you cannot do it. you do not have the skills, the testers, the budget, the data, the hardware. You do not have it. You are not capable of it. admit when you are not capable of it, give up, cleanroom reverse engineer espeak (or, you know, shush and just ship GPL3 code and deal with it). I am telling you, as a blind power user, you are doing, the wrong, thing
You are putting AI in the audio equivalent to a text viewer, someone is complaining it's slow, and you're saying you're gonna make the AI better. Two engineers 40 years ago implemented a better solution to the problem you're solving on a computer with less than 64,000 bytes of available memory and your method is fundamentally broken because it is slower on a computer that is 1,000,000 times faster. What is so bad about not taking the soykaf route? Why are you so attached to this technique that doesn't match the needs of power users, the type you're trying to court with your system?
@3 We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.
It isn't optimized yet. It spends most time in the library for adjusting speed and pitch rather than running a neural network.
@3 For neural networks, every device supported by GrapheneOS (Tensor) and devices we'll be supporting in the future (Snapdragon) has a large portion of the SoC dedicated to accelerating neural networks. The reason Pixels don't have a much more powerful GPU is because they dedicated the die space to the TPU. Our implementation isn't using the hardware functionality for accelerating it yet. We're doing the equivalent of rendering 3D graphics on the CPU. It will get drastically faster.
@3 The approach we're using will provide a much more natural sounding voice and much higher accuracy. We don't need to use an approach from 40 years ago with much lower quality output to provide the desired latency and throughput. It hasn't been optimized yet. We need to use the hardware acceleration a large portion of each smartphone SoC is dedicated to providing to make it very fast and power efficient. It's little different from a video game needing a GPU to run at 120 FPS instead of 3 FPS.
@3 We're currently doing the equivalent of making a video game with graphics matching games released a few years ago by entirely using the CPU to render it. We need to make a lot of basic improvements and optimizations to improve the quality, performance and power efficiency prior to adding hardware acceleration. Adding hardware acceleration will move it into a completely different performance/efficiency category. We've said it's an experimental early version and it's not very optimized yet.
@3 We want high quality text-to-speech output and we also want speech-to-text. We want it for more than English (US). Once we optimize it and implement hardware acceleration, it's going to be more than fast enough to provide the level of performance that's being requested and high power efficiency. We don't need to use an approach for drastically less powerful computers. We need basic optimizations and need to use the dedicated processor on Tensor/Snapdragon for accelerating that part of it.
@3 Most of the CPU time currently isn't spent doing neural network processing. The assumption that it's not as fast as wanted because of neural network processing is wrong. It can be made drastically faster without hardware acceleration and then using hardware acceleration will put it into a completely different performance category.
If there wasn't a GPU, the AOSP GUI would be very slow and janky due to always being rendered in 3D with Vulkan. It's very sensible and results in a better GUI.
@MaddieM4 was it tested at high speech rates? no. was it filtered to remove the weird echo it has? no. was it filtered to avoid weird inflection and pitch issues that, surprise fucking surprise, would not have happened if you used algorithmic generation? NO! yall tested it at the default speech rate, said "yep it speaks", and shipped it! Get rid of the neural model! nobody wants it. I know it's how you do tts in the 2020s because sighted people want to match the style of google tts- no! bad! hands off the neural models!
@MaddieM4 It's the first experimental public release. We know it has assorted artifacts, performance issues, crashes and other quirks. It reached the point where it's useful and ready to make available for early public testing and that's what's happening right now. It's going to have massive improvements made to it from this starting point. It's the proof of concept phase and a lot more resources are going to go into it. We want to do a lot more than English (US) TTS in this area.
@MaddieM4 We plan to implement support for various other languages and also speech-to-text support. We're building it in a way where we can do this with most of the work shared between them by relying on open data projects for training speech models in various languages. If we were only making an English (US) TTS for use with a screen reader then it could be approached differently. We're going to be taking on multiple more ambitious projects than usual in the near future, not only this.
@MaddieM4 precisely. disabled user here, google has billions of dollars, and even their tts is a high-latency mess sometimes. drop the AI. drop the neural generation. go back to the old way. please
@MaddieM4 exactly. blind users need a tiny, minimalist, fast tts engine.... hell, fuck, I can't even crank up the speech rate on the GrapheneOS tts past like 250% or it starts to glitch, and no, you can't fix that, you're using an under-trained model based on real human voice data, that's never going to work at high speeds, come on, yall can do better. I know yall can do better
does espeak mean emacspeak here? is there an open source project or research paper you consider to be best in class that could be used as a reference for this approach?
i'm responding without tagging grapheneos bc your argument here aligns with my understanding of how to build expert tooling and i'm interested in helping to make this case / becoming more familiar with this technology. i'm about to reply to them separately too along these lines
We have the same thing with #NVDA on Windows. By default it uses the natural sounding voice built into Windows, which has the same mistakes you are pointing out here, and it also includes espeak-ng as an open source alternative capable of running at very, very high speech rates. And you are still adding Eloquence to your setup even though you need to jump through extra hoops to make it run at all on modern hardware and software configurations.
It's a very good thing you are advocating for all the blind users here but still if I were you I would try to calm down a bit.
@pvagner We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.
It isn't optimized yet. It spends most time in the library for adjusting speed and pitch, not neural processing.
@pvagner For neural networks, every device supported by GrapheneOS (Tensor) and devices we'll be supporting in the future (Snapdragon) has a large portion of the SoC dedicated to accelerating neural networks. The reason Pixels don't have a much more powerful GPU is because they dedicated the die space to the TPU. Our implementation isn't using the hardware functionality for accelerating it yet. We're doing the equivalent of rendering 3D graphics on the CPU. It will get drastically faster.
@solonovamax
GrapheneOS: we added tts!
GrapheneOS: it's a neural voice with 100+ms of audio latency
fuck....sake...
ah, yeah
I'm doing more of an "ik some people dislike talking to others so I'm offering to help"
from matrix
High Energy Social
social.highenergymagic.netsolo
in reply to solo • • •Ra (Freyja) (it/its)ππΉπ π©
in reply to solo • • •solo
in reply to Ra (Freyja) (it/its)ππΉπ π© • • •Ra (Freyja) (it/its)ππΉπ π©
in reply to solo • • •GrapheneOS
in reply to Ra (Freyja) (it/its)ππΉπ π© • • •Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •GrapheneOS
in reply to Ra (Freyja) (it/its)ππΉπ π© • • •It's a different type of software and is meant to become a competitive alternative to Google Speech Recognition & Synthesis. It has a lot of room for improvement including simply not reading newlines which can waste a lot of time if the text being passed to it doesn't have the whitespace stripped down.
It doesn't use hardware acceleration yet and was trained on a AMD RX 6600 prior to obtaining an RTX 5090. It can't be expected to keep up with Google's yet but it will get better.
Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •GrapheneOS
in reply to GrapheneOS • • •It's already good enough for a lot of use cases and blind users can now use a fresh install of GrapheneOS without help. It's good enough to get through the setup wizard to install Google Speech Recognition & Synthesis or eSpeak NG.
It matters a lot which device you're testing on since it might be twice as fast on a Pixel 10 than a Pixel 6. It solely runs on the CPU right now and Pixels don't have a great CPU. It will benefit a lot from hardware acceleration, especially on Pixels.
Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •GrapheneOS
in reply to Ra (Freyja) (it/its)ππΉπ π© • • •Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •MaddieM4
in reply to GrapheneOS • • •GrapheneOS
in reply to MaddieM4 • • •@MaddieM4 Many blind users have told us they use Google's Speech Recognition & Synthesis. Why isn't it possible to provide a competitive open source implementation of what they provide? They're incredibly understaffed and make lots of bad decisions. We can use more bleeding edge technology than they can if we put resources into it.
Now a blind user can start with a fresh GrapheneOS install and set it up themselves without needing help from anyone including installing their preferred TTS.
Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •GrapheneOS
in reply to Ra (Freyja) (it/its)ππΉπ π© • • •@MaddieM4 Many blind users told us they use Google's Speech Recognition & Synthesis and are happy with it. We're entirely capable of making ours perform as well as that. The first experimental release of our app isn't going to be competitive with theirs yet. It has a lot of straightforward bugs and performance issues to resolve.
We built our own network-based location too and we keep improving it. It went from barely working to extremely competitive. Our latest release improved it more.
Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •d@nny disc@
in reply to GrapheneOS • • •Sensitive content
[edit: after writing all this i realize this is a topic i'm interested in researching further myself. if i follow up on this myself i will come back to you later with a more specific design proposal. i strongly urge your team to read this feedback as a proposal for future work. love you]
hopefully this feedback can be understood to represent a benchmark that the grapheneOS team can compare their work against, so that the team can be confident their work is not merely commensurate with google's best, but actively better. i'm also curious whether waveform generation leads to more easily auditable code that can be tuned to specific use cases by developers, instead of being blocked upon a centralized training processβsuch a setup also seems like it could make fixing bugs more difficult.
i am unfortunately not willing to put in the effort to prototype or research this alternative waveform-based non-statistical approach myself, so i don't plan to harangue you any further on this subject [edit: i will harangue if i can bring you a more concrete proposal]. i am chiming in because i believe it is worth your time to consider the feedback from user freya in the context of a potential longer-term research program which would require significant investment but may demonstrate equally significant returns in the form of end user empowerment.
the analogy i would draw here would be IDE indexing which is able to perform specific classes of queries, but isn't user extensible and limits programming language support, vs regex search within a code directory, which has deterministic behavior and responds to user feedback. an attempt at direct waveform generation as proposed by user freya may not remotely sound lifelike, but a machine is not supposed to sound pretty.
google translate several years ago switched from using classical NLP with per-language heuristics to a zero-shot unsupervised approach several years ago, and it immediately fell from being an expert tool (much like a dictionary that supports grammatical queries) down to something that assumes the user doesn't really care about nuance (this was before the current manifestation which additionally rewrites output to match its training data). i used the original version of the tool while intensively studying multiple languages, and as a semi-fluent speaker i now prefer a dictionary like jisho.org over the new translation paradigm.
i don't believe you're incorrect to claim that the current approach is satisfactory for users (nor do i believe freya is contesting that). i also very much understand why the statistical approach is attractive, because waveform generation is an extremely specialized area of study which tends to be limited to proprietary systems. the proposal to heuristically generate speech waveforms represents a risky investment, and the success criteria would be incomparable to google's offering. i understand the fear of being designated as "archaic" in the press for appearing to avoid the "modern" statistical approach.
i would urge you to adopt a less defensive stance on this matter and instead to see this as an engineering tradeoff. i furthermore urge you to consider why deterministic behavior might be attractive to expert users (particularly programmers) over statistical output that can't be easily specialized to specific contexts. i believe freya is giving you advice because she sees grapheneos as genuinely interested in doing the right thing.
MaddieM4
in reply to GrapheneOS • • •disabled users are often a very specific kind of power user, with specific power user needs. You have disabled people telling you this, and that a neural TTS will never operate at acceptable speeds even if you were able to leverage the TPU for latency.
I know y'all are very proud of the thing you made. But was it the correct cool thing to make? Survey says no. "Well Google solves this problem with the TPU and plenty of people cope with what they're given" is not a very good product design rationale.
GrapheneOS
in reply to MaddieM4 • • •@MaddieM4 Our TTS and screen reader don't need to be the best options for blind users yet. They need to be good enough for blind users to set up the OS themselves, including obtaining their preferred software. Prior to this OS release, it wasn't possible for a blind user to set up GrapheneOS themselves after installing it. It's now very straightforward.
Google doesn't release the TalkBack source code properly and some functionality depends on Play services. It's fine because it's good enough to obtain any app.
GrapheneOS
in reply to Ra (Freyja) (it/its) • • •@MaddieM4 TTS being very good means it needs to have both high performance and high quality instead of only one or the other. Apple and Google are both aiming to do both and we should be too.
The latency should already be quite good, but it doesn't yet work well at high speeds, which isn't something we've been focusing on yet. There's a lot of room for improvement in multiple ways. It should be possible to make it an order of magnitude faster, and various quirks need to be fixed too.
Ra (Freyja) (it/its)ππΉπ π©
in reply to GrapheneOS • • •@MaddieM4 it will not, and cannot, work at high speed because of how you have created it. it cannot do it. do you understand? neural tts does not scale to the speech rates I require. For context, *this* is what it needs to be able to sound like:
Leszek Karlik
in reply to bachimusprime • • •For non-sighted users, having the entire area of the brain we use to process images free for audio processing helps a lot.
GrapheneOS
in reply to trinity • • •@3 We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.
It isn't optimized yet. It spends most time in the library for adjusting speech and pitch rather than running a neural network.
GrapheneOS
in reply to GrapheneOS • • •@3 Most of the CPU time currently isn't spent doing neural network processing. The assumption that it's not as fast as wanted because of neural network processing is wrong. It can be made drastically faster without hardware acceleration and then using hardware acceleration will put it into a completely different performance category.
If there weren't a GPU, the AOSP GUI would be very slow and janky due to always being rendered in 3D with Vulkan. Relying on hardware acceleration there is very sensible and results in a better GUI.
d@nny disc@
in reply to Ra (Freyja) (it/its) • • •does espeak mean emacspeak here? is there an open source project or research paper you consider to be best in class that could be used as a reference for this approach?
i'm responding without tagging grapheneos bc your argument here aligns with my understanding of how to build expert tooling and i'm interested in helping to make this case / becoming more familiar with this technology. i'm about to reply to them separately too along these lines
Peter Vágner
in reply to Ra (Freyja) (it/its) • • •@Ra (Freyja) (it/its) I know you are trying to make the most of this for yourself and other power users, but I just wish to add that not all blind users are so hooked on Eloquence TTS, and only a few of them can use their TTS at such high speech rates. As I understand it, with this initial version of Graphene Speech services we can at least get through the initial setup of @GrapheneOS independently and then switch to our TTS synthesizer of choice. For example, I am very unlikely to stop using #RHVoice and you are very unlikely to stop using #Eloquence, no matter what the GrapheneOS folks implement.
Having a really good, modern, open-source, natural-sounding TTS for reading books is not a waste, as you are suggesting, especially if they are planning to make the other direction, i.e. speech to text, part of the app as well.
We have the same thing with #NVDA on Windows. By default it uses the natural-sounding voice built into Windows, which has the same flaws you are pointing out here, and it also includes espeak-ng as an open-source alternative capable of running at very, very high speech rates. And you are still adding Eloquence to your setup even though you need to jump through extra hoops to make it run at all on modern hardware and software configurations.
It's a very good thing that you are advocating for all blind users here, but still, if I were you I would try to calm down a bit.
GrapheneOS
in reply to Peter Vágner • • •@pvagner We're planning on making text-to-speech and speech-to-text for at least around 10 languages. English (US), English (UK), German, French, Japanese, Spanish and Dutch would be a good starting point based on the top languages we estimate are used by our users based on where updates are downloaded. It depends on there being solid open data available for each language.
It isn't optimized yet. It spends most time in the library for adjusting speech and pitch, not neural processing.