TIL: There's a W3C Candidate Recommendation Draft for CSS markup that conveys properties of text and controls on the web through audio cues and changes to TTS volume, speech rate, tone, prosody and pronunciation, kind of like attributed strings in iOS apps. It's called CSS Speech. w3.org/TR/css-speech-1/ #Accessibility #A11y #Blind
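A rough sketch of what the draft's properties look like, going off a quick skim of the spec (the selectors and sound file are invented, and I haven't tested this anywhere):

.product-name {
  voice-family: female;
  voice-rate: slow;
  voice-pitch: high;
}

.sale-badge {
  /* play a short earcon before the text, then speak it louder */
  cue-before: url(chime.wav);
  voice-volume: loud;
}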

in reply to Paweł Masarczyk

There are people who seem to feel really strongly about this being a good thing for screen reader users, and I must admit to being bewildered as to why. Websites changing aspects of screen reader output may be equitable if we compare it with the way webpages can alter visual presentation through fonts and other styling. But to me, it feels entirely inappropriate to cross the boundary between the browser as the user agent and accessibility software in order to interfere with very personal settings.

Meanwhile on iOS, the related accessibility attributes are being used to achieve outcomes nobody wants or needs, like spaces between all the digits of a credit card number. @miki @prism

in reply to James Scholes

I can see the point for e.g. text-to-speech APIs built into the browser, maybe even read-aloud features. But the case for screen reader compatibility seems to be built on the foundational assertion that SR output is monotonous and can't be "livened up" by brands.

As assertions go, I think that is both true and exactly how it should be. I don't use a screen reader for entertainment. I can think of few things more obnoxious than a marketing person thinking that my screen reader should "shout this bit."

Many web authors can't even label stuff correctly. Why on earth would we expect them to treat this sort of feature with informed respect? @miki @prism

in reply to Drew Mochak

@prism I think without ARIA or an equivalent (like more things built into the web platform), the web would've continued galloping forward with all the same UI widgets and design patterns but with no way to make them even halfway accessible, and we'd be left even more behind than we are now.

By contrast, I don't think the inability for a website to change the pitch of NVDA is a legitimate blocker to anything worthwhile. @Piciok @miki

in reply to James Scholes

@jscholes I have felt for a while that only having TTS for everything is pretty limiting. So, you know, I use Unspoken. Problem solved. I haven't really thought to myself, "Self, it would be great if the website author could script some nonverbal feedback for me instead of what I am currently hearing," or anything like that. So this may well be a solution in search of a problem.
@Piciok @miki
in reply to Drew Mochak

@prism @jscholes @miki I don't see the point because everyone has different ways they like to hear things. People choose the verbosity and speech options that work for them and to have something override that would be irritating. I also feel that this is part of a larger conversation about the perceived need for sighted people to feel like our experience of the web is vastly different. This is why we have a lot of unnecessary context already and here is another example.
in reply to Mikołaj Hołysz

@silverleaf57 @prism @jscholes I, for one, would certainly appreciate it if I could hear exactly which parts of a line of code have "red squiggles" under them, preferably with different styles for errors and warnings. This is something sighted people have. Visual Studio Code solves this with audio cues, but those are per line, not per character range.
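In CSS Speech terms, I imagine it could look something like this (class names and sound files are made up, and it assumes the editor wraps flagged character ranges in spans):

/* hypothetical spans around flagged character ranges */
span.squiggle-error {
  cue-before: url(error-blip.wav);
  voice-pitch: low;
}

span.squiggle-warning {
  cue-before: url(warning-blip.wav);
  voice-pitch: high;
}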
in reply to Mikołaj Hołysz

@miki I think it's a trap to suggest that such problems should currently be solved only through speech properties and auditory cues within individual apps. Expressive semantics on the web have only been explored at a surface level so far, and it's a complete stretch to go from "We don't have the ARIA properties to convey complex information," to "Let's have every application implement its own beeps and boops."

Imagine having to learn the sound scheme for Gmail, then Outlook, then Thunderbird. Then going over to Slack, which also has unread state (albeit for chat messages rather than emails) but uses an entirely different approach again.

All the while, braille users are getting nothing, and people who struggle to process sounds alongside speech are becoming more and more frustrated. Even if we assume that this is being worked on in conjunction with improvements to ARIA and the like, how many teams have the bandwidth and willingness to implement more than one affordance?

We've already seen this in practice: ARIA has braille properties, but how many web apps use them? Practically none, because getting speech half right and giving braille users an even more subpar experience is easier. Your own example highlights how few apps currently let you control things like verbosity and ordering of information.

CSS Speech could turn out even worse. A product team might opt to implement it instead of semantics because the two blind people they spoke to said it would work for them, and never mind the other few million for whom it doesn't. They'll be the people complaining that there's no alternative to the accessibility feature a team spent a month on and thought was the bee's knees.

@silverleaf57 @prism @Piciok

in reply to Mikołaj Hołysz

@miki There is much shared (or adjacent) iconography in the world, with a lot more power and opinion behind it than the sounds for a web app are going to get. Despite that, icon fatigue is a real and common user complaint; it seems bizarre to be leaning into such an issue purely in the name of equity. @silverleaf57 @prism @Piciok
in reply to James Scholes

@jscholes @silverleaf57 @prism Efficiency, not equity.

Words are a precious resource, far more precious than even screen real estate. After all, you can only get a fairly limited number of them through a speaker in a second. We should conserve this resource as much as we can. That means using as many other "side channels" as we can get: sounds, pitch changes, audio effects, stereo panning (when available) and much more.

Icon fatigue is real. "Me English bad, me no know what delete is to mean" is also real, and icons, pictograms and other kinds of pictures are how you solve that problem in sighted land.

Obviously, removing all labels and replacing them with pictograms is a bad idea. Removing all icons and replacing them with text... is how you get glorified DOS UIs with mouse support, and nobody uses those.

in reply to Mikołaj Hołysz

@jscholes @silverleaf57 @prism Everything said above also applies to braille; braille cells are even more precious than words from a speaker. It's a shame that we can abbreviate "main landmark heading level 2" to something more sensible, but we can't abbreviate "unread pinned has attachment overdue" if those labels are not "blessed" by some OS accessibility API.
in reply to James Scholes

@miki Note that I'm specifically responding to your proposed use case here. You want beeps and boops, and I think you should have them. But:

1. I think you should have them in a centralised place that you control, made possible via relevant semantics.

2. I don't think the fact that some people like beeps and boops is a good reason to prioritise incorporating beeps and boops into the web stack in a way that can't be represented via any other modality.

@silverleaf57 @prism @Piciok

in reply to James Scholes

@jscholes @silverleaf57 @prism Centralized beeps and boops don't make much sense to me. Each app needs a different set; just consider important items in a list. That can mean "overdue", "signature required", "has unresolved complaints", "student not present", "compliance certification not granted" or something entirely different. We can't expect screen readers to have styles for all of these, just as we can't expect browsers to ship icons for all of these.
in reply to Mikołaj Hołysz

@miki Sure. Or it can just mean "important" in a domain-specific way that's shared across apps in that domain. We should be taking advantage of that to make information presentation and processing more streamlined, before inventing an entirely new layer and interaction paradigm that hasn't been user tested and will require text alternatives anyway. @silverleaf57 @prism @Piciok
in reply to James Scholes

@miki As noted, I think people who can process a more efficient stream of information should have it available to them. That could be through a combination of normalised/centralised semantics, support for specialised custom cases, and multi-modal output.

My main concern remains CSS Speech being positioned as the only solution to information processing bottlenecks, which I think is a particularly narrow view and will make things less accessible for many users rather than more.

Good discussion, thanks for chatting through it. @silverleaf57 @prism @Piciok

in reply to James Scholes

@jscholes At the same time, I think the chances that CSS Speech completely takes over the industry and we all stop doing text role assignments are quite low.
explainxkcd.com/wiki/index.php…

So I am decidedly meh about this. It could help but probably won't.
@miki @silverleaf57 @Piciok

in reply to James Scholes

@jscholes @prism @miki @silverleaf57 I found the concept intriguing and am myself in two minds about it. On one hand, I wouldn't mind having the speech experience augmented by things that aren't words. I could imagine browsing a product's details page and reading about all of its features, with tiny earcons indicating whether a certain feature is supported or not rather than hearing "Yes" and "No" every time. These could even be played at the same time as the readout begins.

To be fair, I also wouldn't mind having the pronunciation of tricky words that are important for proper understanding and functioning in a domain predefined, just so I could learn it. Character and number processing might come in handy too - recently an issue was opened on the NVDA GitHub about a feature to read combinations of capital letters and digits as separate entities, for the benefit of ham radio operators and their call signs. Some kinds of numbers I also find easier to remember when they come digit by digit. The ability to define the spatial location of the voice in the stereo field could be useful for presenting spatial relationships in some advanced web apps (thinking of scientific contexts, design, web text and code editors, etc.).

As you say, however, I wouldn't expect this to be widely adopted by web devs who already struggle with the proper use of ARIA. The trade-offs could also be significant, especially if this becomes the sole way of conveying information: blind users with a profound hearing impairment missing out on crucial information because it was read out too quietly, too fast, or at a pitch in frequencies they can no longer discern; neurodivergent people confused by sudden changes and unfamiliar sounds on top of the exotic keyboard shortcut choices they already have to remember, etc. This could create a situation similar to WCAG SC 1.4.1, where colour is used as the only way of conveying information.
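Roughly what I'm picturing, in the draft's terms (all the selectors and sound files here are invented, and I haven't tried this anywhere):

/* earcons alongside "Yes"/"No" cells in a feature table */
.feature-yes { cue-before: url(tick.wav); }
.feature-no { cue-before: url(cross.wav); }

/* call signs and card numbers spoken character by character, digit by digit */
.callsign, .card-number { speak-as: spell-out digits; }

/* nudge the voice around the stereo field to hint at spatial layout */
.editor-pane-left { voice-balance: left; }
.editor-pane-right { voice-balance: right; }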
in reply to Paweł Masarczyk

This already exists, though, as a screen reader feature. Kind of. NVDA has an add-on called Unspoken that replaces the announcement of roles with various sounds - there's a different one for checked vs. unchecked boxes, for instance. JAWS did (does?) something similar with the shareable schemes in its Speech and Sounds Manager. Granted, not a lot of people do this, but the ability is there if people want it. VoiceOver, TalkBack and ChromeVox also have earcons - they're not used for this purpose, but they could be.

Having this under the user's control rather than the author's does seem better. It prevents, for instance, a developer deciding to be super obtrusive with ads. I do see the potential for it to be good - the author of the content would be able to convey more nuanced concepts... it just feels like a thing most people wouldn't use, and most of the people who'd try would end up being obnoxious about it.

@jscholes @miki @silverleaf57

in reply to Drew Mochak

@prism @jscholes @miki @silverleaf57 Yes, this is what I'm thinking too. Also, the add-ons are great - I experiment with Earcons and Speech Rules, which is another add-on with tons of customization. Bringing it in as a core feature would signal it as an industry standard, though, and from there it would be possible to explore whether any external APIs could augment it in any way.
in reply to James Scholes

@jscholes @prism @miki @silverleaf57 As for this being widely adopted, I expect some CSS properties could be mapped to aural cues at the browser level, just like some HTML elements carry implicit ARIA semantics by default. This would have to be carefully considered. Regarding sound cues: this would have to be based on some kind of familiarity principle, where the sounds are ones most users will already know or that resemble the action they are supposed to represent - think emptying the Recycle Bin on Windows. I really like the approach of JAWS representing heading levels through piano notes in C major - it sounds logical, but on the other hand not everyone is able to recognize musical notes at random.

I'm not convinced about the marketing value of this - I mean creating brand voices and so on. It sounds fun, but no more than that, at least in the screen reader context. I guess inclusion in advertising is another can of worms that might derail the discussion. I'm looking forward to NVDA finally incorporating some kind of sound scheme system, because we will then be able to talk about some kind of standard, given that JAWS and to some extent VoiceOver and TalkBack make use of that already. I guess then the discussion could evolve around this being complementary to something like aria-roledescription or aria-brailleroledescription, assigning familiar sounds and speech patterns to custom-built controls.
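For heading levels specifically, a browser-level default mapping could be as simple as something like this (the sound files are made up, and whether UA stylesheets would ever ship such cues is pure speculation on my part):

h1 { cue-before: url(note-c4.wav); }
h2 { cue-before: url(note-e4.wav); }
h3 { cue-before: url(note-g4.wav); }
h4 { cue-before: url(note-c5.wav); }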
in reply to James Scholes

@jscholes @prism @miki @silverleaf57 I think inviting @tink and @pixelate into the discussion is a great idea as they might have valuable insights on this. On a related note: something that's been running around my head is how many Emojis could be faithfully represented by sounds.
in reply to Paweł Masarczyk

@jscholes @prism @miki @silverleaf57 @tink So, I generally like beeps and boops. All shiny and stuff. But the web is made by sighted people, and they will get things wrong. I'd rather we have our own tools, like NVDA's earcons add-on, and maybe have earcon packs for it to, for example, add aural highlighting for VS Code, or make-gmail-shiny, stuff like that.