

From the double tap team. Maybe we shouldn’t advocate for alt text youtube.com/watch?v=CHCK43D0jJ… #Blind #AltText #Accessibility #A11y #AIHype #AI
This entry was edited (6 months ago)
in reply to Robert Kingett backup

One thing they didn't mention which I'm kind of surprised the other guy didn't bring up is the issue of reliability. We all know AI has rather a complicated relationship with the truth. That may not really be an issue for social media posts as such, and for what it's worth I do actually agree with the main thrust of the argument, but the fact remains that we can't completely, 100% rely upon these things until AI companies figure out how to make them understand the difference between fact and fiction.
in reply to Haily Merry

That will never happen because of how these LLMs are designed. They don’t think, so they will never be able to do this.
in reply to Robert Kingett backup

You're looking at it far too simplistically. What they do essentially is recognise patterns, which is more or less what we ourselves do when learning languages. What they don't have however is any sort of frame of reference through which to understand context, which we get from everyday lived experience. Context sizes in these models are getting larger and larger, so maybe this problem will more or less take care of itself in a few years, particularly as new training data comes along. Of course it's probably also the case that large language models are only the beginning in terms of AI technology, who knows where we'll be in a decade even.
in reply to Haily Merry

Because they generate output from training data input, they will never be able to ascertain truth because, to even get that output, they need to compare other training data that was imported into them. No two images are going to be exactly alike. Even an image that is taken three minutes after the initial image is gonna be slightly different, even though it might appear to be the same angle. The LLM will extract likenesses from all the images, and it will compare those likenesses, but it’s never going to understand what’s in the image and what’s not in the image. It literally can’t reason like that.
in reply to Robert Kingett backup

Just ask yourself how you arrive at literally any conclusion. Usually, you'll draw upon your own lived experiences, lessons you've learned in school or elsewhere, etc., and the conclusion you come to will be influenced by all of those things. AI has serious blind spots right now in large part because it can't draw upon past mistakes. This is part of the problem I have with this debate honestly, many people just seem to fundamentally misunderstand even their own minds and just assume there's something undefinable but utterly unique about the human condition which means that AI will never be able to compete on the same level. Music is perhaps the easiest case in point, people were saying AI couldn't do it for years, yet now it can, and it can do it rather well too, you should see what Suno is like these days.
in reply to Robert Kingett backup

This is rather missing the point of my entire argument. I’m not saying that the human brain works like a computer, in point of fact we still don’t really know a whole lot about how the human brain works, but in any case you can’t really compare a living organism with something man-made. The point I’m making is that the process by which we learn and the process by which AI learns are much more similar than you’d probably like to admit.
in reply to Haily Merry

@weirdwriter@tweesecake While I am on Robert's side here, I think you made many good points. Treating LLMs as the entirety of AI can be reductive, since symbolic computing and other strands of the cognitive sciences all contribute to this huge field. Within a few years, this tech will probably be a good tool, with its own flaws, for assisting us in specific, measurable tasks. I am just worried that AI as an academic field will be narrowed by the profit-making intent of capitalism.
in reply to Kaveinthran

Even with LLMs, we need to look beyond the current training data. Most of it right now is web-scraped images, most of which lack good alt text, and it has very little coverage of Asian contexts and cultures or of nonhuman subjects like various animals, plants, etc., so it is already hitting its peak. So much of the funding is concentrated on keeping the LLM game going, and it's not good to begin with.
in reply to Kaveinthran

Other strands of AI need to be expanded, and this process of converting research into capital needs to stop if we want a truly intelligent system that can contribute to our understanding of complex systems, cognition and how intelligence works. It's sad when a field that is full of potential becomes a hyped product.
I follow the AI space from both the developer and the academic side, and I feel we need more nuanced, grey-hat arguments about the potential of AI.
in reply to Kaveinthran

Let's steer the discussion very narrowly to the task at hand, which is image description.
Vision language models are trained generatively to be good at multiple image evaluation tasks; describing an image is only one of them. By listening to the Talk Description to Me podcast, I learnt how much expertise goes into describing a scene well. I don't think a big company deploying a vision model wants to train one that does only that job best.
in reply to Kaveinthran

What we need is a good vision language model that can be fine-tuned on thousands of human-curated Q&A pairs of Blind people querying about an image. It needs to be additionally refined with many millions of pairs of poor images, good images and descriptions.
This is to let the model learn many facets of one image, which may come from a phone camera, poor lighting, etc.
We also need an eval system that is specific to alt text writing, or to the task of describing images to Blind people.
in reply to Kaveinthran

From what I am seeing right now, @letsenvision is the only research and AT company that can undertake this stuff.
Like it or not, at the moment vision language models are just average at describing images, and they are at their worst when describing bad images. The overrepresentation of filler words, descriptives, etc. makes the technology overhyped. Until we have a blind-focused AI eval on image description, we are only at step 1 of the development process.
in reply to Kaveinthran

I feel like many players are steering users into territories that need less accountability.
Like, giving AI a personality? I don't think we need personality, we need our work to be done! We need more autonomy to steer the system to describe more granularly, sometimes very short, sometimes longer but meaningful.
The personality aspect, while it sounds like a system prompt, just adds lots of unneeded stuff. I guess we do not need a chatbot to begin with!
in reply to Kaveinthran

As you rightly pointed out, audio and speech AI research has progressed much more strongly and crisply. It is a very developed and truly underappreciated space, and I guess it has a lot of value for both research and daily use.
in reply to Kaveinthran

I feel weird for saying this, but I honestly wonder if Meta might just be our best hope here. For whatever reason, they seem to have committed themselves to a path of open source AI models at this point, and people are already doing amazing things with it. Given how quickly these things have been optimised to run on consumer hardware, I can only imagine this process will become even more efficient, or the baseline of processors will become more and more powerful as we transition away from X86 / 64. A company with the time / resources to dedicate to a project like the one you have outlined may well find itself on fertile ground very soon, if it isn’t already.
in reply to Haily Merry

I am not sure a big company would do this well. I am more open to a small company like Envision, a research center, or even a community-centered approach to this.
We first need constitution-like principles to ground this vision AI,
and that by itself is complicated. We need AI built from the ground up that knows how to describe an image to a blind person just like an expert audio and image describer.
in reply to Kaveinthran

A small local model that is built just to describe images, and fine-tuned with human-curated data contributed by Blind, Low Vision and human describers.
The AI principles or constitution are there to guide the fine-tuning process, like be helpful, be brief, be concise, but this is just a watered-down example.
To create a constitution, we need to answer questions like: what does a good description entail? How to describe clothing? How to describe humans?
in reply to Kaveinthran

Other questions can be more like: what is important? What is less important? How to contextualise better? What are the guidelines for perceiving an image across various typologies? How to frame human visual standards?
See the complication?
We as humans do all this in a magical second, and the copious data that we have now does not account for even a little of that complexity!
The AI systems that we have now are too general. If we need a true describer, I feel we need to work for it as a community!
in reply to Kaveinthran

I guess none of these insights are original. I thought about it more after listening to the Talk Description to Me podcast, shout out to @ChristineMalec
in reply to Kaveinthran

The problem is that you need a pretty huge context size in order to do what you want to do, I’m actually not sure what the hardware requirements for large context models look like at the moment. I know they were pretty bad a year ago, but these things move quickly so I’d have to check at this point.