in reply to NV Access

Seriously though, we know the inbuilt image description is not as good as, say, the AI Content Describer or CloudVision add-ons - it is local and on-device, so it won't be. If you use those add-ons, this won't replace them. We are designing the feature so the model can be swapped for newer and better ones in the future (if you know of one that works locally and on-device and is better, please let us know). It IS working even with imperfect descriptions, so it's a start!
in reply to NV Access

It's not a matter of replacing add-ons, but of how beneficial this feature actually is. While it's perfectly understandable that the smallest models are the best fit so that everyone can use the image recognition NVDA provides, people's feedback suggests the current model is failing to provide even minimally helpful information. Perhaps implementing a way to plug different models into the screen reader for image recognition would be a viable option. Otherwise, I personally can't see any reason to include this in the upcoming 2026.1 version. As another example, this is what I get when I scan the input box I'm writing this post in: "a black and white photo of a wooden wall".
in reply to Kianoosh Shakeri

Ok, so firstly: it definitely doesn't work on text - use NVDA+r for that. When I tried it on images, it gave me something useful around 2/3 of the time. Not ideal, but better than the 0% you currently get on images with no alt text. We are working on making the model swappable - that was always the plan. One thing to note: the feature itself DOES work, which is good to know - and we hear you loud and clear that the model is imperfect.
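For anyone curious what "swappable" could mean in practice, here's a rough, purely illustrative Python sketch of that kind of pluggable interface. To be clear, this is not NVDA's actual code - the names (ImageCaptioner, TinyLocalCaptioner, getCaptioner) are all hypothetical:

```python
# Hypothetical sketch of a swappable on-device captioner interface.
# None of these names come from NVDA's real source; this only shows
# the general pattern of decoupling the feature from one fixed model.
from abc import ABC, abstractmethod


class ImageCaptioner(ABC):
    """Abstract interface so the underlying model can be replaced."""

    @abstractmethod
    def describe(self, imageBytes: bytes) -> str:
        """Return a short caption for the image, entirely on-device."""


class TinyLocalCaptioner(ImageCaptioner):
    """Stand-in for whatever small local model ships by default."""

    def describe(self, imageBytes: bytes) -> str:
        # Run inference with the bundled model (omitted in this sketch).
        ...


def getCaptioner() -> ImageCaptioner:
    # A future version could consult a user setting here and return a
    # different implementation, letting better models drop in later.
    return TinyLocalCaptioner()
```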
in reply to NV Access

My intention was for it to detect that I'm simply inside an application window; even detecting which element had focus didn't matter - we're talking about image captioning after all, not full-on image description. Of course the model isn't expected to be perfect - what model is perfect today anyway! 😊
That being said, there are surely factors that can hinder the model's understanding of the displayed visuals - display settings such as screen brightness and even resolution could profoundly affect the output. I'll tweak what I can to test the model's performance, and I'll let you know if I find anything useful, so it can hopefully be included in the corresponding documentation.
Thank you for the hard work! 🙏