LLaVA-1.5 is an open-ish AI model which can provide image descriptions and allow follow-up interaction, akin to Be My AI. The best part is that you can run it locally on your computer if you have an appropriate GPU... or very, very slowly if you want to use your CPU. I thought it'd be cool to hook it up to #NVDASR so you can get image descriptions for the current navigator object and then ask follow-up questions. So, I wrote an NVDA add-on to do just that using llama.cpp. https://github.com/jcsteh/nvda-llamaCpp
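For anyone curious about what's happening under the hood, here's a rough sketch (not the add-on's actual code; the file name and prompt are just placeholders) of the kind of request it sends to a local llama.cpp server. It assumes the server was started with a LLaVA model plus its mmproj and is listening on the default port 8080:

import base64
import json
import urllib.request

# Base64 encode the image we want described.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    # [img-10] marks where the server should substitute image id 10 into the prompt.
    "prompt": "USER:[img-10]Describe this image in detail.\nASSISTANT:",
    "image_data": [{"data": image_b64, "id": 10}],
    "n_predict": 256,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])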


in reply to Jamie Teh

Interesting; I had to look up a few things (NVDA), but this is a good use of AI and GPU power to increase accessibility, something you wouldn't normally think of as a natural combination.
in reply to Jamie Teh

The accuracy of this model (specifically, the 7 billion parameter version with 4 bit quantisation) isn't stunning, but it's still pretty impressive given that it's running locally. There may also be things I can tweak to improve that; I'm very inexperienced with AI stuff.
in reply to Jamie Teh

You should try Bakllava with q8 quantization for better accuracy. Bakllava is a Mistral 7B base augmented with the LLaVA 1.5 architecture. They also have a q4 version if q8 is too slow. https://gist.githubusercontent.com/chigkim/dc9e1b2f8766150499c9510d2934881d/raw/7cd9f0d401c06ee97fdc68f203d7cf314c957678/bakllava.txt
in reply to Chi Kim

@chikim Intriguing. I haven't tried the q8 or 13b version of LLaVA yet either. I notice here that you are using the q8 model with the f16 mmproj. I assumed the quantisation had to match for the main model and the mmproj, but clearly not? Is there a reason you mismatch them here?
in reply to Jamie Teh

As far as I know, all quantized variants use the same f16 mmproj. The 7b and 13b versions do need their own matching mmproj, though.
in reply to Chi Kim

@chikim Ah interesting. LLaVA has q4, q8 and f16 mmproj, so I went with q4 to match.
I've read that the quantisation matters less for accuracy than the number of parameters, but I think it depends on... a lot of things.
in reply to Jamie Teh

Oh wait, really? They have a separate mmproj for each quantization? Maybe Bakllava didn't want to do that, so they just use f16. lol Yeah, I think there's a slight improvement between higher and lower quantization. You're right that, in general, 13b q4 will be better than 7b q8 if you compare the same model with different numbers of parameters.
in reply to Jamie Teh

I've been waiting for this; I can't thank you enough. Do you happen to have any ideas, if I may ask, about binaries for use on a CPU? You mention that it can be used slowly on one, but the GitHub page just shows the Nvidia binaries. Thanks again.
in reply to techsinger

@techsinger It honestly isn't worth it unfortunately; it'll take over a minute to answer each query. But if you do want to try, you could just use llamafile-server, which is a single binary. Download this and rename it to server.exe, then follow the rest of the instructions in my readme, skipping anything related to the zip files. https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
in reply to Jamie Teh

Oh I see, thanks. When you said slow, you really meant it :) Seriously, thanks again both for this last piece of info and for putting this add-on together.
in reply to Jamie Teh

Sounds interesting; I'm just building it myself right now. Are you not shipping the models with the add-on because of their size, or is there another specific reason? Oh, and NVDA+Shift+L is the Golden Cursor shortcut for saving a mouse position; it might be worth changing that one, as Golden Cursor is pretty common these days, I think.
in reply to Toni Barth

@ToniBarth They're huge, 4 GB+. I don't want to host that, and it seems kinda pointless given that there's still some technical messing around required: figuring out whether you have the right GPU, etc. If you try to do this on CPU, it'll take over a minute to answer each query. I can point you to CPU binaries if you can't run it on GPU and still want to try it, though.
in reply to Jamie Teh

I tried building this on WSL, and the server is running, but I never get anything back from NVDA. My fan speeds up, I see it's processing an image, and then it releases the tokens in the cache. Nothing that I can see in my log viewer...
in reply to Mike Wassel

@blindndangerous If you're running it on CPU, it'll take over a minute to respond to queries. But otherwise, I'm not sure why it would be failing. Does it say anything about how many tokens are in the cache when it releases the slot?
in reply to Mike Wassel

@blindndangerous Hmm. Do you see anything in the output about encoding images?
slot 0 - encoding image [id: 10]
in reply to Jamie Teh

The size is weird; I just grabbed something. But the same thing happens if it says 1920x1080.
slot 0 - image loaded [id: 10] resolution (38 x 22)
slot 0 is processing [task id: 4]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 10]
{"timestamp":1701472031,"level":"INFO","function":"log_server_request","line":2601,"message":"request","remote_addr":"127.0.0.1","remote_port":33884,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3 tokens in cache)
in reply to Mike Wassel

@blindndangerous The previous add-on build would have timed out after 10 seconds, though you definitely should have seen an error in the NVDA log in that case. I pushed another build which increases the timeout to 3 minutes.
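For the curious, the fix is essentially just a longer timeout on the HTTP request to the local server. A rough sketch of the idea (assumed names, not the add-on's exact code):

import urllib.request

REQUEST_TIMEOUT_SECS = 180  # CPU inference can take well over a minute per query

def post_completion(request: urllib.request.Request) -> bytes:
    # Wait up to 3 minutes instead of the old 10 seconds before giving up.
    with urllib.request.urlopen(request, timeout=REQUEST_TIMEOUT_SECS) as response:
        return response.read()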