LLaVA-1.5 is an open-ish AI model which can provide image descriptions and allow follow-up interaction, akin to Be My AI. The best part is that you can run it locally on your computer if you have an appropriate GPU... or very, very slowly if you want to use your CPU. I thought it'd be cool to hook it up to #NVDASR so you can get image descriptions for the current navigator object and then ask follow-up questions. So, I wrote an NVDA add-on to do just that using llama.cpp. https://github.com/jcsteh/nvda-llamaCpp
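For anyone curious about what's happening under the hood, here's a rough sketch (not the add-on's actual code; the file name and prompt are just placeholders) of the kind of request it sends to a local llama.cpp server. It assumes the server was started with a LLaVA model plus its mmproj and is listening on the default port 8080:

import base64
import json
import urllib.request

# Base64 encode the image we want described.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    # [img-10] marks where the server should substitute image id 10 into the prompt.
    "prompt": "USER:[img-10]Describe this image in detail.\nASSISTANT:",
    "image_data": [{"data": image_b64, "id": 10}],
    "n_predict": 256,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])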


in reply to Jamie Teh

Interesting; I had to look up a few things (NVDA), but this is a good use of AI and GPU power to increase accessibility, something you wouldn't normally think of as a natural combination.
in reply to Jamie Teh

The accuracy of this model (specifically, the 7 billion parameter version with 4 bit quantisation) isn't stunning, but it's still pretty impressive given that it's running locally. There may also be things I can tweak to improve that; I'm very inexperienced with AI stuff.
in reply to Jamie Teh

You should try Bakllava with q8 quantization for better accuracy. Bakllava is a Mistral 7B base augmented with the LLaVA 1.5 architecture. They also have a q4 version if q8 is too slow. https://gist.githubusercontent.com/chigkim/dc9e1b2f8766150499c9510d2934881d/raw/7cd9f0d401c06ee97fdc68f203d7cf314c957678/bakllava.txt
in reply to Chi Kim

@chikim Intriguing. I haven't tried the q8 or 13b version of LLaVA yet either. I notice here that you are using the q8 model with the f16 mmproj. I assumed the quantisation had to match for the main model and the mmproj, but clearly not? Is there a reason you mismatch them here?
in reply to Jamie Teh

As far as I know, all quantized variants use the same f16 mmproj. The 7b and 13b versions do need their own matching mmproj, though.
in reply to Chi Kim

@chikim Ah interesting. LLaVA has q4, q8 and f16 mmproj, so I went with q4 to match.
I've read that the quantisation matters less for accuracy than the number of parameters, but I think it depends on... a lot of things.
in reply to Jamie Teh

Oh wait, really? They have a separate mmproj for each quantization? Maybe Bakllava didn't want to do that, so they just use f16. lol Yeah, I think there's a slight improvement between higher and lower quantization. You're right that, in general, 13b q4 will be better than 7b q8 if you compare the same model with different numbers of parameters.
in reply to Jamie Teh

I've been waiting for this; I can't thank you enough. Do you happen to have any ideas, if I may ask, about binaries for use on a CPU? You mention that it can be used slowly on one, but the GitHub page just shows the Nvidia binaries. Thanks again.
in reply to techsinger

@techsinger It honestly isn't worth it unfortunately; it'll take over a minute to answer each query. But if you do want to try, you could just use llamafile-server, which is a single binary. Download this and rename it to server.exe, then follow the rest of the instructions in my readme, skipping anything related to the zip files. https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
in reply to Jamie Teh

Oh I see, thanks. When you said slow, you really meant it :) Seriously, thanks again both for this last piece of info and for putting this add-on together.
in reply to Jamie Teh

Sounds interesting; I'm just building it myself right now. Are you not shipping the models with the add-on because of their size, or is there another specific reason? Oh, and NVDA+Shift+L is the Golden Cursor shortcut for saving a mouse position; it might be worth changing that one, as Golden Cursor is pretty common these days, I think.
in reply to Toni Barth

@ToniBarth They're huge, 4 GB+. I don't want to host that, and it seems kinda pointless given that there's still some technical messing around required: figuring out whether you have the right GPU, etc. If you try to do this on CPU, it'll take over a minute to answer each query. I can point you to CPU binaries if you can't run it on GPU and still want to try it, though.
in reply to Jamie Teh

I tried building this on WSL, and the server is running, but I never get anything back from NVDA. My fan speeds up, I see it's processing an image, and then it releases the tokens in the cache. Nothing that I can see in my log viewer...
in reply to Mike Wassel

@blindndangerous If you're running it on CPU, it'll take over a minute to respond to queries. But otherwise, I'm not sure why it would be failing. Does it say anything about how many tokens are in the cache when it releases the slot?
in reply to Mike Wassel

@blindndangerous Hmm. Do you see anything in the output about encoding images?
slot 0 - encoding image [id: 10]
in reply to Jamie Teh

The size is weird; I just grabbed something. But the same thing happens if it says 1920x1080.
slot 0 - image loaded [id: 10] resolution (38 x 22)
slot 0 is processing [task id: 4]
slot 0 : kv cache rm - [0, end)
slot 0 - encoding image [id: 10]
{"timestamp":1701472031,"level":"INFO","function":"log_server_request","line":2601,"message":"request","remote_addr":"127.0.0.1","remote_port":33884,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3 tokens in cache)
in reply to Mike Wassel

@blindndangerous The previous add-on build would have timed out after 10 seconds, though you definitely should have seen an error in the NVDA log in that case. I pushed another build which increases the timeout to 3 minutes.
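For the curious, the fix is essentially just a longer timeout on the HTTP request to the local server. A rough sketch of the idea (assumed names, not the add-on's exact code):

import urllib.request

REQUEST_TIMEOUT_SECS = 180  # CPU inference can take well over a minute per query

def post_completion(request: urllib.request.Request) -> bytes:
    # Wait up to 3 minutes instead of the old 10 seconds before giving up.
    with urllib.request.urlopen(request, timeout=REQUEST_TIMEOUT_SECS) as response:
        return response.read()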