VScan is my little research project exploring whether large language models (LLMs) with vision support could be used by blind people as a tool for performing visual cognitive tasks, and how useful they could be while traveling and in day-to-day life. Our world is built on visual communication. Eyesight is a very rich and detailed sense, but most of the time, people are not scanning their field of view pixel by pixel; rather, they are purposefully looking for specific items or events in the environment, like a directional sign, the line number printed on an incoming bus, or their train appearing on the announcement board of a train station. Layouts of buildings and open spaces are communicated visually, contextual information is communicated visually, announcements and warnings are communicated visually.
What if we could bridge the visual communication of the world to our senses? VScan is trying to find out whether this is possible and feasible, and to what extent it would be useful. The entire project is open-source and aims to be as clear and transparent about the technology used as I can make it.
Recent months have been very important for VScan. In the new versions, you can easily use any OpenAI-protocol-compatible LLM server (including a self-hosted one) and run any common or uncommon model on it. The new system is very flexible, so you can configure everything down to the tiniest detail and easily swap the models used or their providers. At the same time, the app comes with presets, which should (hopefully) make simple things simple and complex things possible.
Since this is a pretty large change, I highly recommend reading the initial setup part of the project's readme to understand the new system. If you're updating from v0.2 or earlier, you will need to make some configuration changes.
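To give a rough idea of what "OpenAI-protocol-compatible" means in practice, here is a minimal Python sketch of the kind of chat-completions request such a server accepts for an image. The endpoint URL, model name, and file path are placeholders for whatever self-hosted or cloud provider you pick; this illustrates the underlying protocol, not VScan's own configuration format.

```python
import base64
import requests

# Hypothetical self-hosted, OpenAI-compatible endpoint (e.g. a local server).
BASE_URL = "http://localhost:11434/v1"
MODEL = "some-vision-model"  # any vision-capable model the server exposes

# Encode a captured photo as base64 so it can be embedded in the request.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the sign in this picture say?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

# Standard OpenAI-style chat completions endpoint; local servers often ignore the key.
response = requests.post(
    f"{BASE_URL}/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer placeholder"},
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```

Any provider speaking this protocol, whether a commercial API or a model running on your own machine, can be slotted in by changing the base URL and model name.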
There are several reasons why this feature is so important. It makes VScan truly universal, not just in terms of providers, but also of the underlying technology. VScan now has the freedom to leverage the capabilities of the entire LLM market instead of a single player, which opens up some very interesting potential that will be crucial for future development.
It also gives you complete control over your data, letting you decide the processing destination for every pixel you capture with your camera.
The new model system is powerful, but it's not the only new feature. There are many more, including a very convenient multipurpose text field on the scanning screen, a new action system, and a new prompt editor. Make sure to read the release notes for more details!
One more announcement: VScan has finally passed the review process for inclusion in F-Droid, and it has also been released on Google Play. You should now be able to find the app in your favourite app store.
Happy hacking!
github.com/RastislavKish/VScan
