Idea:

Image recognition apps for the blind should do an OCR pass on the image first and include any recognized text in the prompt. The LLM should be told not to do its own OCR and to rely on the provided text instead.

LLMs are notoriously bad at OCR, particularly for non-English languages. To make things worse, unlike normal OCR engines, their failures don't show up as typos and garbled text, but as perfectly understandable, grammatically valid text that says something very different from what's actually in the image.

An OCR pass should help "ground" the LLMs here.

in reply to victor tsaran

@vick21 Cases like this are where it would make a lot of sense to combine various different models and tools, but I don't quite understand how and where that handoff would occur in the case of an LLM. It's just too much of a black box. That said, my understanding of these things is minimal at best.
in reply to Jamie Teh

@jcsteh @vick21 1. Do OCR. Tesseract or whatever; there are good engines aplenty.

2. Dear LLM, please describe this image. Do not try to perform text recognition; assume that the text provided below is correct and complete, except for possible typos.
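
Concretely, the two steps might look like this. This is a minimal sketch, assuming pytesseract for the OCR pass and the OpenAI Python client for the vision call; the model name, language packs, and the describe_image helper are illustrative choices, not anything prescribed in the thread:

```python
import base64

import pytesseract
from PIL import Image
from openai import OpenAI


def describe_image(path: str) -> str:
    # Step 1: a conventional OCR pass. Tesseract supports many languages;
    # "pol+eng" is just an example combination of installed language packs.
    ocr_text = pytesseract.image_to_string(Image.open(path), lang="pol+eng")

    # Step 2: hand the LLM both the image and the OCR output, and tell it
    # not to attempt its own text recognition.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Please describe this image. Do not try to perform text "
                    "recognition; assume that the text provided below is "
                    "correct and complete, except for possible typos.\n\n"
                    f"OCR text:\n{ocr_text}"
                )},
                # Assumes a PNG; adjust the MIME type for other formats.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Sending the OCR text in the same message as the image keeps the grounding instruction next to the text it refers to; how faithfully the model actually obeys the "don't do your own OCR" directive is a separate question.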

You could also use this technique when fine-tuning, giving the model images with lots of foreign-language text but with the correct text already provided in the prompt.

in reply to Mikołaj Hołysz

@vick21 It could still mutate the provided text on output as a hallucination, though. Also, I wonder how well this would work where the text isn't in a single block, or where OCR doesn't get the order right but the order could easily have been inferred by the LLM. Restaurant menus are a good example of both cases.