Idea:
Image recognition apps for the blind should run an OCR pass on the image first and include any recognized text in the prompt. The LLM should be told not to do its own OCR and to rely on the text provided.
LLMs are notoriously bad at OCR, particularly for non-English languages. To make things worse, unlike normal OCR engines, their failures don't show up as typos and garbled text, but as perfectly understandable, grammatically valid text that says something very different from what's actually in the image.
An OCR pass should help "ground" the LLMs here.
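A minimal sketch of that flow, assuming pytesseract for the OCR pass and an OpenAI-style chat-completions endpoint for the description step (the model name and exact prompt wording are illustrative, not prescriptive):

```python
# Sketch: OCR first, then ask the LLM to describe the image without doing its own OCR.
# Assumes pytesseract + Pillow are installed and an OpenAI-style vision endpoint is available.
import base64

import pytesseract
from PIL import Image
from openai import OpenAI

def describe_image(path: str, lang: str = "eng") -> str:
    # 1. OCR pass with a conventional engine.
    ocr_text = pytesseract.image_to_string(Image.open(path), lang=lang).strip()

    # 2. Build a prompt that tells the model to trust the provided text.
    instructions = (
        "Describe this image for a blind user. Do not attempt to read the text "
        "in the image yourself; assume the text below is correct and complete, "
        "except for possible typos.\n\nText found in the image:\n"
        + (ocr_text or "(none)")
    )

    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```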
Mikołaj Hołysz, in reply to Jamie Teh:
@jcsteh @vick21 1. Do OCR. Tesseract or whatever; there are plenty of good models.
2. Dear LLM, please describe this image. Do not try to perform text recognition; assume that the text provided below is correct and complete, except for possible typos.
You could also use this technique when fine-tuning, giving the model images with lots of foreign-language text but with the correct text already provided in the prompt.
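That fine-tuning variant could be expressed as training records like the following (a hypothetical JSONL-style layout; the field names and example data are illustrative, not any particular framework's schema):

```python
# Sketch: build a fine-tuning record that pairs an image with ground-truth text
# in the prompt, so the model learns to rely on provided text rather than its own OCR.
# The record layout is hypothetical, not a specific framework's format.
import json

def make_training_record(image_path: str, ground_truth_text: str, description: str) -> str:
    prompt = (
        "Describe this image. Do not try to perform text recognition; assume the "
        "text provided below is correct and complete.\n\n"
        f"Text in the image:\n{ground_truth_text}"
    )
    return json.dumps({
        "image": image_path,        # e.g. a photo containing foreign-language signage
        "prompt": prompt,           # ground-truth text injected up front
        "completion": description,  # the description the model should learn to produce
    }, ensure_ascii=False)

# Hypothetical example: a street-sign photo with Polish text the model should not re-read.
print(make_training_record(
    "signs/ul_marszalkowska.jpg",
    "ul. Marszałkowska 10",
    'A blue street sign on a brick wall reading "ul. Marszałkowska 10".',
))
```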