Idea:

Image recognition apps for the blind should do an OCR pass on the image first and include any recognized text in the prompt. The LLM should be told not to do its own OCR and to rely on the provided text instead.

LLMs are notoriously bad at OCR, particularly for non-English languages. To make things worse, unlike normal OCR engines, their failures don't show up as typos and garbled text, but as perfectly understandable, grammatically valid text that says something very different from what's actually in the image.

An OCR pass should help "ground" the LLMs here.

in reply to victor tsaran

@vick21 Cases like this are where it would make a lot of sense to combine various different models and tools, but I don't quite understand how and where that handoff would occur in the case of an LLM. It's just too much of a black box. That said, my understanding of these things is minimal at best.
in reply to Jamie Teh

@jcsteh @vick21 1. Do OCR. Tesseract or whatever; there are good engines aplenty.

2. Dear LLM, please describe this image. Do not try to perform text recognition; assume that the text provided below is correct and complete, except for possible typos.
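
Concretely, the two steps might look like this. This is a minimal sketch, assuming pytesseract for the OCR pass and the OpenAI Python client for the vision call; the model name, language packs, and the describe_image helper are illustrative choices, not anything prescribed in the thread:

```python
import base64

import pytesseract
from PIL import Image
from openai import OpenAI


def describe_image(path: str) -> str:
    # Step 1: a conventional OCR pass. Tesseract supports many languages;
    # "pol+eng" is just an example combination of installed language packs.
    ocr_text = pytesseract.image_to_string(Image.open(path), lang="pol+eng")

    # Step 2: hand the LLM both the image and the OCR output, and tell it
    # not to attempt its own text recognition.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Please describe this image. Do not try to perform text "
                    "recognition; assume that the text provided below is "
                    "correct and complete, except for possible typos.\n\n"
                    f"OCR text:\n{ocr_text}"
                )},
                # Assumes a PNG; adjust the MIME type for other formats.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Sending the OCR text in the same message as the image keeps the grounding instruction next to the text it refers to; how faithfully the model actually obeys the "don't do your own OCR" directive is a separate question.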

You could also use this technique when fine-tuning, giving the model images with lots of foreign-language text but with the correct text already provided in the prompt.

in reply to Mikołaj Hołysz

@vick21 It could still mutate the provided text on output as a hallucination, though. Also, I wonder how well this would work where the text isn't in a single block, or where OCR doesn't get the order right but the order could easily have been inferred by the LLM. Restaurant menus are a good example of both cases.