Short answer: just don't. Preferably provide both an HTML alternative and the LaTeX source.
PDF is essentially a vector graphics format. The ultimate goal of PDF is a document that prints and displays in exactly the same way for everybody; everything else is secondary. In HTML, the "recommended way to do things" is to essentially say "put an h1 here" and let the browser deal with it, possibly with some help from your style sheet along the way. In PDF, you essentially say "hey, here's some text, put it 2.7 inches from the left margin, 16 point, use font so and so". If you were so inclined, you could even re-order the characters in your font and use completely nonsensical codepoints, and things would still pretty much work visually.
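To make that contrast concrete, here is a minimal Python sketch. It only builds strings and does not produce a valid PDF file. The font resource name `/F1`, the baseline position, the helper names, and the toy `SCRAMBLED_FONT` table are all assumptions for illustration; the operators (`BT`, `Tf`, `Td`, `Tj`, `ET`) are the standard PDF text operators, and the position is measured from the page origin here rather than from a margin.

```python
POINTS_PER_INCH = 72  # PDF's default user-space unit is 1/72 inch


def html_heading(text: str) -> str:
    """HTML states what the text is; the browser decides how to draw it."""
    return f"<h1>{text}</h1>"


def pdf_text_ops(text: str, left_inches: float, baseline_pt: float, size_pt: float) -> str:
    """PDF states where and how to draw; nothing says this is a heading.

    /F1 is a hypothetical font resource name; a real file would have to
    define it in the page's resource dictionary.
    """
    x = left_inches * POINTS_PER_INCH
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"BT /F1 {size_pt:g} Tf {x:g} {baseline_pt:g} Td ({escaped}) Tj ET"


# Toy stand-in for a font whose character codes were shuffled:
# the code "q" draws the glyph "H", and "z" draws "i".
SCRAMBLED_FONT = {"q": "H", "z": "i"}


def glyphs_drawn(stored_codes: str) -> str:
    """What a sighted reader sees for the codes stored in the file."""
    return "".join(SCRAMBLED_FONT[code] for code in stored_codes)


print(html_heading("Introduction"))
print(pdf_text_ops("Introduction", left_inches=2.7, baseline_pt=700, size_pt=16))
print(glyphs_drawn("qz"), "is drawn, but the file stores", repr("qz"))
```

In the last line, the page looks right on screen, but anything that reads the stored codes (a screen reader, copy and paste, search) gets nonsense unless the file also carries a mapping back to real characters.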
LaTeX definitely uses shenanigans like that. Polish diacritics, for example, aren't expressed as a single character. Instead, the English letter is used, along with some extra markup that tells the renderer where to draw the acute accents on the page. From an a11y perspective, though, those acute accents aren't actually part of the character; they're just random squiggles that the renderer happens to be told to draw. Some say that modern JS frameworks are crazy; I say that PDF is far, far crazier than that.
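A rough Python sketch of why this matters for text extraction, using only the standard `unicodedata` module. The `extracted` string is an assumption about what a PDF text extractor might return when the accent is drawn as its own glyph; the exact order and spacing depend on the tool that produced the PDF and the tool that reads it.

```python
import unicodedata

precomposed = "Gda\u0144sk"    # ń as a single code point (U+0144)
combining = "Gdan\u0301sk"     # n followed by COMBINING ACUTE ACCENT (U+0301)
extracted = "Gda\u00b4nsk"     # assumed extraction: spacing ACUTE ACCENT glyph, then n


def nfc(text: str) -> str:
    """Unicode canonical composition, the usual way to merge letter + accent."""
    return unicodedata.normalize("NFC", text)


# A proper combining sequence can be repaired by normalization...
print(nfc(combining) == precomposed)   # True
# ...but a separately drawn accent glyph is just another character.
print(nfc(extracted) == precomposed)   # False
print(len(precomposed), len(extracted))
```

So a search for "Gdańsk" would not match the extracted text, and a screen reader has no reliable way to know the stray accent belongs to the following letter.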
Speaking of the two-column stuff in particular, I've seen it work and I've also seen it not work. This probably depends on where the text goes in the document, what it is rendered with, and what software you're using and what its a11y implementation is like.
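One way two-column text can go wrong, sketched in Python. This is a simplified model, not how any particular reader works: the `Fragment` type, the coordinates, and the fixed `split_x` column boundary are all assumptions. A reader that orders positioned text purely top to bottom, left to right will interleave the columns; one that knows where the column gap is can read each column in full.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # distance from the left edge of the page
    y: float   # distance from the top edge of the page
    text: str


def naive_order(fragments: list[Fragment]) -> list[str]:
    """Read strictly row by row, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments: list[Fragment], split_x: float) -> list[str]:
    """Read the whole left column, then the whole right column."""
    left = sorted((f for f in fragments if f.x < split_x), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


page = [
    Fragment(50, 100, "Left line 1"),
    Fragment(320, 100, "Right line 1"),
    Fragment(50, 115, "Left line 2"),
    Fragment(320, 115, "Right line 2"),
]

print(naive_order(page))
print(column_order(page, split_x=300))
```

Real documents are messier (full-width titles, figures, footnotes), which is part of why results vary between readers.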
Yes, there's a way to mark PDFs up for accessibility properly, but very few people do it, LaTeX makes it far harder, there are a lot of other problems (think math), and support among reading programs is... spotty at best.
@miki For the record, I am using #TeXLaTeX to create both the EPUB and the PDF versions of my books, although my requirements are nowhere near those of academic papers. But it should be doable.
As an experiment I downloaded the two-column PDF of this new paper from Google Research, "SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL" research.google/pubs/sql-has-p…
... and uploaded it to Google AI Studio and told Gemini Pro 1.5 "Convert this document to neatly styled semantic HTML" - and the results were pretty good! static.simonwillison.net/stati…
I'd be really worried about both hallucination and prompt injection when using an LLM for document conversion, as an accessibility tool for blind or other disabled users. But the tools I've tried on this paper do worse than what you got out of Gemini.
@matt yeah, me too. The responsible way to do this would be to use Gemini Pro to create the first draft, then spend significant time and effort checking and verifying it, iterating on the prompts, porting across the figures, etc.
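As one small aid for that checking step, here is a crude Python sketch that flags words appearing in the generated HTML but nowhere in text extracted from the original PDF. The function names and the idea of comparing word sets are my own assumptions, not anything from the tools mentioned above. It catches only added words, not omissions, reordering, or changed numbers, so it is no substitute for reading the result.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words, punctuation ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def words_not_in_source(source_text: str, generated_html: str) -> list[str]:
    """Words the conversion introduced that the source text never used."""
    collector = _TextCollector()
    collector.feed(generated_html)
    collector.close()
    return sorted(words(" ".join(collector.chunks)) - words(source_text))


source = "SQL has problems. We can fix them."
converted = "<h1>SQL has problems</h1><p>We can easily fix them.</p>"
print(words_not_in_source(source, converted))
```

Every flagged word still needs a human to decide whether it is a harmless artifact or invented content.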
Posts above, in order:
Mikołaj Hołysz, in reply to Simon Willison
Jürgen Hubert, in reply to Mikołaj Hołysz
Simon Willison, in reply to Simon Willison
Matt Campbell, in reply to Simon Willison
Simon Willison, in reply to Matt Campbell