I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...
Looks like it did really well, with the exception of the paragraph directly below the quote. It hallucinated some filler there and bridged it with the next column.
By my eye it just bridged. I didn't see any filler. It went from "Code is a language", above the quote, straight to "in a garden by name.", which was the top of the next column, but it was missing the chicken subject.
It missed the initial "A" in the text, which I sort of understand: it seems not a lot of news articles were put in the dataset. More interestingly, it missed the entire "Hallucination is a risk and...", the article "theme" next to the author name, and also the final email.
Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image: https://static.simonwillison.net/static/2025/ft.jpeg