
I don't know whether there's a single common practice among multi-modal LLMs for how they encode image inputs into "vision tokens", but it basically comes down to splitting the image into a grid of regions and encoding each region as a token.
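Roughly what that looks like in PyTorch - this is a generic ViT-style patch embedding sketch for illustration, not any particular model's actual code:

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Split an image into a grid of non-overlapping patches and project
        # each patch to an embedding ("vision token").
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A conv with stride == kernel size is the same as slicing the image
            # into patches and applying a shared linear projection to each one.
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                    # x: (B, 3, H, W)
            x = self.proj(x)                     # (B, embed_dim, H/ps, W/ps)
            return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

    # 224x224 image, 16x16 patches -> a 14x14 grid = 196 vision tokens
    tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768])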

I'm not sure there's much information-theoretic intuition to be had from DeepSeek's experiments - it seems to be more about finding the lowest resolution / coarsest grid you can get away with while still capturing enough image detail to perform OCR accurately.
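The thing being traded off is the token budget: assuming one token per 16x16 patch (just for illustration), the count grows with the square of the input resolution, so shrinking the input pays off quickly if OCR accuracy holds up:

    # Token count at a few input sizes, assuming one token per 16x16 patch
    for side in (1024, 512, 256):
        print(f"{side}x{side} image -> {(side // 16) ** 2} vision tokens")
    # 1024x1024 -> 4096, 512x512 -> 1024, 256x256 -> 256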

It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.


