Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (the information density of a set of 'image tokens' is much higher than that of a set of text tokens), why even bother representing text as text?"
Basically. Some people are even saying, hey, if you encode text as an image then you don’t need tokenizers any more, and you get more expressivity from the graphic styling.
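Mechanically it's not much more than this (a toy sketch with PIL and NumPy; the canvas size, default font, and 16-pixel patch size are just numbers I picked): render the string to pixels, chop the image into ViT-style patches, and those patches are your "tokens".

```python
# Toy illustration: text -> pixels -> ViT-style patches, no tokenizer involved.
# Canvas size, default bitmap font, and patch size are arbitrary choices here.
from PIL import Image, ImageDraw
import numpy as np

text = "No tokenizer needed: the model only ever sees pixels."
img = Image.new("L", (512, 32), color=255)        # white grayscale canvas
ImageDraw.Draw(img).text((4, 8), text, fill=0)    # draw the string in black

patch = 16
arr = np.asarray(img, dtype=np.float32) / 255.0   # (32, 512) array of pixel intensities
patches = arr.reshape(32 // patch, patch, 512 // patch, patch).transpose(0, 2, 1, 3)
patches = patches.reshape(-1, patch * patch)      # (64, 256): 64 flattened "image tokens"
print(patches.shape)
```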
Another takeaway is that you don't need to pass a tensor of shape (batch_size, sequence_length, d_model) through your transformer. Not every token needs its own dedicated latent embedding; you can presumably get away with dividing sequence_length by some constant factor k (sketched below).
This isn't super groundbreaking, but it does reinforce the validity of a middle ground between recurrent models, where the context is compressed into a single "memory token", and transformers, where the context is left uncompressed: for a context of n tokens compressed by a factor k, the effective length sits strictly in between, 1 < n/k < n.
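To make that n/k concrete, here's a rough compress-then-attend sketch in PyTorch (the TokenCompressor module, k = 4, and the merge-by-projection scheme are my own illustration, not how any particular paper does it): merge every k consecutive token embeddings into one latent, so attention runs over n/k positions instead of n.

```python
# Sketch: compress the sequence by a factor k before the transformer sees it.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Merges every k consecutive token embeddings into one latent vector."""
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        # Learn a projection from k stacked embeddings down to a single one.
        self.proj = nn.Linear(k * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, sequence_length, d_model); assumes sequence_length % k == 0.
        b, n, d = x.shape
        x = x.reshape(b, n // self.k, self.k * d)  # group k neighbours together
        return self.proj(x)                        # (batch_size, n // k, d_model)

d_model, k = 512, 4
compressor = TokenCompressor(d_model, k)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

tokens = torch.randn(2, 1024, d_model)   # (batch_size, sequence_length, d_model)
latents = encoder(compressor(tokens))    # attention runs over 1024 // 4 = 256 positions
print(latents.shape)                     # torch.Size([2, 256, 512])
```

With k = 4 the quadratic attention cost drops by roughly 16x, while each latent still carries information from its k source tokens rather than everything being squashed into one memory slot.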
Correct?