Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (the information density of a set of 'image tokens' is much higher than that of a set of text tokens), why even bother representing text as text?"
Basically. Some people are even saying, hey, if you encode text as an image then you don’t need tokenizers any more, and you get more expressivity from the graphic styling.
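Mechanically it's not much more than this (a toy sketch with PIL and NumPy; the canvas size, default font, and 16-pixel patch size are just numbers I picked): render the string to pixels, chop the image into ViT-style patches, and those patches are your "tokens".

```python
# Toy illustration: text -> pixels -> ViT-style patches, no tokenizer involved.
# Canvas size, default bitmap font, and patch size are arbitrary choices here.
from PIL import Image, ImageDraw
import numpy as np

text = "No tokenizer needed: the model only ever sees pixels."
img = Image.new("L", (512, 32), color=255)        # white grayscale canvas
ImageDraw.Draw(img).text((4, 8), text, fill=0)    # draw the string in black

patch = 16
arr = np.asarray(img, dtype=np.float32) / 255.0   # (32, 512) array of pixel intensities
patches = arr.reshape(32 // patch, patch, 512 // patch, patch).transpose(0, 2, 1, 3)
patches = patches.reshape(-1, patch * patch)      # (64, 256): 64 flattened "image tokens"
print(patches.shape)
```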
Another takeaway is that you don't need to pass a tensor of shape (batch_size, sequence_length, d_model) through your transformer. Not every token needs its own dedicated latent embedding; you can presumably get away with dividing sequence_length by some constant factor k (sketched below).
This isn't super groundbreaking, but it does reinforce the validity of a middle ground between recurrent models, where the context is compressed into a single "memory token", and transformers, where the context is left uncompressed: for a context of n tokens compressed by a factor k, the effective length sits strictly in between, 1 < n/k < n.
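To make that n/k concrete, here's a rough compress-then-attend sketch in PyTorch (the TokenCompressor module, k = 4, and the merge-by-projection scheme are my own illustration, not how any particular paper does it): merge every k consecutive token embeddings into one latent, so attention runs over n/k positions instead of n.

```python
# Sketch: compress the sequence by a factor k before the transformer sees it.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Merges every k consecutive token embeddings into one latent vector."""
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        # Learn a projection from k stacked embeddings down to a single one.
        self.proj = nn.Linear(k * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, sequence_length, d_model); assumes sequence_length % k == 0.
        b, n, d = x.shape
        x = x.reshape(b, n // self.k, self.k * d)  # group k neighbours together
        return self.proj(x)                        # (batch_size, n // k, d_model)

d_model, k = 512, 4
compressor = TokenCompressor(d_model, k)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

tokens = torch.randn(2, 1024, d_model)   # (batch_size, sequence_length, d_model)
latents = encoder(compressor(tokens))    # attention runs over 1024 // 4 = 256 positions
print(latents.shape)                     # torch.Size([2, 256, 512])
```

With k = 4 the quadratic attention cost drops by roughly 16x, while each latent still carries information from its k source tokens rather than everything being squashed into one memory slot.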
Correct?