I don't think that's the case. Chinese characters have the highest per-character information entropy of any writing system, but they are all independent symbols: if you want an LLM to support 5,000 Chinese characters, you have to put 5,000 entries into the vocabulary lookup table (there are no roots, prefixes, or suffixes in Chinese, so you cannot split a character into reusable word pieces). As a result, you may need fewer characters to express the same meaning than in Latin-script languages, but the LLM may also need to activate far more distinct token embeddings.
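
A rough way to see the effect (just a sketch, assuming the `tiktoken` package is installed and using a made-up sentence pair, not a rigorous measurement):

```python
# Compare how a BPE tokenizer handles an English vs. a Chinese sentence
# of roughly the same meaning. The sentence pair is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

en = "The weather is very nice today."
zh = "今天天气很好。"  # roughly the same meaning, far fewer characters

for label, text in [("English", en), ("Chinese", zh)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")

# Typical pattern: the Chinese string uses far fewer characters, but each
# character either needs its own vocabulary entry or falls back to UTF-8
# byte pieces, whereas the English words are built from a small set of
# reusable subword pieces.
```

The exact token counts depend on the tokenizer, but the character-to-token ratio for Chinese is usually much closer to 1 (or worse), which is the trade-off described above.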