Unicode allows for 17 planes, each of 65,536 possible characters (or 'code point...

jfk13 · on Feb 15, 2018

Rendering Indic scripts with AAT fonts involves a series of finite state machines that are stored in the individual font. So don't forget to multiply by the number of different fonts that each need to be tested.

zokier · on Feb 15, 2018

> Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112.

It allows for 17 planes, but only a small portion of those are actually used. According to Wikipedia[1], Unicode currently has 148944 code points + 128k private use ones (which might or might not make sense to include in unit tests). So your time estimate is off by a mere 5 orders of magnitude.

nikanj · on Feb 15, 2018

(148944 ^ 5) milliseconds to years = 2.324 quadrillion years.

Doesn't really change the result, imho.
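
For anyone who wants to check the arithmetic, here is a rough sketch. It reads the formula above as "every 5-code-point sequence, one test per millisecond"; that reading, the 365-day year, and the helper name `exhaustive_test_years` are assumptions for illustration, not something stated in the thread.

```python
# Milliseconds in a year, assuming a 365-day year (an assumption).
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365


def exhaustive_test_years(alphabet_size: int, length: int, ms_per_test: float = 1.0) -> float:
    """Years needed to test every sequence of `length` code points.

    Hypothetical helper: assumes one test takes `ms_per_test` milliseconds
    and that tests run strictly one after another.
    """
    return alphabet_size ** length * ms_per_test / MS_PER_YEAR


# Assigned code points only (the figure quoted from Wikipedia above).
assigned = exhaustive_test_years(148_944, 5)
# The full code space: 17 planes of 65,536 code points each.
full_space = exhaustive_test_years(17 * 65_536, 5)

print(f"assigned only:   {assigned:.3e} years")
print(f"full code space: {full_space:.3e} years")
```

Either way the number stays far outside anything a test suite could run, which is the point being made.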

ancarda · on Feb 15, 2018

Do you know how this bug was found?

Are there just enough people using iOS that these sorts of bugs can be found by mistake, or is someone fuzzing CoreText? Perhaps that can be applied to provide some kind of test coverage? Even if it’s not complete?

perkee · on Feb 15, 2018

This sequence begins the Telugu word for "knowledge" so maybe someone texted that to someone and it went viral from there. This is, of course, only speculation.

bonzini · on Feb 15, 2018

Does the word include the zwnj character? How do you input it?

Manishearth · on Feb 15, 2018

It does not include the zwnj; that somehow snuck in. Most keyboards don't support directly inputting a zwnj, but may support it for specific combinations. For example, my Marathi keyboard supports typing eyelash rephs (e.g. in र्‍क), which include a zwj.

However I'm not aware of any such things in Telugu aside from explicit virama-showing which rarely exists in input methods (and doesn't end up with zwnj in the position shown here, but that could have happened after editing).

perkee · on Feb 15, 2018

Huge oversight on my part: the string starts with 0xC1C, 0xC4D, 0xC1E, 0xC3E, i.e. without the ZWNJ. I'm stumped.

https://en.wiktionary.org/wiki/%E0%B0%9C%E0%B1%8D%E0%B0%9E%E...
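
A small sketch for inspecting those four code points, using only Python's standard `unicodedata` module. The variable names are illustrative; the code point values are the ones listed above, and nothing here reproduces the crashing sequence itself.

```python
import unicodedata

# The four code points quoted above for the start of the Telugu word.
code_points = [0x0C1C, 0x0C4D, 0x0C1E, 0x0C3E]
text = "".join(chr(cp) for cp in code_points)

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Print each code point with its official Unicode character name.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Confirm that the plain word contains no ZWNJ.
print("contains ZWNJ:", ZWNJ in text)
```

The same loop can be run over any pasted string to see whether a ZWNJ (U+200C) or ZWJ (U+200D) is hiding in it.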

pokpokpok · on Feb 15, 2018

True, but characters could be mapped to equivalence categories according to the logic in the code.
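
A minimal sketch of that idea, assuming (purely for illustration) that the shaping code only branches on a character's general category and canonical combining class. A real shaping engine uses much richer, script-specific classes, so `equivalence_key` and `representatives` below are hypothetical stand-ins, not CoreText's actual logic.

```python
import unicodedata
from collections import defaultdict


def equivalence_key(ch: str) -> tuple:
    """Hypothetical equivalence class: (general category, combining class)."""
    return (unicodedata.category(ch), unicodedata.combining(ch))


def representatives(start: int, end: int) -> dict:
    """Pick one representative character per class in [start, end]."""
    classes = defaultdict(list)
    for cp in range(start, end + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":  # skip unassigned code points
            continue
        classes[equivalence_key(ch)].append(ch)
    # Keep the first character seen for each class.
    return {key: members[0] for key, members in classes.items()}


# The Telugu block spans U+0C00..U+0C7F.
telugu = representatives(0x0C00, 0x0C7F)
for key, ch in sorted(telugu.items()):
    print(key, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Testing sequences of representatives instead of sequences of all characters shrinks the search space from (block size)^n to (number of classes)^n, at the cost of trusting that the classes really match what the code distinguishes.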