My guess would be that it's some aspect of *measuring* the text that is causing ...

Manishearth · on Feb 15, 2018

Well, the crash occurs for Spotlight without me clicking anything or having any cursors anywhere.

But yeah, this is one of my theories about it. One of the previous crashes had to do with an Arabic string that got longer when you truncated it, which broke the code that snipped it for display in a notification.

It's interesting to see it's causing a segfault; I'd expect measuring bugs to cause clean assertion failures or shitty rendering. That's why I'm also wondering whether it's actually a disagreement about the number of "characters" in the rendered text.

> If measuring the sub-strings gives surprising results (sub-strings being visibly longer for example), this could cause the algorithm to fail in any number of interesting ways: for example if a binary search is used to locate the cursor position, it could break the invariants of the binary search.

Cursor positions are based on grapheme clusters -- there's a defined algorithm for that. Different parts of the system may disagree on the specifics of the algorithm, though, and that could cause a crash like this.

However, that doesn't gel with the fact that only specific consonants cause this: no version of UAX 29 distinguishes between Indic consonants within a single script.
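
To illustrate the quoted point about binary search: here is a minimal sketch, assuming hit-testing works by searching over prefix widths. The function names and the width values are made up for illustration and are not taken from any real text stack. The search is only correct when the widths never decrease as the prefix grows.

```python
def hit_test(widths, x):
    """Largest prefix length n with widths[n] <= x, found by binary search.

    widths[n] is the (hypothetical) measured width of the first n characters.
    Only correct if widths is non-decreasing and widths[0] <= x.
    """
    lo, hi = 0, len(widths) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if widths[mid] <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


def hit_test_linear(widths, x):
    """Same question answered by checking every prefix; no ordering assumed."""
    return max(n for n, w in enumerate(widths) if w <= x)


# Made-up measurements: in the second list the 3-character prefix is
# narrower than the 2-character prefix (e.g. after a conjunct forms).
monotonic = [0, 10, 20, 30, 40]
surprising = [0, 10, 28, 22, 40]
```

With `monotonic` both functions agree, but with `surprising` and `x = 25` the binary search returns 1 while the linear scan returns 3. In a language without bounds checks, that kind of disagreement between two parts of a system is the sort of thing that could end in a segfault rather than a clean assertion.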

rlanday · on Feb 15, 2018

UAX 29 doesn't really describe how to handle Indic text very well.

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

> Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts. Aksaras usually consist of a consonant, sometimes with an inherent vowel and sometimes followed by an explicit, dependent vowel whose rendering may end up on any side of the consonant letter base. Extended grapheme clusters include such simple combinations.

> However, aksaras may also include one or more additional prefixed consonants, typically with a virama (halant) character between each pair of consonants in the sequence. Such consonant cluster aksaras are not incorporated into the default rules for extended grapheme clusters, in part because not all such sequences are considered to be single “characters” by users. Indic scripts vary considerably in how they handle the rendering of such aksaras—in some cases stacking them up into combined forms known as consonant conjuncts, and in other cases stringing them out horizontally, with visible renditions of the halant on each consonant in the sequence. There is even greater variability in how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are handled for display in combinations in aksaras. So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful.

For example, in Chrome, we added an extra rule to not allow grapheme clusters to be split after Indic virama characters, but later had to modify the rule to not apply to Tamil viramas:

https://chromium-review.googlesource.com/c/chromium/src/+/84...
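
A rough sketch of what such a tailoring could look like. This is not Chrome's code and not a full UAX 29 implementation: it only attaches combining marks to the preceding character and ignores ZWJ, Hangul, emoji, and the other rules. The `clusters` function and its `join_after_virama` flag are hypothetical names. It relies on the fact that viramas have Canonical_Combining_Class 9.

```python
import unicodedata

TAMIL_VIRAMA = "\u0bcd"  # TAMIL SIGN VIRAMA (pulli)


def is_virama(ch):
    # Canonical_Combining_Class 9 is the "Virama" class.
    return unicodedata.combining(ch) == 9


def clusters(text, join_after_virama=True):
    """Very simplified cluster segmentation (assumption: marks only).

    With join_after_virama=True, no break is allowed after a virama,
    except after the Tamil virama.
    """
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins = (
            bool(out)
            and join_after_virama
            and is_virama(out[-1][-1])
            and out[-1][-1] != TAMIL_VIRAMA
        )
        if out and (is_mark or joins):
            out[-1] += ch  # extend the current cluster
        else:
            out.append(ch)  # start a new cluster
    return out


devanagari_ksha = "\u0915\u094d\u0937"  # KA + VIRAMA + SSA
tamil_kka = "\u0b95\u0bcd\u0b95"        # KA + VIRAMA + KA
```

Without the extra rule, `devanagari_ksha` splits into two clusters (KA + VIRAMA, then SSA); with it, the whole conjunct is one cluster. `tamil_kka` stays as two clusters either way because of the Tamil exception.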

I don't know the exact cause of this crash, but I can see why Apple might be running into trouble with their logic for these languages. I suspect their algorithm for computing grapheme clusters has a bug causing an inconsistency somewhere.

Manishearth · on Feb 15, 2018

Yeah, I'm aware, I've been arguing for UAX 29 to handle consonant clusters for a while. The current draft has handling for it: http://www.unicode.org/reports/tr29/tr29-32.html#Virama

However, given that some Brahmic scripts prefer explicit viramas (Malyalam, also Thai I think), this will probably be restricted to Brahmic scripts where joining is always preferred (even if not possible).

I'd been testing UAX 29 stuff out before and Apple seems to follow the spec. For example, Chrome and Firefox seem to do special handling for e.g. flag emoji (distinguishing between regional indicator pairs that render as a flag vs. those which don't -- i.e. the ones which don't correspond to a country code). But Apple follows the spec rigidly. In particular, it does not consider joined consonants to form a single EGC.

I could be wrong on that, though.
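
To make the flag emoji distinction concrete, here is a sketch under assumptions: the default rules pair regional indicators left to right regardless of what they spell, while a tailored segmenter might only pair them when the two letters form a known code. `split_ri_run`, `known_codes`, and the one-entry code set are hypothetical, and I'm not claiming this is how any browser implements it. The input is assumed to contain only regional indicator symbols.

```python
RI_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def to_ri(letters):
    """Map ASCII letters A-Z to regional indicator symbols."""
    return "".join(chr(RI_A + ord(c) - ord("A")) for c in letters)


def to_letters(ris):
    """Map regional indicator symbols back to ASCII letters."""
    return "".join(chr(ord(c) - RI_A + ord("A")) for c in ris)


def split_ri_run(run, known_codes=None):
    """Split a run of regional indicators into clusters.

    known_codes=None: always pair left to right (spec-style behaviour).
    Otherwise: pair only when the two letters are in known_codes.
    """
    clusters, i = [], 0
    while i < len(run):
        pair = run[i:i + 2]
        pairable = len(pair) == 2 and (
            known_codes is None or to_letters(pair) in known_codes
        )
        step = 2 if pairable else 1
        clusters.append(run[i:i + step])
        i += step
    return clusters


run = to_ri("USXQ")  # "US" is a real code; "XQ" is not assigned to a country
```

`split_ri_run(run)` yields two clusters, one for each pair, whereas `split_ri_run(run, {"US"})` yields three: the US pair, then the two leftover indicators on their own.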

jcheng · on Feb 16, 2018

Nit: it's "Malayalam". I noticed the same typo in your blog post.

Manishearth · on Feb 16, 2018

lol I keep making this mistake. Thanks.

jcranmer · on Feb 15, 2018

I'd have to dump strings as grapheme clusters, but it's quite possible that there's some grapheme cluster weirdness going on here.

goalieca · on Feb 15, 2018

I believe the renderer is a brand-new Metal 2 one for High Sierra. Are the crashing third-party apps for iOS using this new renderer too?

jfk13 · on Feb 15, 2018

It can crash on (non-High) Sierra, too. The crash is in Core Text shaping rather than in actual painting, so it's somewhat above the Metal level, IIUC.

blackflame7000 · on Feb 15, 2018

Autocorrect, perhaps?