My guess would be that it's some aspect of measuring the text that is causing the crash: when you click in an editable text box, there is code to track down where the cursor should be placed. This is done by measuring various sub-strings of the whole line.
If measuring the sub-strings gives surprising results (sub-strings being visibly longer for example), this could cause the algorithm to fail in any number of interesting ways: for example if a binary search is used to locate the cursor position, it could break the invariants of the binary search.
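To make the binary-search point concrete, here is a minimal sketch. It is not anyone's actual hit-testing code: `caret_index`, `linear_caret_index`, and the width tables are hypothetical names and made-up numbers. It assumes the caret is found by searching over prefix widths, which only works if a longer prefix is never narrower than a shorter one.

```python
from typing import Callable

def caret_index(prefix_width: Callable[[int], float], n: int, click_x: float) -> int:
    """Binary search for the largest i in [0, n] with prefix_width(i) <= click_x.

    Only correct if prefix_width is non-decreasing in i.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_width(mid) <= click_x:
            lo = mid
        else:
            hi = mid - 1
    return lo

def linear_caret_index(prefix_width: Callable[[int], float], n: int, click_x: float) -> int:
    """Reference answer by exhaustive scan; makes no monotonicity assumption."""
    return max(i for i in range(n + 1) if prefix_width(i) <= click_x)

# Made-up prefix widths for a 4-character line.
MONOTONIC = [0, 10, 20, 30, 40]
# Hypothetical shaped text: the 3-character prefix renders narrower than the
# 2-character prefix (e.g. because characters combine into one glyph).
SHAPED = [0, 10, 25, 12, 22]

if __name__ == "__main__":
    print(caret_index(lambda i: MONOTONIC[i], 4, 25),
          linear_caret_index(lambda i: MONOTONIC[i], 4, 25))
    print(caret_index(lambda i: SHAPED[i], 4, 15),
          linear_caret_index(lambda i: SHAPED[i], 4, 15))
```

With the monotonic table both functions agree (index 2 for a click at x=25). With the non-monotonic table the binary search returns 1 while the exhaustive scan returns 3, so any code that trusts the search result is now working from a wrong index.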
Well, the crash occurs for Spotlight without me clicking anything or having any cursors anywhere.
But yeah, this is one of my theories about it. One of the previous crashes involved an Arabic string that got longer when you truncated it, which broke the code that snips it for display in a notification.
It's interesting to see it's causing a segfault; I'd expect measuring bugs to cause clean assertion failures or shitty rendering. That's why I'm also wondering whether it's actually a disagreement on the number of "characters" in the rendered text.
> If measuring the sub-strings gives surprising results (sub-strings being visibly longer for example), this could cause the algorithm to fail in any number of interesting ways: for example if a binary search is used to locate the cursor position, it could break the invariants of the binary search.
Cursor positions are based on grapheme clusters -- there's a defined algorithm for that (UAX 29). Different parts of the system may disagree on the specifics of the algorithm, though, and that could cause a crash like this.
However, that doesn't gel with the fact that only specific consonants trigger it: no version of UAX 29 treats Indic consonants within a single script differently from one another.
> Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts. Aksaras usually consist of a consonant, sometimes with an inherent vowel and sometimes followed by an explicit, dependent vowel whose rendering may end up on any side of the consonant letter base. Extended grapheme clusters include such simple combinations.
> However, aksaras may also include one or more additional prefixed consonants, typically with a virama (halant) character between each pair of consonants in the sequence. Such consonant cluster aksaras are not incorporated into the default rules for extended grapheme clusters, in part because not all such sequences are considered to be single “characters” by users. Indic scripts vary considerably in how they handle the rendering of such aksaras—in some cases stacking them up into combined forms known as consonant conjuncts, and in other cases stringing them out horizontally, with visible renditions of the halant on each consonant in the sequence. There is even greater variability in how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are handled for display in combinations in aksaras. So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful.
For example, in Chrome, we added an extra rule to not allow grapheme clusters to be split after Indic virama characters, but later had to modify the rule to not apply to Tamil viramas:
I don't know the exact cause of this crash, but I can see why Apple might be running into trouble with their logic for these languages. I suspect their algorithm for computing grapheme clusters has a bug causing an inconsistency somewhere.
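As a rough illustration of how two components could disagree on the number of clusters, here is a sketch that segments the same string two ways. It is a heavy simplification and not Chrome's or Apple's code: the "default" mode only attaches combining marks to the preceding base (a crude stand-in for the extended grapheme cluster rules quoted above), and the "tailored" mode additionally refuses to break after a virama unless it is the Tamil one. Identifying viramas by canonical combining class 9 is my assumption for the sketch, and `clusters` is a hypothetical helper.

```python
import unicodedata

TAMIL_VIRAMA = "\u0bcd"

def is_mark(ch: str) -> bool:
    # General categories Mn, Mc, Me: combining marks.
    return unicodedata.category(ch).startswith("M")

def is_virama(ch: str) -> bool:
    # Assumption: treat canonical combining class 9 as "virama".
    return unicodedata.combining(ch) == 9

def clusters(text: str, join_after_virama: bool = False) -> list[str]:
    """Split text into approximate grapheme clusters.

    join_after_virama=False: marks attach to the preceding base only.
    join_after_virama=True: also keep a letter attached to a preceding
    virama, except after the Tamil virama.
    """
    out: list[str] = []
    for ch in text:
        if out:
            prev = out[-1][-1]
            joins_conjunct = (
                join_after_virama
                and is_virama(prev)
                and prev != TAMIL_VIRAMA
                and unicodedata.category(ch).startswith("L")
            )
            if is_mark(ch) or joins_conjunct:
                out[-1] += ch
                continue
        out.append(ch)
    return out

# Telugu: consonant, virama, consonant, vowel sign.
TELUGU = "\u0c1c\u0c4d\u0c1e\u0c3e"
# Tamil: consonant, virama, consonant.
TAMIL = "\u0b95\u0bcd\u0bb7"

if __name__ == "__main__":
    for name, s in (("Telugu", TELUGU), ("Tamil", TAMIL)):
        print(name, len(clusters(s)), len(clusters(s, join_after_virama=True)))
```

For the Telugu string the two modes give 2 and 1 clusters respectively; for the Tamil string both give 2. If one layer sizes a buffer or an index table using one count and another layer walks it using the other, that is the kind of inconsistency I have in mind.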
However, given that some Brahmic scripts prefer explicit viramas (Malayalam, also Thai I think), this will probably be restricted to Brahmic scripts where joining is always preferred (even if not possible).
I'd been testing UAX 29 stuff out before and Apple seems to follow the spec. For example, Chrome and Firefox seem to do special handling for e.g. flag emoji (distinguishing between regional indicator pairs that render as a flag vs. those which don't -- i.e. the ones which don't correspond to a country code). But Apple follows the spec rigidly. In particular, it does not consider joined consonants to form a single EGC.
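To show what I mean by the flag difference, here is a small sketch. The first function pairs up regional indicators the way the spec's pairing rule does, regardless of whether the pair is a real country code. The second is only my guess at the general shape of the "special handling": it splits pairs whose code is not in a caller-supplied set. The names `ri_pairs`, `region_code`, and `flag_aware_clusters` and the `known_regions` parameter are all hypothetical, and I haven't read how the browsers actually implement this.

```python
RI_FIRST = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
RI_LAST = 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER Z

def is_ri(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST

def ri_pairs(text: str) -> list[str]:
    """Group consecutive regional indicators into pairs, valid code or not."""
    out: list[str] = []
    i = 0
    while i < len(text):
        if is_ri(text[i]) and i + 1 < len(text) and is_ri(text[i + 1]):
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

def region_code(pair: str) -> str:
    """Map a regional indicator pair to its two ASCII letters."""
    return "".join(chr(ord(c) - RI_FIRST + ord("A")) for c in pair)

def flag_aware_clusters(text: str, known_regions: set[str]) -> list[str]:
    """Like ri_pairs, but split pairs that are not in known_regions."""
    out: list[str] = []
    for cluster in ri_pairs(text):
        if len(cluster) == 2 and region_code(cluster) not in known_regions:
            out.extend(cluster)  # two separate single-indicator clusters
        else:
            out.append(cluster)
    return out

US = "\U0001F1FA\U0001F1F8"  # regional indicators U + S
XX = "\U0001F1FD\U0001F1FD"  # regional indicators X + X, not a country code

if __name__ == "__main__":
    print(len(ri_pairs(US + XX)))
    print(len(flag_aware_clusters(US + XX, {"US"})))
```

The spec-style pairing reports 2 clusters for that string, while the flag-aware variant reports 3, so even for emoji two implementations can disagree about where the cluster boundaries are.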