Can you expand a bit on what the problem was and what the discussion was like an...

kelseyfrog · on Sept 8, 2024

For sure. The problem was writing an np.unique[1] that could handle large datasets. Specifically, the solution involved chunking the dataset, mapping np.unique across chunks, and then combining chunks. Merging the counts result is an associative operation and merging them in a tree-like computational graph implies O(log n) merges. The end result was being able to perform the calculation in seconds whereas the previous duration was days to weeks. Going from O(n) to O(log n) is magic.

Specifically this is work related to implementing large dataset support for the dedupe library[1]. It's valuable to be able to effectively de-duplicate messy datasets. That's about as much as I can share.

1. https://numpy.org/doc/stable/reference/generated/numpy.uniqu...

2. https://github.com/dedupeio/dedupe/blob/main/dedupe/clusteri...

SkiFire13 · on Sept 8, 2024

> merging them in a tree-like computational graph implies O(log n) merges.

Intuitively this doesn't make sense to me. You have a tree that has N leaves, and you have to perform a merge for each parent. There are however N-1 parents, so you'll still perform O(N) merges.

dskloet · on Sept 8, 2024

Why do you need to merge counts if you look for unique elements? Isn't the count always 1 for each element present? Isn't the key to shard your chunks so each element can appear in at most one chunk?

ReleaseCandidat · on Sept 8, 2024

Actually what you are using is an equivalence relation and you are generating the equivalence classes of your set, "your" associativity is transitivity. Whenever you are comparing something it's almost always better to think of it as a relation instead of an algebraic structure. So you already know beforehand that your algorithm should be reflexive, symmetric and transitive (in theory, of course with computers there is numerics involved ;)

The steps to the solution are something like:

- I need all distinct elements of X

- Oh, that's a quotient set (the partition) of X by a/the equivalence relation (`==`)

- so my algorithm must be reflexive (yeah, trivially), symmetric (not so helpful) and transitive - now this I can use (together with the symmetry)

It's generally easier if you know beforehand that it must be e.g. transitive.