So, here's a small blurb to clear up some misunderstandings. Excuse the typos since I'm writing in a bit of a hurry.
Most of the data processing is actually done in CLI tools we created and not in Azure SQL HyperScale. That includes things like:
* Extracting email addresses (from CSV or other delimited files). This can be problematic because it turns out the people who gather this data aren't always thinking about encoding etc., so very often we need to jump through hoops to get it mostly correct. And when a breach like this huge one arrives as multi-terabyte CSVs, you need a tool that is both fast and memory-efficient. For this breach we actually wrote our own, since other tools we tried often choked with out-of-memory errors or simply ran too slow. We will likely be open-sourcing some of these tools. (A rough sketch of the streaming approach follows this list.)
* Extracting emails is one thing; extracting passwords is another, with totally different requirements.
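To give a feel for the extraction side, here's a minimal Python sketch of the general shape: stream line by line so memory stays flat, and tolerate bad encodings rather than crash on them. Purely illustrative, not our actual tool:

```python
import re
import sys

# Deliberately ASCII-only; internationalised addresses are a rabbit hole of their own.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(path):
    # errors="replace" stops one badly-encoded row from aborting a multi-TB file,
    # and iterating the handle streams one line at a time, so memory stays flat.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            for match in EMAIL_RE.finditer(line):
                yield match.group().lower()

if __name__ == "__main__":
    for email in extract_emails(sys.argv[1]):
        print(email)
```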
Emails need to be extracted case-insensitively, and for stealer logs we also need to parse the domains associated with each email. We hash the email because the hashes are used for k-anonymity as well as for batching in internal processes.
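Conceptually it looks something like this (the 64-batch scheme shown is a made-up example for illustration, not our actual process):

```python
import hashlib

def email_sha1(email: str) -> str:
    # Normalise first so "Foo@Bar.com" and "foo@bar.com" hash identically.
    return hashlib.sha1(email.strip().lower().encode("utf-8")).hexdigest().upper()

sha1 = email_sha1("Someone@Example.COM")
ka_prefix = sha1[:6]            # the k-anonymity range prefix
batch = int(sha1[:4], 16) % 64  # hypothetical: spread work across 64 internal batches
print(ka_prefix, batch)
```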
Passwords also need to be hashed (SHA1 and NTLM) for Pwned Passwords, which means making sure we use a consistent encoding. We also need to dedupe AND count them for prevalence purposes.
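Again illustrative only; at multi-terabyte scale the dedupe-and-count step can't be an in-memory dict, but the shape is:

```python
import hashlib
from collections import Counter

def sha1_hex(password: str) -> str:
    # Encoding is pinned to UTF-8 so the same password always yields the same hash.
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()

def ntlm_hex(password: str) -> str:
    # NTLM is MD4 over the UTF-16LE bytes; hashlib.new("md4") needs an OpenSSL
    # build with MD4 available (the "legacy" provider on OpenSSL 3).
    return hashlib.new("md4", password.encode("utf-16-le")).hexdigest().upper()

# Counter stands in for the real dedupe-and-count step; the real pipeline
# would need a disk-backed or sort-based approach rather than an in-RAM dict.
prevalence = Counter()
for pw in ["hunter2", "password", "hunter2"]:
    prevalence[pw] += 1

for pw, count in prevalence.items():
    print(sha1_hex(pw), ntlm_hex(pw), count)
```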
All of this we can mostly do without touching Azure SQL HyperScale.
Once we have prepared the data, it needs to be inserted into the DB. It's not a case of creating simple binary lookup tables, because we have different requirements: there are at least three different ways of looking up an email address.
1. The full email (alias + domain)
2. Domain lookups (just domain)
3. K-anonymity lookups (first 6 chars of the SHA1 of the full email address)
This requires emails to be indexed not just on the alias and the domain (which we split out into a domain ID). We also need indexes on things like the SHA1 prefix, and we need to take into account when people have opted out of having their emails loaded.
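To make that concrete, each of those three lookups needs its own key derivable from the one address, and each wants its own index. A hypothetical sketch (the field names are made up, not our actual schema):

```python
import hashlib

def index_keys(email: str, domain_ids: dict) -> dict:
    # Hypothetical field names, purely to illustrate the three access paths.
    alias, _, domain = email.lower().partition("@")
    sha1 = hashlib.sha1(email.lower().encode("utf-8")).hexdigest().upper()
    return {
        "alias": alias,                   # 1. alias + domain_id reconstructs the full email
        "domain_id": domain_ids[domain],  # 2. domain-wide lookups
        "sha1_prefix": sha1[:6],          # 3. k-anonymity range queries
        "opted_out": False,               # must be honoured on every access path
    }

print(index_keys("someone@example.com", {"example.com": 42}))
```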
Reasons for Azure SQL HyperScale:
The email and domain search data used to be stored in Azure Table Storage. It was very convenient, since it was fast to look up (partition keys and row keys) and cheap to store. There was one big drawback though: Azure Table Storage has no backup or point-in-time restore story. The only way to back up and restore data is to download it all and re-upload it. Which is easy enough, except downloading the data was starting to take a week, even running from a VM in the same datacenter as the Table Storage account. And for a service like Have I Been Pwned, if we had a disaster, or messed up a breach load and had to roll back, taking everything offline or serving the wrong data for a week is unacceptable.
That's where Azure SQL HyperScale came in. It wasn't picked because Troy is a Microsoft RD or because I'm an MVP. We simply picked it because we both know MS SQL very well, we have good access to people who know it even better than us (for support purposes), and it has very good, tried-and-tested backup/restore scenarios.
We do know there are certainly better DBs we could use, and it would probably be cheaper to run our own Postgres on our own hardware, or something along those lines. But since it's just two people actively working on this, and we hardly have time for developing new features and loading breaches as it is, we simply couldn't spend valuable time learning the ins and outs of a new DB engine, what it takes to run/maintain/optimize it, and all the other SRE responsibilities that come with it.
So in the end, it came down to convenience and where our time is best spent.
Rest assured though, with everything we learned processing this breach, we will be much quicker on the next really large one: we've come away with a ton of learnings, new tools and processes that we'll be implementing. I'd expect the next breach of this size to take just a couple of days to process. Most other breaches take a lot less time, since they are a fraction of the size of this one.
Binary files:
Pwned Passwords is currently stored in blob storage, with just the first 5 chars of each hash in the filename and the rest in a newline-delimited, ordered file. I have already done some tests on making these binary files (since the hashes are always a fixed size and the prevalence is just an int). Each hash entry could technically be 18 bytes for the rest of the hash (the remaining 35 hex chars packed into bytes, padded up to a whole byte) + 4 bytes for the prevalence (unsigned 32-bit int), so just 22 bytes per entry, and we skip newlines. We might actually go that route in the not-too-distant future since it's easy to do.
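A sketch of what that record layout and lookup could look like, matching the numbers above (my illustration of the idea, not a finished format):

```python
import struct

SUFFIX_LEN = 18           # 35 remaining hex chars packed into bytes, padded up
REC_LEN = SUFFIX_LEN + 4  # + uint32 prevalence = 22 bytes per record

def pack_record(suffix_hex: str, count: int) -> bytes:
    # 35 hex chars is an odd nibble count, so pad one zero nibble to fill 18 bytes.
    return bytes.fromhex(suffix_hex + "0") + struct.pack("<I", count)

def lookup(blob: bytes, suffix_hex: str):
    # Records are fixed-width and sorted, so we can binary-search by offset
    # instead of scanning newline-delimited text.
    target = bytes.fromhex(suffix_hex + "0")
    lo, hi = 0, len(blob) // REC_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        rec = blob[mid * REC_LEN:(mid + 1) * REC_LEN]
        if rec[:SUFFIX_LEN] == target:
            return struct.unpack("<I", rec[SUFFIX_LEN:])[0]
        if rec[:SUFFIX_LEN] < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```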
Hope that clears up some of our thoughts here :) I'm planning on writing a blog post soon with most of the things we learned, so that should shed further light on how we process these.
All fair points and obviously well thought out by knowledgeable people. I approach problem solving the same way, accounting for staff familiarity with tools, not just the wall-clock time taken for some esoteric approach.
My observation is that in the last year or so the relative weight of these contributions has shifted massively because of AI code authoring.
It’s so fast and easy to whip up a few hundred lines of code with something like Gemini 2.5 Pro that I got it to make me a sorting benchmark tool just so I’d have a data point for a comment in this thread! I never would have had the time years ago.
For relatively “small” and isolated problems like password hash lookup tables, it’s amazing what you can do in mere hours with AI assistance.
If I was approaching this same problem just two years ago I would have picked SQL Hyperscale too, for the same(ish) reasons.
Now? I feel like many more avenues have been opened up…