
The unwrap: not great, but understandable. It would be better to silently run with a partial config while paging oncall on some other channel, but that's a lot of engineering for a case that's apparently supposed to be a "can't happen".

The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.

The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.

The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.



IMO: there should be an explicit error path for invalid configuration, so the program aborts with a specific exit code and/or message. And there should be a supervisor that detects this behaviour, rolls back to the old working config, and waits a few minutes before trying to apply the new config again (with corresponding alerts, of course).

So basically a bad config should be explicitly detected and handled by rolling back to a known working config.
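
A minimal sketch of that supervisor idea (the binary path, config paths, and the "invalid config" exit code are all invented for illustration):

    use std::{fs, process::Command, thread, time::Duration};

    const BAD_CONFIG_EXIT: i32 = 42; // exit code the app uses for "config failed validation"
    const NEW_CONFIG: &str = "/etc/app/config.new";
    const LAST_GOOD: &str = "/etc/app/config.good";

    fn main() -> std::io::Result<()> {
        loop {
            let status = Command::new("/usr/bin/app")
                .args(["--config", NEW_CONFIG])
                .status()?;

            match status.code() {
                Some(0) => break, // clean shutdown, nothing to supervise
                Some(BAD_CONFIG_EXIT) => {
                    // Alert, roll back to the last known-good config, and give the
                    // upstream config producer a few minutes before retrying.
                    eprintln!("invalid config detected, rolling back to last good version");
                    fs::copy(LAST_GOOD, NEW_CONFIG)?;
                    thread::sleep(Duration::from_secs(300));
                }
                _ => thread::sleep(Duration::from_secs(5)), // unrelated crash: plain restart
            }
        }
        Ok(())
    }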


You don’t even need all the ceremony. If the config gets updated every 5 minutes, it is surely being hot-reloaded. If that’s the case, the old config is already in memory when the new config is being parsed. And if that’s the case, parsing shouldn’t have panicked; it should have logged a warning and carried on with the old config that must already be in memory.
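
Roughly like this (the Config type, validation, and path are placeholders): a failed parse logs a warning and leaves the in-memory config untouched.

    use std::{fs, sync::RwLock};

    struct Config {
        raw: String, // stand-in for the real, validated fields
    }

    fn parse(text: &str) -> Result<Config, String> {
        if text.trim().is_empty() {
            return Err("empty config".to_string());
        }
        Ok(Config { raw: text.to_string() })
    }

    static CURRENT: RwLock<Option<Config>> = RwLock::new(None);

    fn reload(path: &str) {
        let result = fs::read_to_string(path)
            .map_err(|e| e.to_string())
            .and_then(|text| parse(&text));

        match result {
            Ok(new_cfg) => {
                if let Ok(mut current) = CURRENT.write() {
                    *current = Some(new_cfg);
                }
            }
            // Keep whatever config is already in memory; just complain loudly.
            Err(e) => eprintln!("warning: rejecting new config, keeping previous: {e}"),
        }
    }

    fn main() {
        // In a real service this would run on every refresh interval.
        reload("/etc/app/features.json");
        if let Ok(current) = CURRENT.read() {
            if let Some(cfg) = current.as_ref() {
                println!("active config is {} bytes", cfg.raw.len());
            }
        }
    }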


> If that’s the case, the old config is already in memory when the new config is being parsed

I think that's explicitly a non-goal. My understanding is that Cloudflare prefers fail safe (blocking legitimate traffic) over fail open (allowing harmful traffic).


Well, then they should add some reliability goals into the mix as well, to balance it out a bit.


The system outputting the configuration file failed (it could have checked the size and/or content and stopped right away), but the system importing the file failed too. These things usually sound simple/stupid in hindsight. I am not a fan of everything centralising into a few hands. In a bad situation they can be weaponised or attacked, and in a good situation their blast radius is just too big and a bit random; in this case, global.


The query is surely faulty: even if this weren’t a huge distributed database with who-knows-what schemas and use cases, looking up a specific table by its unqualified name is sloppy.

But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata, rather than from a single source of truth, seems like a bigger problem to me.


It's probably not ok to silently run with a partial config, which could have undefined semantics. An old but complete config is probably ok (or, the system should be designed to be safe to run in this state).


For unwrap, Cloudflare should consider adding lint tooling that prevents unwrap from being added to production code.
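
Clippy already ships restriction lints for exactly this; a minimal sketch of turning them on crate-wide (the surrounding code is just for illustration):

    // At the top of main.rs / lib.rs. clippy::unwrap_used and clippy::expect_used
    // are real Clippy restriction lints.
    #![deny(clippy::unwrap_used)]
    #![warn(clippy::expect_used)]

    fn main() {
        // let port: u16 = "8080".parse().unwrap(); // cargo clippy now rejects this line
        let port: u16 = "8080".parse().unwrap_or(8080); // forced to choose an explicit fallback
        println!("listening on {port}");
    }

With the deny level set, cargo clippy fails the build on any new .unwrap(), which is straightforward to enforce in CI.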


It’s a feature, not a bug. Assert assumptions and crash on a bad one.

Crashing is not an outage. It’s a restart and a stack trace for you to fix.


Even if you want it to crash, you almost never unwrap. At the very least you would use .expect() so you get a reasonable error message, or better yet, handle the potential error.
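
A contrived example of the difference (file name and messages invented), with the same failure handled three ways, roughly in order of preference:

    use std::fs;

    fn main() {
        let path = "features.json";

        // 1. unwrap: panics with a generic message and little context.
        // let data = fs::read_to_string(path).unwrap();

        // 2. expect: still panics, but the panic message says what actually failed.
        // let data = fs::read_to_string(path)
        //     .expect("failed to read bot-management feature file");

        // 3. Handle it: decide what "degraded but alive" means for this component.
        let data = match fs::read_to_string(path) {
            Ok(text) => text,
            Err(e) => {
                eprintln!("could not read {path}: {e}; keeping previous feature set");
                return;
            }
        };
        println!("loaded {} bytes", data.len());
    }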


> Crashing is not an outage.

Are you in the right thread?


Did you skip all the other context about the other systems that failed?

The problem was a query producing incorrect data. The crash helped them find it.

What do you think happens when a program crashes?


> The crash helped them find it.

Barely, given that the initial impression was that it was a DDoS.

Although you can argue that's bad observability.


The type system is for asserting assumptions like "this cannot fail". You don't crash at all.
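
For illustration (all names invented): validate once at the boundary, then pass around a type that can't be invalid, so downstream code has nothing left to unwrap.

    struct DatabaseName(String);

    impl DatabaseName {
        fn parse(s: &str) -> Result<Self, String> {
            if s.is_empty() || s.chars().any(|c| c.is_whitespace()) {
                Err(format!("invalid database name: {s:?}"))
            } else {
                Ok(DatabaseName(s.to_string()))
            }
        }
    }

    // Anything that takes a DatabaseName can assume it is valid; the error path
    // exists only at the edge, where it can be reported instead of panicking.
    fn qualified_table(db: &DatabaseName, table: &str) -> String {
        format!("{}.{}", db.0, table)
    }

    fn main() {
        match DatabaseName::parse("default") {
            Ok(db) => println!("{}", qualified_table(&db, "bot_features")),
            Err(e) => eprintln!("{e}"),
        }
    }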


Most properties of programs cannot be validated at compile time and must be checked at runtime.

But you’re still missing it. Crashing is not bad. It’s good. It’s how you leverage OS level security and reliability.


This wasn't a runtime property that could not be validated at compile time. And you don't need to fall back on "OS level security and reliability" when your type system is enforcing application-level invariants.

In fact I'd argue that crashing is bad. It means you failed to properly enumerate and express your invariants, hit an unanticipated state, and thus had to fail in a way that requires you to give up and fall back on the OS to clean up your process state.

[edit]

Sigh, HN and its "you're posting too much". Here's my reply:

> Why? The end user result is a safe restart and the developer fixes the error.

Look at the thread you're commenting on. The end result was a massive world-wide outage.

> That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?

Because you don't have to crash at all.

> We don’t want to enumerate all possible paths. We want to limit them.

That's the exact same thing. Anything not "limited" is a possible path.

> If my program requires a config file to run, crash as soon as it can’t load the config file. There is nothing useful I can do (assuming that’s true).

Of course there's something useful you can do. In this particular case, the useful thing would have been to fall back on the previous valid configuration. And if that failed, to log an informative error so that nobody has to spend four hours during a worldwide outage figuring out what went wrong.


> I'd argue that crashing is bad.

Why? The end user result is a safe restart and the developer fixes the error.

> fall back on the OS to clean up your process state.

That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?

> It means you failed to properly enumerate and express your invariants

We don’t want to enumerate all possible paths. We want to prune them.

If my program requires auth info to run, crash as soon as it can’t load it. There is nothing useful I can do (assuming that’s true).


> The end result was a massive world-wide outage.

The worldwide outage was actually caused by deploying several incorrect programs in an incorrect system.

The root cause was actually the bad query outlined in the article.

Let’s get philosophical for a second. Programs WILL be written incorrectly - you will deploy to production something that can’t possibly work. What should you do with a program that can’t work? Pretend this can’t happen? Or let you know so you can fix it?


> Programs WILL be written incorrectly - you will deploy to production something that can’t possibly work. What should you do with a program that can’t work? Pretend this can’t happen? Or let you know so you can fix it?

Type systems provide compile time guarantees of correctness such that systems cannot fail in ways covered by the type system.

In this case, they used an unsound hole in the type system to do something that unnecessarily abandoned those compile-time invariants and in the process caused a world-wide outage.

The answer is not to embrace poking unsound holes in your type system in the first place.


It's not philosophical; half of the internet broke. There is middle ground between "I really have no other choice but to crash" and "something is wrong and I might want to crash here, but I won't; I'll try to recover instead".

In this particular case, the limit of 200 was there for "performance reasons", so I think there was more room to implement the latter than the former.
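
Roughly the kind of recovery being suggested, as a sketch (the limit-handling policy and all names are assumptions, not Cloudflare's actual code):

    // Refuse an oversized feature file, keep serving with the previous set,
    // and surface a warning for monitoring instead of panicking.
    const FEATURE_LIMIT: usize = 200; // the limit of 200 mentioned above

    fn apply_feature_file(
        new: Vec<String>,
        previous: Vec<String>,
    ) -> (Vec<String>, Option<String>) {
        if new.len() > FEATURE_LIMIT {
            let warning = format!(
                "{} features exceeds preallocated limit of {}; keeping previous set",
                new.len(),
                FEATURE_LIMIT
            );
            (previous, Some(warning))
        } else {
            (new, None)
        }
    }

    fn main() {
        let oversized: Vec<String> = (0..300).map(|i| format!("feature_{i}")).collect();
        let previous: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();

        let (active, warning) = apply_feature_file(oversized, previous);
        if let Some(w) = warning {
            eprintln!("warning: {w}");
        }
        println!("serving with {} features", active.len());
    }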


It's quite surprising that a single bad config file brought down their entire global network across multiple products.


I share the same opinion. As others pointed out, the status page going down was probably caused by bots checking it.



