Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

>> This is the multi-million dollar .unwrap() story.

> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.

Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.



> Instead, the routine likely should have worked within `Result`.

But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.

Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.

But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.


>> Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.

> But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.

I agree it is an error, but disagree that it should be a fatal error at that location. The reason being is the method defining the offending `unwrap` construct produces a `Result`, which is fully capable of representing any error `features.append_with_names` could produce.

> But that has nothing to do with the actual bug here, which was that the invariant failed.

The bug is by invoking `unwrap` the process crashed. To the degree that Cloudfare had a massive outage.

Had the logic been such that a `Result` representing this error condition activated an alternate workflow to handle the error (perhaps by logging it, emitting a notification event alerting SRE's, transitioning into a failure mode, or all of these options), then a global outage might have been averted.

Which makes:

> How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.

Very relevant indeed.


A failed config load probably shouldn't be a fatal error if a valid config is already loaded?


Hard to say. Why would you load a new config if a valid config is already loaded?

Maybe the new config has a new update. Who knows? Do we want to keep operating on the old config? Maybe maybe not.

But operating on old config when you don't want to is definitely worse.


Of course it depends on the situation. But I don't see how you could think that in this case, crashing is better than stale config.

Crashing on a config update is usually only done if it could cause data corruption if the configs aren't in sync. That's obviously not the case here since the updates (although distributed in real time) are not coupled between hosts. Such systems usually are replicated state machines where config is totally ordered relative to other commands. Example: database schema and write operations (even here the way many databases are operated they don't strongly couple the two).


Because stale config could easily go unnoticed for a long time.

Crashing is generally better than behaving incorrectly due to stale configs. Because the problem would get fixed faster.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: