
> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

I don't use Rust, but a lot of Rust people say if it compiles it runs.

Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.

end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.

there's no bad language - just occasional hiccups from us users who use those tools.



You misunderstand what Rust’s guarantees are. Rust has never promised to solve or protect programmers from logic errors or poor programming. In fact, no language can do that, not even Haskell.

Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
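To make that contract concrete, here is a minimal sketch. `parse_port` and `port_or_default` are hypothetical helpers invented for illustration, not anything from Cloudflare's code. It shows `unwrap()` as an assertion next to the version that handles the `Err` arm:

```rust
use std::num::ParseIntError;

// Hypothetical helper for illustration: parse a port number from text.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    text.trim().parse::<u16>()
}

// Handles both arms explicitly instead of asserting success.
fn port_or_default(text: &str, default: u16) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => default,
    }
}

fn main() {
    // unwrap() asserts the value is Ok; for this input the assertion holds.
    let port = parse_port("8080").unwrap();
    println!("parsed port {port}");

    // The same assertion on bad input would panic the current thread:
    // parse_port("not a port").unwrap();

    // Handling the Err arm keeps the program running.
    println!("fallback port {}", port_or_default("not a port", 443));
}
```

Swapping `unwrap()` for `expect("reason")` keeps the same panic behaviour but records why the author believed the value could not be an error.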

Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)

This is the classic case where, when something fails, people over-index on the cause of the failure while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.

I postulate that whatever this Cloudflare outage cost, in millions or hundreds of millions of dollars, has been more than paid for by the savings from safe memory access.

See: https://en.wikipedia.org/wiki/Survivorship_bias


> Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought

I mean that's an unfalsifiable statement, not really fair. C is used to successfully launch spaceships.

Whereas we have a real Rust bug that crashed a good portion of the internet for a significant amount of time. If this was a C++ service everyone would be blaming the language, but somehow Rust evangelicals are quick to blame it on "unidiomatic Rust code".

A language that lets this easily happen is a poorly designed language. Saying you need to ban a commonly used method in all production code is broken.


Only formal proof languages are immune to such properties. Therefore all languages are poorly designed by your metric.

Consider that the set of possible failures enabled by language design should be as small as possible.

Rust's set is small enough while also being productive. Until another breakthrough in language design as impactful as the borrow checker is invented, I don't imagine more programmers will be able to write such a large amount of safe code.


I would say the impact of the borrow checker is exaggerated.


> You misunderstand what Rust’s guarantees are.

Well, no, most Rust programmers misunderstand what the guarantees are because they keep parroting this quote. Obviously the language does not protect you from logic errors, so saying "if it compiles, it works" is disingenuous, when really what they mean is "if it compiles, it's probably free of memory errors".


No, the "if it compiles, it works" is genuinely about the program being correct rather than just free of memory errors, but it's more of a hyperbolic statement than a statement of fact.

It's a common thing I've experienced, and seen a lot of others say, that the stricter the language is in what it accepts, the more likely the code is to be correct by the time you get it to run. It's not just a Rust thing (although I think Rust is _stricter_ and therefore this does hold true more of the time), it's something I've also experienced with C++ and Haskell.

So no, it's not a guarantee, but that quote was never about Rust's guarantees.


Everyone understands Rust doesn't offer such guarantees.

Even more now after this outage.

But it's a fact that "if it compiles it runs" is often associated with Rust, in HN at least. A quick Algolia search tells me that.


I have definitely noticed this when I've tried doing Advent of Code in Rust - by the time my code compiles it typically spits out the right answer. It doesn't help me when I don't know which algorithm I need to reach for in order to solve it before the heat death of the universe, but it is a somewhat magical feeling while it lasts.


> Rust won't save you from the usual programming mistake.

Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
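For the linter part, a minimal sketch, assuming the project is checked with `cargo clippy` (plain `rustc` ignores these lints). `unwrap_used` and `expect_used` are existing Clippy restriction lints; the `proxy` module and `feature_count` function are made up for illustration:

```rust
// Assumption: the crate is linted with `cargo clippy`.
// Inside this module, any `.unwrap()` or `.expect()` call is a Clippy error.
#[deny(clippy::unwrap_used, clippy::expect_used)]
mod proxy {
    use std::num::ParseIntError;

    // Hypothetical function: returns the error instead of asserting success.
    pub fn feature_count(raw: &str) -> Result<usize, ParseIntError> {
        // Writing `raw.trim().parse::<usize>().unwrap()` here would fail `cargo clippy`.
        raw.trim().parse::<usize>()
    }
}

fn main() {
    match proxy::feature_count("60") {
        Ok(n) => println!("{n} features"),
        Err(e) => eprintln!("bad feature count: {e}"),
    }
}
```

A call site that really does need the assertion can opt back in with `#[allow(clippy::unwrap_used)]`, which at least leaves something for a reviewer to ask about.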

No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.


> Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.

No one treats it like that, and nearly every Rust project is filled with unwraps all over the place, even in production systems like Cloudflare's.


Well let me avoid those that don’t understand it. It’s literally Rust 101.


It's literally not, Rust tutorials are littered with `.unwrap()` calls. It might be Rust 102, but the first impression given is that the language is surprisingly happy with it.


https://doc.rust-lang.org/book/ch09-02-recoverable-errors-wi...

If you haven't read the Rust Book at least, which is effectively Rust 101, you should not be writing Rust professionally. It has a chapter explaining all of this.


> In production-quality code, most Rustaceans choose expect rather than unwrap and give more context about why the operation is expected to always succeed. That way, if your assumptions are ever proven wrong, you have more information to use in debugging.

I didn't read anything in that section about unwrap/expect that it shouldn't be used in production code. If anything I read it as perfectly acceptable.


I've worked on commercial codebases that did better, shrug.


Yep, unwrap() and unsafe are escape hatches that need very good justifications. It's fine for casual scripts where you don't care if it crashes. For serious production software they should be either banned, or require immense scrutiny.


> you can have a linter forbid it entirely if you like.

It would be better if it were the other way round: "linter forbids it unless you ask it not to". It's never wrong to allow users to shoot themselves in the foot, but it should be explicit.


> Well Rust won't save you from the usual programming mistake

This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.


And the people doing testing failed to ignore the excuse that this would never happen and test it anyway. With this kind of system you need a separate group that just ignores any "this will never happen" and still checks what happens if it does.

Now it might be that it was tested, but then ignored or deprioritised by management...


What people are saying is that idiomatic prod rust doesn't use unwrap/expect (both of which panic on the "exceptional" arm of the value) --- instead you "match" on the value and kick the can up a layer on the call chain.
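A rough sketch of what "kick the can up a layer" looks like with `?`. Every name here (`LoadError`, `FEATURE_LIMIT`, `check_features`, `load`) is hypothetical and the limit is an arbitrary number, not taken from the post-mortem:

```rust
use std::fmt;

// Hypothetical limit, chosen only so the example is small.
const FEATURE_LIMIT: usize = 3;

#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::TooManyFeatures { found, limit } => {
                write!(f, "{found} features exceeds the limit of {limit}")
            }
        }
    }
}

impl std::error::Error for LoadError {}

// The innermost layer reports the problem as a value instead of panicking.
fn check_features(names: &[&str]) -> Result<usize, LoadError> {
    if names.len() > FEATURE_LIMIT {
        return Err(LoadError::TooManyFeatures { found: names.len(), limit: FEATURE_LIMIT });
    }
    Ok(names.len())
}

// `?` returns early with the Err, handing the decision to the caller.
fn load(names: &[&str]) -> Result<String, LoadError> {
    let count = check_features(names)?;
    Ok(format!("loaded {count} features"))
}

fn main() {
    for input in [vec!["a", "b"], vec!["a", "b", "c", "d"]] {
        match load(&input) {
            Ok(msg) => println!("{msg}"),
            Err(err) => eprintln!("load failed, caller decides what to do: {err}"),
        }
    }
}
```

The error still has to be handled somewhere; the difference is that the layer that handles it has enough context to pick a fallback.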


What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate databases, what else are you going to do when the types don't match up besides `unwrap`ing, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it was a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.


> What happens to it up the callstack?

as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.

so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.

that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.

and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.

that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.

in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
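a minimal sketch of that idea, with loud assumptions: `Definitions`, `parse_definitions`, `ingest` and `MAX_FEATURES` are invented names, and the `AtomicU64` is only a stand-in for a real metrics counter. nothing here is Cloudflare's code or a Prometheus client API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for a real metrics counter; should stay at zero in normal conditions.
static LOAD_ERRORS: AtomicU64 = AtomicU64::new(0);

// Hypothetical limit, chosen only for the example.
const MAX_FEATURES: usize = 4;

#[derive(Debug, Clone, PartialEq)]
struct Definitions {
    features: Vec<String>,
}

// Hypothetical parser: one feature name per non-empty line.
fn parse_definitions(raw: &str) -> Result<Definitions, String> {
    let features: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    if features.is_empty() {
        return Err("definition file contains no features".to_string());
    }
    if features.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit {}", features.len(), MAX_FEATURES));
    }
    Ok(Definitions { features })
}

// Called by the watcher for each new file: swap on success, keep the old one on failure.
fn ingest(current: &mut Definitions, raw: &str) {
    match parse_definitions(raw) {
        Ok(new) => *current = new,
        Err(err) => {
            LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
            eprintln!("keeping previous definitions: {err}");
        }
    }
}

fn main() {
    let mut current = Definitions { features: vec!["a".to_string()] };
    ingest(&mut current, "a\nb\n");
    ingest(&mut current, "a\nb\nc\nd\ne\n"); // too many features: rejected
    println!("active: {:?}", current.features);
    println!("load errors: {}", LOAD_ERRORS.load(Ordering::Relaxed));
}
```

as noted above, this only helps instances that already hold a good file; a freshly started instance has nothing to fall back to.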


Given that the bug was elsewhere in the system (the config file parser spuriously failed), it’s hard to justify much of what you suggested.

Panics should be logged, and probably grouped by stack trace for things like prometheus (outside of process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at cloudflare scale.

Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.
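A rough sketch of restart-with-backoff living outside the worker process. The delays, the cap and the function names are all assumptions for illustration; in practice this job usually belongs to an existing process supervisor rather than hand-written code:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at 30 s (hypothetical values).
fn backoff_delay(consecutive_failures: u32) -> Duration {
    let base_ms: u64 = 100;
    let cap_ms: u64 = 30_000;
    let factor = 1u64 << consecutive_failures.min(16);
    Duration::from_millis((base_ms * factor).min(cap_ms))
}

// Runs `program` as a child process and restarts it after a failed exit,
// sleeping longer after each consecutive failure.
fn supervise(program: &str, max_restarts: u32) -> std::io::Result<()> {
    let mut failures = 0;
    while failures < max_restarts {
        let status = Command::new(program).status()?;
        if status.success() {
            return Ok(());
        }
        eprintln!("{program} exited with {status}; restarting after backoff");
        sleep(backoff_delay(failures));
        failures += 1;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Print the schedule so the example does something without arguments.
    for n in 0..5 {
        println!("after failure {n}: wait {:?}", backoff_delay(n));
    }
    // Optionally supervise a program given as the first argument.
    if let Some(program) = std::env::args().nth(1) {
        supervise(&program, 5)?;
    }
    Ok(())
}
```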

One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.

Sure, you could add an error handler for that too, and for prometheus being slow, and an infinite number of other things. Or, you could just move process management and reporting out of process.


Writing bad code that doesn’t handle errors and doesn’t correctly model your actual runtime invariants doesn’t simplify anything other than the amount of thought you have to put into writing the code — because you’re writing broken code.

The solution to this problem wasn’t restarting the failing process. It was correctly modeling the failure case, so that then the type system forced you to correctly handle it.


The way I’ve seen this on a few older systems was that they always keep the previous configuration around so it can switch back. The logic is something like this:

1. At startup, load the last known good config.

2. When signaled, load the new config.

3. When that passes validation, update the last-known-good pointer to the new version.

That way something like this makes the crash recoverable on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.
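The three steps, plus the last-tried variant, can be sketched roughly as below. The types, the one-number "config format" and the validation rule are all hypothetical, and the sketch promotes a config immediately rather than after any waiting period:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Config {
    version: u64,
    max_connections: u32,
}

#[derive(Debug, PartialEq)]
enum Reload {
    Applied,
    Rejected(String),
    SkippedAlreadyTried,
}

struct ConfigStore {
    last_known_good: Config,
    last_tried_version: Option<u64>,
}

// Hypothetical validation: the raw config is a single positive integer.
fn validate(version: u64, raw: &str) -> Result<Config, String> {
    let max_connections = raw
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("version {version}: {e}"))?;
    if max_connections == 0 {
        return Err(format!("version {version}: max_connections must be positive"));
    }
    Ok(Config { version, max_connections })
}

impl ConfigStore {
    // Step 1: start from the last known good config.
    fn new(last_known_good: Config) -> Self {
        ConfigStore { last_known_good, last_tried_version: None }
    }

    // Steps 2 and 3: try the new config and promote it only if validation passes.
    fn reload(&mut self, version: u64, raw: &str) -> Reload {
        if self.last_tried_version == Some(version) {
            return Reload::SkippedAlreadyTried; // do not re-parse a version we already tried
        }
        self.last_tried_version = Some(version);
        match validate(version, raw) {
            Ok(candidate) => {
                self.last_known_good = candidate;
                Reload::Applied
            }
            Err(reason) => Reload::Rejected(reason),
        }
    }
}

fn main() {
    let mut store = ConfigStore::new(Config { version: 1, max_connections: 100 });
    println!("{:?}", store.reload(2, "0"));
    println!("{:?}", store.reload(2, "0"));
    println!("{:?}", store.reload(3, "250"));
    println!("active: {:?}", store.last_known_good);
}
```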

For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.


Presumably you kick up the error to a level that says “if parsing new config fails, keep the old config”


The config file subsystem was where the bug lived, not the code with the unwrap, so this sort of change is a special case of “make the unwrap never fail and then fix the API so it is not needed”.


Yeah, see, that's what I mean.


"if it compiles it runs" - this is indeed an inaccurate marketing slogan. A more precise formulation would be "if it compiles then the static type system, pattern matching, explicit errors, Send bounds, etc. will have caught a lot of bugs that in other languages would have manifested as runtime errors".

Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either. Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.


> I don't use Rust, but a lot of Rust people say if it compiles it runs.

Do you grok what the issue was with the unwrap, though...?

Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.

It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.


While I agree that Rust got it right by being more explicit, a lot of bugs in C/C++ can also easily be avoided with good engineering practices. The Rust argument that with C/C++ it is mainly the fault of the programming language was always a huge and unfair exaggeration. Now, with this entirely predictable ".unwrap" disaster (in general, not necessarily this exact scenario), the "no true Rustacean would have put unwrap in production" fallacy is sad and funny at the same time.


Unwrap is controversial. The problem is that if you remove it, it makes the bar even higher for newcomers to Rust. One solution is to make it unsafe (along with panic).


> the "no true Rustacean would have put unwrap in production"

The "no unwrap" rule is common in most production codebases. Chill.


Could you point to one that is open source?


other people might say - why use unsafe rust - but we don't know the conditions of what the original code shipped under. why the pr was approved.

could have been tight deadline, managerial pressure or just the occasional slip up.



