> What happens to it up the callstack? as they say in the post, these files get ...

hedora · 2025-11-19T03:23:25 1763522605

Given that the bug was elsewhere in the system (the config file parser spuriously failed), it’s hard to justify much of what you suggested.

Panics should be logged, and probably grouped by stack trace for things like prometheus (outside of process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at cloudflare scale.

Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.

One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.

Sure, you could add an error handler for that too, and for prometheus is being slow, and an infinite other things. Or, you could just move process management and reporting out of process.

frumplestlatz · 2025-11-20T08:14:16 1763626456

Writing bad code that doesn’t handle errors and doesn’t correctly model your actual runtime invariants doesn’t simplify anything other than the amount of thought you have to put into writing the code — because you’re writing broken code.

The solution to this problem wasn’t restarting the failing process. It was correctly modeling the failure case, so that then the type system forced you to correctly handle it.