
This thread warms my heart. Rust has set a new baseline that many, myself included, now take for granted.

We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.



Really not! This is a huge faceplant for writing things in Rust. If they had been writing their code in Java/Kotlin instead of Rust, this outage either wouldn't have happened at all (a failure to load a new config would have been caught by a defensive exception handler), or would have been resolved in minutes instead of hours.

The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.

Look at the error message Cloudflare's engineers were faced with:

     thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening.

A proxy stack written in a managed language with exceptions would have given an error message like this:

    com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
        at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
        at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours.

In the past I've been able to diagnose production problems from stack traces so many times that I have been expecting an outage like this ever since the trend away from providing exceptions in new languages began in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to the languages that need them (primarily Go and Rust):

https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...


That has nothing to do with exceptions, just the ability to unwind the stack. Rust can certainly give you a backtrace on panics; you don’t even have to write a handler to get it. I would find it hard to believe Cloudflare’s services aren’t configured to do it. I suspect they just didn’t put the entire message in the post.


https://doc.rust-lang.org/std/backtrace/index.html#environme...

tldr: Capturing a backtrace can be quite an expensive runtime operation, so the environment variables let you either forcibly disable this runtime performance hit or selectively enable it in some programs.

By default it is disabled in release mode.
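
For illustration, here's a minimal sketch (mine, not Cloudflare's code) of how a service can opt into panic backtraces unconditionally, independent of those environment variables, by installing a panic hook that calls Backtrace::force_capture:

    use std::backtrace::Backtrace;
    use std::panic;

    fn main() {
        // Force a backtrace on every panic, even in release builds where
        // RUST_BACKTRACE is unset: force_capture() ignores the env vars.
        panic::set_hook(Box::new(|info| {
            let bt = Backtrace::force_capture();
            eprintln!("panic: {info}\n{bt}");
        }));

        // Hypothetical failure, shaped like the one in the outage.
        let cfg: Result<u32, &str> = Err("too many features: 200 > 60");
        cfg.unwrap(); // panics; the hook above prints a full backtrace
    }

The frames are only symbolized if the binary keeps debug info, so a release profile would also need debug symbols enabled for the trace to be readable.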


It's one of the problems with using result types. You don't distinguish between genuinely exceptional events and things that are expected to happen often on hot paths, so the runtime doesn't know how much data to collect.


A panic is the exceptional event. It just so happens that Rust doesn't print a stack trace in release builds unless configured to do so.

Similarly, capturing a stack trace in an error type (within a Result, for example) is perfectly possible. But this is a choice left to the programmer, because capturing a trace is not cheap.
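
A rough sketch of that tradeoff (the type and names are invented): an error that pays for the trace once, at construction, and carries it through the Result:

    use std::backtrace::Backtrace;
    use std::fmt;

    // Hypothetical error type that records where it was created.
    #[derive(Debug)]
    struct TooManyFeatures {
        count: usize,
        limit: usize,
        backtrace: Backtrace,
    }

    impl fmt::Display for TooManyFeatures {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "too many features: {} > {}", self.count, self.limit)
        }
    }

    fn load_features(n: usize) -> Result<(), TooManyFeatures> {
        const LIMIT: usize = 60;
        if n > LIMIT {
            // Backtrace::capture() respects RUST_BACKTRACE; this is the
            // cost the programmer explicitly opts into on the error path.
            return Err(TooManyFeatures {
                count: n,
                limit: LIMIT,
                backtrace: Backtrace::capture(),
            });
        }
        Ok(())
    }

    fn main() {
        if let Err(e) = load_features(200) {
            // With RUST_BACKTRACE=1 the captured trace is populated.
            eprintln!("{e}\n{}", e.backtrace);
        }
    }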


There's clearly a big gap in how things are done in practice. You wouldn't see anyone call System.exit in a managed language if a data file was bigger than expected. You'd always get an exception.

I used to be an SRE at Google. Back then we also had big outages caused by bad data files pushed to prod. It's a common enough issue, so I really sympathize with Cloudflare; it's not nice to be on call for issues like that. But Google's prod environments always generated stack traces for every kind of failure, including CHECK failures (panics) in C++. You could also reflect the stack traces of every thread via HTTP. I used to diagnose bugs in production under time pressure quite regularly using just these tools. You always need detailed diagnostics.

Languages shouldn't have panics, tbh; it's a primitive concept, and it so rarely makes sense to handle errors that way. I know there's a whole body of Rust/Go lore claiming panics are fine, but it's not a good move, and it's one of the reasons I've stayed away from Go over the years and wouldn't use Rust for anything higher-level than embedded components or operating-system code that has to export a C ABI. You always want diagnostics and recoverable errors; this kind of micro-optimization doesn't make sense outside of the extremely constrained embedded environments that very few of us work in.


A panic in Rust is the same as an exception in C++. You can catch it all the same.

https://doc.rust-lang.org/std/panic/index.html

An uncaught exception in C++ or an uncaught panic in Rust terminates the program. The unwinding is the same mechanism. I think the implementation is what comes with LLVM, but I haven't checked.
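
A minimal example of that boundary (for illustration, not how any real proxy is structured):

    use std::panic;

    fn main() {
        // catch_unwind stops an unwinding panic right here, much like a
        // top-level catch block in C++ or Java would.
        let result = panic::catch_unwind(|| {
            let cfg: Result<u32, &str> = Err("bad config");
            cfg.unwrap() // panics
        });

        match result {
            Ok(v) => println!("loaded config: {v}"),
            Err(_) => eprintln!("recovered from panic; keeping the old config"),
        }
    }

Note this only works when the code is compiled with the default panic = "unwind"; under panic = "abort" there is nothing to catch.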

I was also a Google SRE, and I liked the stacktrace facilities so much that I got permission to open source a library inspired from it: https://github.com/bombela/backward-cpp (I know I am not doing a great job maintaining it)

At Uber I implemented a similar stack trace introspection for RPC tasks via HTTP for Go services.

You can also catch a Go panic. Which we did in our RPC library at Uber.

It would be great for all of that to somehow come ready made though. A sort of flag "this program is a service, turn on all the good diagnostics, here is my main loop".
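
As a hypothetical sketch of what such a flag could expand to (all names invented): install a verbose panic hook, then run each unit of work inside catch_unwind so one bad input fails a single request rather than the whole process:

    use std::backtrace::Backtrace;
    use std::panic;

    // Hypothetical "run as a service" helper: verbose panic diagnostics
    // plus per-request recovery.
    fn serve(requests: Vec<String>, handle: fn(&str) -> String) {
        panic::set_hook(Box::new(|info| {
            eprintln!("panic: {info}\n{}", Backtrace::force_capture());
        }));
        for req in requests {
            match panic::catch_unwind(|| handle(&req)) {
                Ok(resp) => println!("ok: {resp}"),
                Err(_) => eprintln!("request failed; service keeps running"),
            }
        }
    }

    fn handler(req: &str) -> String {
        // Stand-in for real work; panics on oversized input.
        assert!(req.len() <= 60, "too many features: {} > 60", req.len());
        format!("processed {req}")
    }

    fn main() {
        serve(vec!["ok".to_string(), "x".repeat(200)], handler);
    }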


OK, so the issue is frameworks not catching panics and logging proper stack traces? Very cool that you made a library.


Alternatively you can look at actually innovative programming languages to peek at the next 20 years of innovation.

I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping an Option is potentially dangerous really warms my heart. I can’t wait for the complete meltdown when they discover effect systems in 2040.

To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and that even key providers like Cloudflare utterly fail at proper risk management.

Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy that we now teach people to swim before a transatlantic crossing while we still refuse to install lifeboats.

To me the situation has barely changed. The industry has refused to put strong reliability practices in place for decades, keeps significantly under-investing in tools that mitigate errors outside of a few fields where safety was taken seriously before software was a thing, and keeps hiding behind the excuse that we need to move fast and that safety is too complex and costly, while regulation remains extremely lenient.

I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?


Let's try to make effect systems happen quicker than that.

> I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?

Probably nothing, because most paying customers of Cloudflare are probably signing away their rights to sue Cloudflare for damages from being down for a while when they purchase Cloudflare's services (maybe some customers have SLAs with monetary values attached, I dunno). I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDoS protection they're buying.

Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see. Should it be illegal under US law to use `unwrap` in Rust code? Should it be illegal for any single internet services company to have more than X number of customers? A lot of the internet also breaks when AWS goes down because many people like to use AWS, so maybe they should be included in this regulatory framework too.


> I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDoS protection they're buying.

We have collectively agreed to a world where software service providers have no incentive to be reliable, since they are shielded from the consequences of their mistakes, and somehow we see it as acceptable that software has a ton of issues and defects. The side effect is that research into actually lowering the cost of safety has little return on investment. It doesn't have to be so.

> Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see.

I want software providers to be liable for the damage they cause, and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation-of-liability provisions and that people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.

How to do proper testing, formal methods, and resilient design has been known for decades. I would personally be more than okay with "let's move less fast and stop breaking things".


> I want software providers to be liable for the damage they cause, and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation-of-liability provisions and that people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.

So do you want to make it illegal to publish GNU GPL licensed software, because that license has a warranty disclaimer? Do you want to make it illegal for a company like Cloudflare to use open source software with similar warranty disclaimers, or for the SLAs they make with their own paying customers, and the penalties for violating them, to be legally unenforceable? What if I just have a personal website and I break the JavaScript on it because I was careless; how should that be legally treated?

I'm not against research into more reliable software or using better engineering techniques that result in more reliable software. What I'm concerned about is the regulatory regime - in other words, what software it is or is not legal to write or sell for money - and how to properly incentivize software service providers to use techniques that result in more reliable software without causing a bunch of bad second order effects.


I absolutely do not mind, yes.

You can't go out in the middle of your city, build a shoddy bridge, say you waive all responsibility, and then wash your hands of the consequences when it predictably breaks. Why can you do that with pieces of software?

Limiting the scope of liability waivers is not the same thing as censoring what software can be produced. It's just ensuring that everyone actually takes responsibility for the things they distribute.

As I said previously, the current situation doesn't make sense to me. People have been brainwashed into believing that the way software is released currently, half finished and riddled with bugs, is somehow normal and acceptable. It absolutely doesn't have to be this way.

It's beyond shameful that the average developer today is blissfully unaware of anything related to producing actually secure software. I am pretty sure I could walk into more than 90% of development shops today and no one there would know what formal methods are. With some luck, they might have some static analysers running, probably from a random vendor, and be happy with the crappy percentages they output.

It's not about research. It's about a field that entirely refuses to mature despite being pivotal to the modern economy. And why would it? Software products somehow get a free pass for the shit they push on everyone.

We are in the classic "market for lemons" trap, where negative externalities are not priced in and investing in security just makes you lose against companies that don't care. Every major incident reminds us we need a way out. The market has already shown it won't self-correct. It's a classic case where regulatory intervention is necessary and legitimate.

The shift is already happening, by the way. The EU Product Liability Directive was adopted in 2024 and its transition period ends in December 2026. The US "National Cybersecurity Strategy" signals intent to review the status quo. It's coming faster than people realise.


I find myself in the odd position of agreeing with you both.

That we’re even having this discussion is a major step forward. That we’re still having this discussion is a depressing testament to how slowly the mainstream adopts better ideas.


I agree with you. But considering nobody learns any real engineering in software, myself solidly included, this is still an improvement.

But yes, I wish I had learned more, somehow stumbled upon all the good stuff, or been taught at university at least what Rust achieves today.

I think it has to be noted that Rust still delivers performance along with the safety it provides. So that's something, maybe.


> I can’t wait for the complete meltdown when they discover effect systems in 2040

Zig is undergoing this meltdown. Shame it's not memory safe. You can only get so far in developing programming wisdom before Eternal September kicks in and we're back to re-learning all the lessons of history as punishment for the youthful hubris that plagues this profession.



