> you're going to have multiple outages
us: 0, aws: 1. Looking good so far ;)
> AND incur more cross-internet costs
Hetzner has no bandwidth/traffic limit (only a speed cap) on the machine, so we can go nuts.
I understand your point wrt the cloud, but I spend as much time debugging/building a cloud deployment (atlas :eyes: ) as I do a self-hosted solution. AWS gives you all the tools to build a super reliable data store, but many people just chuck something on us-east-1 and go. There's your single point of failure.
Given we're constructing a many-node decentralised system, self-hosted actually makes more sense for us because we've already had to become familiar enough to create a many-node system for our primary product.
When/if we have a situation where we need high data availability I would strongly consider the cloud, but in the situations where you can deal with a bit of downtime you're massively saving over cloud offerings.
We'll post a 6-month and 1-year follow-up to update the scoreboard above
> many people just chuck something on us-east-1 and go
Even dropping something on a single EC2 node in us-east-1 (or at Google Cloud) is going to be more reliable over time than a single dedicated machine elsewhere.
This is because they run with a layer that will e.g. live migrate your running apps in case of hardware failures.
The failure modes of dedicated are quite different than those of the modern hyperscaler clouds.
It's not an apples-to-apples comparison, because EC2 and Google Cloud have ephemeral disk; persistent disk is an add-on, implemented with a complex and frequently changing distributed storage system.
On the other hand, a Hetzner machine I just rented came with Linux software RAID already enabled (md devices in the kernel).
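Those md arrays are easy to keep an eye on via /proc/mdstat; here's a minimal sketch (Python, assuming the usual mdstat layout) that flags a degraded array:

```python
# Minimal sketch: parse /proc/mdstat and flag degraded md arrays.
# A healthy two-disk mirror reports "[UU]"; a missing member shows as "_".
import re

with open("/proc/mdstat") as f:
    mdstat = f.read()

# Each array section looks like "md0 : active raid1 sda1[0] sdb1[1]" followed
# by a status line ending in something like "[2/2] [UU]".
for name, status in re.findall(r"^(md\d+) :.*?\[([U_]+)\]", mdstat, re.M | re.S):
    state = "DEGRADED" if "_" in status else "ok"
    print(f"{name}: [{status}] {state}")
```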
---
I'm not aware of any comparisons, but I'd like to see some
It's not straightforward, and it's not obvious the cloud is more reliable
The cloud introduces many other single points of failure, by virtue of being more complex
e.g. human administration failure, as with the UniSuper incident
Of course, dedicated hardware could have a similar type of failure, but I think the simplicity means there is less variety in the errors.
e.g. "A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable" - Leslie Lamport
I just wish there was a way to underscore this more and more. Complex systems fail in complex ways. Sadly, for many programmers, the thrill or ego boost that comes with solving/managing complex problems lets us believe complex is better than simple.
One side effect of devops over the last 10-15 years that I've noticed, as dev and ops converged, is that infrastructure complexity exploded as the old-school pessimistic sysadmin culture of simplicity and stability gave way to a much more optimistic dev culture. Better tooling also enabled increased complexity in a self-fulfilling feedback loop, as more complexity in turn demanded better tooling.
Anecdotal, but a year ago we lost the whole RAID array in a rented Hetzner server to some hardware failure.
In a way, I think it doesn't matter what you use as long as you diversify enough (and have lots of backups), as everything can fail, and often the probability of failure doesn't even matter that much as any failure can be one too many.
Let's host it all with 2 companies instead and see how it goes.
Anyway, random things you will encounter:
Azure doesn't work because Front Door has issues (again, and again).
A web app in Azure just randomly stops working, it's not live migrated by any means, restarts don't work. Okay, let's change the SKU, change it back, oops, it's on a different bare-metal cluster and now it works again. Sure, there'll be some setup (read: upsell) that'll prevent such failures from reaching customers, but there is simply no magic to any of this.
Really wish people would stop dreaming up reasons that hyperscalers are somehow magical places where issues don't happen and everything is perfect if you just increase the complexity a little bit more the next time around.
Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
The typical failure mode of AWS is much better: half the internet is down, so you just point at that and wait for everything to come back, and your instances just keep running. If you have one server you have to do the troubleshooting and recovery work yourself. But you need to run more than one machine to get more nines of reliability.
> Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
A couple pieces of gentle pushback here:
- If you choose a hyperscaler, you should use their (often one-click) geographic redundancy & failover.
- All of the hyperscalers have more than one AZ. Specifically, there's no reason for any AWS customer to locate all/any* of their resources in us-east-1. (I actively recommend against this; see the sketch after this list.)
* - Except for the small number of services only available in us-east-1, obviously.
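To make that concrete, here's a rough sketch (boto3, with made-up identifiers, not anyone's actual setup) of picking a non-default region and using the built-in Multi-AZ redundancy instead of a single node in us-east-1:

```python
# Hypothetical sketch: enumerate AZs in a non-default region and provision a
# Multi-AZ database, i.e. the "one-click" redundancy mentioned above.
import boto3

# List the availability zones in a region other than us-east-1.
ec2 = boto3.client("ec2", region_name="eu-west-1")
zones = [z["ZoneName"] for z in ec2.describe_availability_zones()["AvailabilityZones"]]
print(zones)  # e.g. ['eu-west-1a', 'eu-west-1b', 'eu-west-1c']

# MultiAZ=True keeps a standby in a second AZ and fails over automatically;
# the identifier and credentials here are placeholders.
rds = boto3.client("rds", region_name="eu-west-1")
rds.create_db_instance(
    DBInstanceIdentifier="example-db",
    Engine="postgres",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=50,
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",  # use a secrets manager in practice
    MultiAZ=True,
)
```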
Hetzner also offers more than one datacenter, which you should obviously use if you want geographic redundancy. But the comment I was replying to was saying "Even dropping something on a single EC2 node in us-east-1", and for a single EC2 node in us-east-1 none of the things you are mentioning are possible without violating the premise.
Thanks for sharing the story and committing to a 6-month and 1-year follow-up. We will definitely be interested to hear how it goes over time.
In the meantime, I am curious where the time was spent debugging and building Atlas deployments? It certainly isn't the cheapest option, but it has been quite a '1-click' solution for us.
I’m curious about the resilience bit. Are you planning on some sort of active-active setup with mongo? I found it difficult on AWS to even do active-passive (I guess that was DocDB), since programmatically changing the primary write node instance was kind of a pain when failing over to a new region.
Going into any depth with mongo mostly taught me to just stick with postgres.
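For what it's worth, on a plain MongoDB replica set (not DocumentDB) moving the primary programmatically usually comes down to raising the target member's priority and reconfiguring. A minimal sketch with pymongo, using made-up hostnames:

```python
# Hypothetical sketch: promote "node-b" to primary by giving it the highest
# priority and re-applying the replica set config. Hostnames are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://node-a:27017,node-b:27017,node-c:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]
for member in config["members"]:
    member["priority"] = 10 if member["host"].startswith("node-b") else 1

config["version"] += 1  # reconfig requires bumping the config version
client.admin.command("replSetReconfig", config)  # triggers an election
```

DocumentDB is its own thing and, as far as I know, doesn't expose replSetReconfig; failover there goes through the AWS API instead, which is presumably part of why it felt painful.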