> you're going to have multiple outages
us: 0, aws: 1. Looking good so far ;)
> AND incur more cross-internet costs
Hetzner has no bandwidth/traffic limit (only a speed cap) on the machine, so we can go nuts.
I understand your point wrt the cloud, but I spend as much time debugging/building a cloud deployment (atlas :eyes: ) as I do a self-hosted solution. AWS gives you all the tools to build a super reliable data store, but many people just chuck something on us-east-1 and go. There's your single point of failure.
Given we're constructing a many-node decentralised system, self-hosted actually makes more sense for us because we've already had to become familiar enough to create a many-node system for our primary product.
When/if we have a situation where we need high data availability I would strongly consider the cloud, but in the situations where you can deal with a bit of downtime you're massively saving over cloud offerings.
We'll post a 6-month and 1-year follow-up to update the scoreboard above
> many people just chuck something on us-east-1 and go
Even dropping something on a single EC2 node in us-east-1 (or at Google Cloud) is going to be more reliable over time than a single dedicated machine elsewhere.
This is because they run with a layer that will e.g. live migrate your running apps in case of hardware failures.
The failure modes of dedicated are quite different than those of the modern hyperscaler clouds.
It's not an apples-to-apples comparison, because EC2 and Google Cloud have ephemeral disk; persistent disk is an add-on, implemented with a complex and frequently changing distributed storage system.
On the other hand, a Hetzner machine I just rented came with Linux software RAID already enabled (md devices in the kernel).
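Those md arrays are easy to keep an eye on via /proc/mdstat; here's a minimal sketch (Python, assuming the usual mdstat layout) that flags a degraded array:

```python
# Minimal sketch: parse /proc/mdstat and flag degraded md arrays.
# A healthy two-disk mirror reports "[UU]"; a missing member shows as "_".
import re

with open("/proc/mdstat") as f:
    mdstat = f.read()

# Each array section looks like "md0 : active raid1 sda1[0] sdb1[1]" followed
# by a status line ending in something like "[2/2] [UU]".
for name, status in re.findall(r"^(md\d+) :.*?\[([U_]+)\]", mdstat, re.M | re.S):
    state = "DEGRADED" if "_" in status else "ok"
    print(f"{name}: [{status}] {state}")
```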
---
I'm not aware of any comparisons, but I'd like to see some
It's not straightforward, and it's not obvious the cloud is more reliable
The cloud introduces many other single points of failure, by virtue of being more complex
e.g. human administration failure, as with the UniSuper incident
Of course, dedicated hardware could have a similar type of failure, but I think the simplicity means there is less variety in the errors.
e.g. "A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable" - Leslie Lamport
I just wish there was a way to underscore this more and more. Complex systems fail in complex ways. Sadly, for many programmers, the thrill or ego boost that comes with solving/managing complex problems lets us believe complex is better than simple.
One side effect of devops over the last 10-15 years that I've noticed, as dev and ops converged, is that infrastructure complexity exploded as the old-school pessimistic sysadmin culture of simplicity and stability gave way to a much more optimistic dev culture. Better tooling also enabled increased complexity in a self-fulfilling feedback loop, as more complexity in turn demanded better tooling.
Anecdotal, but a year ago we lost the whole RAID array in a rented Hetzner server to some hardware failure.
In a way, I think it doesn't matter what you use as long as you diversify enough (and have lots of backups), as everything can fail, and often the probability of failure doesn't even matter that much as any failure can be one too many.
Let's host it all with 2 companies instead and see how it goes.
Anyway, random things you will encounter:
Azure doesn't work because Front Door has issues (again, and again).
A web app in Azure just randomly stops working, it's not live migrated by any means, restarts don't work. Okay, let's change the SKU, change it back, oops, it's on a different bare-metal cluster and now it works again. Sure, there'll be some setup (read: upsell) that'll prevent such failures from reaching customers, but there is simply no magic to any of this.
Really wish people would stop dreaming up reasons that hyperscalers are somehow magical places where issues don't happen and everything is perfect if you just increase the complexity a little bit more the next time around.
Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
The typical failure mode of AWS is much better: half the internet is down, so you just point at that and wait for everything to come back, and your instances just keep running. If you have one server you have to do the troubleshooting and recovery work yourself. But you need to run more than one machine to get more nines of reliability.
> Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
A couple pieces of gentle pushback here:
- If you choose a hyperscaler, you should use their (often one-click) geographic redundancy & failover.
- All of the hyperscalers have more than one AZ. Specifically, there's no reason for any AWS customer to locate all/any* of their resources in us-east-1. (I actively recommend against this; see the sketch after this list.)
* - Except for the small number of services only available in us-east-1, obviously.
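To make that concrete, here's a rough sketch (boto3, with made-up identifiers, not anyone's actual setup) of picking a non-default region and using the built-in Multi-AZ redundancy instead of a single node in us-east-1:

```python
# Hypothetical sketch: enumerate AZs in a non-default region and provision a
# Multi-AZ database, i.e. the "one-click" redundancy mentioned above.
import boto3

# List the availability zones in a region other than us-east-1.
ec2 = boto3.client("ec2", region_name="eu-west-1")
zones = [z["ZoneName"] for z in ec2.describe_availability_zones()["AvailabilityZones"]]
print(zones)  # e.g. ['eu-west-1a', 'eu-west-1b', 'eu-west-1c']

# MultiAZ=True keeps a standby in a second AZ and fails over automatically;
# the identifier and credentials here are placeholders.
rds = boto3.client("rds", region_name="eu-west-1")
rds.create_db_instance(
    DBInstanceIdentifier="example-db",
    Engine="postgres",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=50,
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",  # use a secrets manager in practice
    MultiAZ=True,
)
```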
Hetzner also offers more than one datacenter, which you should obviously use if you want geographic redundancy. But the comment I was replying to was saying "Even dropping something on a single EC2 node in us-east-1", and for a single EC2 node in us-east-1 none of the things you are mentioning are possible without violating the premise.
Thanks for sharing the story and committing to a 6-month and 1-year follow-up. We will definitely be interested to hear how it goes over time.
In the meantime, I am curious where the time was spent debugging and building Atlas deployments? It certainly isn't the cheapest option, but it has been quite a '1-click' solution for us.
I’m curious about the resilience bit. Are you planning on some sort of active-active setup with mongo? I found it difficult on AWS to even do active-passive (I guess that was DocDB), since programmatically changing the primary write node instance was kind of a pain when failing over to a new region.
Going into any depth with mongo mostly taught me to just stick with postgres.
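For what it's worth, on a plain MongoDB replica set (not DocumentDB) moving the primary programmatically usually comes down to raising the target member's priority and reconfiguring. A minimal sketch with pymongo, using made-up hostnames:

```python
# Hypothetical sketch: promote "node-b" to primary by giving it the highest
# priority and re-applying the replica set config. Hostnames are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://node-a:27017,node-b:27017,node-c:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]
for member in config["members"]:
    member["priority"] = 10 if member["host"].startswith("node-b") else 1

config["version"] += 1  # reconfig requires bumping the config version
client.admin.command("replSetReconfig", config)  # triggers an election
```

DocumentDB is its own thing and, as far as I know, doesn't expose replSetReconfig; failover there goes through the AWS API instead, which is presumably part of why it felt painful.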