
This was true for RDS Serverless v1, which scaled to 0 but is no longer offered. V2 requires a minimum 0.5 ACU hourly commit ($40+/mo).

V2 scales to zero as of last year.

https://aws.amazon.com/blogs/database/introducing-scaling-to...

It only scales down after a period of inactivity though - it’s not pay-per-request like other serverless offerings. DSQL looks to be more cost effective for small projects if you can deal with the deviations from Postgres.


Ah, good to know, I hadn't seen that V2 update. Looks like a minimum of 5 minutes of inactivity before auto-pause (i.e., scale to 0), and any connection attempt (valid or not) resumes the DB.
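For anyone who wants to try it, this is roughly what enabling it looks like in boto3. The cluster identifier and capacity values are placeholders, and SecondsUntilAutoPause assumes a boto3 version recent enough to include the auto-pause parameter, so treat this as a sketch rather than a verified config:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # MinCapacity of 0 ACUs allows auto-pause; SecondsUntilAutoPause controls
    # how long the cluster must be idle before pausing (minimum is 300s).
    rds.modify_db_cluster(
        DBClusterIdentifier="my-aurora-cluster",  # placeholder
        ServerlessV2ScalingConfiguration={
            "MinCapacity": 0.0,
            "MaxCapacity": 4.0,
            "SecondsUntilAutoPause": 300,
        },
        ApplyImmediately=True,
    )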

As part of my CS grad research, I launched a website reporting on public cloud availability and performance.

https://cloudlooking.glass/#show=uptime


I'm working on graduate research evaluating AWS control and data plane performance.

EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s for other types.

1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2 is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms write.

References: https://cloudlooking.glass/matrix/#aws.ebs.us-east-1--cp--at... https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...
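For context, the attach-time measurement is essentially just timing the attach call until the attachment reports "attached". A stripped-down sketch of that (not the actual test harness; the volume/instance IDs and device name are placeholders):

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUME_ID = "vol-0123456789abcdef0"    # placeholder
    INSTANCE_ID = "i-0123456789abcdef0"    # placeholder

    start = time.monotonic()
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device="/dev/sdf")

    # Poll until the attachment state reaches 'attached'
    while True:
        vol = ec2.describe_volumes(VolumeIds=[VOLUME_ID])["Volumes"][0]
        attachments = vol.get("Attachments", [])
        if attachments and attachments[0]["State"] == "attached":
            break
        time.sleep(1)

    print(f"attach time: {time.monotonic() - start:.1f}s")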


You might want to add the bit from the blog about worst-case attach times to your research. From my own experience (though it was years ago), sometimes an EBS volume would fail and simply never return. That definitely won't be acceptable for some use cases.


Yes, we've been testing volume attachments every 5m since the start of the year, and have experienced 100-150 attachment failures per volume type in that time frame, during multiple events (most recently last week):

https://cloudlooking.glass/dashboard/#aws.ebs.us-east-1--cp-...

Another interesting bit: last March AWS changed something in the control plane that both triggered a multi-day LSE and ultimately increased attachment times from 2-3s to 10-20s (also visible in the graphs).


Lambda create-function control plane operations are still failing with InternalError for us - other services have recovered (the Lambda data plane, SNS, SQS, EFS, EBS, and CloudFront). Cloud availability is the subject of my CS grad research; I wrote a quick post summarizing the event timeline and blast radius as I've observed it from testing in multiple AWS test accounts: https://www.linkedin.com/pulse/analyzing-aws-us-east-1-outag...
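For context, the create-function check is basically a timed boto3 call; a simplified sketch of it is below (the function name and role ARN are placeholders, and the real probe also deletes the function afterward):

    import io
    import zipfile
    import boto3
    from botocore.exceptions import ClientError

    lam = boto3.client("lambda", region_name="us-east-1")

    # Build a minimal deployment package in memory
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("handler.py", "def handler(event, context):\n    return 'ok'\n")

    try:
        lam.create_function(
            FunctionName="probe-create-function",              # placeholder
            Runtime="python3.12",
            Role="arn:aws:iam::123456789012:role/probe-role",  # placeholder
            Handler="handler.handler",
            Code={"ZipFile": buf.getvalue()},
        )
        print("create-function succeeded")
    except ClientError as e:
        # During the outage this surfaced as InternalError / 5xx responses
        print("create-function failed:", e.response["Error"]["Code"])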


For networking, the site only reports uptime % for zonal, regional, cross-region, or cross-cloud tests. It excludes last-mile network tests, as those fail frequently due to many hops and endpoint unreliability (we use RIPE Atlas and Globalping.io endpoints, which are not always reliable even with redundant probes per test).


For the data and control planes I can determine the issue from the API request/response logs (e.g., network timeout, 5xx, etc.). Network tests are trickier and we don't have a great way to validate the failure cause for each of those events (i.e., we don't capture a traceroute on failure), other than to evaluate results from multiple endpoint combinations (e.g., AWS us-east-1 to us-west-1 fails while us-east-2 to us-west-1 succeeds).
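To make "determine the issue from the logs" concrete, the classification is roughly along these lines (simplified sketch; the real pipeline has more categories):

    from botocore.exceptions import (
        ClientError,
        ConnectTimeoutError,
        EndpointConnectionError,
        ReadTimeoutError,
    )

    def classify_failure(exc):
        """Map a botocore exception to a coarse failure category."""
        if isinstance(exc, (ConnectTimeoutError, ReadTimeoutError)):
            return "network-timeout"
        if isinstance(exc, EndpointConnectionError):
            return "network-unreachable"
        if isinstance(exc, ClientError):
            status = exc.response["ResponseMetadata"]["HTTPStatusCode"]
            if status >= 500:
                return "server-error-5xx"
            if status == 429:
                return "throttled"
            return "client-error-4xx"
        return "unknown"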


Happy to answer any questions also.


$8549 with 1TB storage


It can connect to external storage easily.


> Hardware can fail for all kinds of reasons

Complex cloud infra can also fail for all kinds of reasons, and those failures are often harder to troubleshoot than a hardware failure. My experience with server-grade hardware in a reliable colo with a good uplink is that it's generally an extremely reliable combination.


And my experience is the opposite, on both counts. I guess it's moot because two anecdotes cancel each other out?

Cloud VMs fail from either the instance itself not coming back online, an EBS failure, or some other AZ-wide or region-wide failure that affects networking or the control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance, restoring from a snapshot or backup. It's ridiculously easier to recover from than dealing with an on-prem hardware failure, and actually reliable, as there's always capacity [I guess barring GPU-heavy instances].

"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.


There is a lot of statistical and empirical data on this topic - MTBF estimates from vendors (typically 100k - 1m+ hours), Backblaze and Google drive failure data (~1-2% annual failure rate), IEEE studies, and others. With N+1 redundancy (backup servers/RAID + spare drives) and proper design and change control processes, operational failures should be very rare (rough numbers sketched below).
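As a quick back-of-envelope, the 12-drive array size and 1.5% AFR below are purely illustrative assumptions, just to make the numbers concrete:

    # Probability that at least one of N independent drives fails in a year,
    # assuming a 1.5% annual failure rate per drive (illustrative numbers).
    afr = 0.015
    n_drives = 12

    p_any_failure = 1 - (1 - afr) ** n_drives
    print(f"P(at least one failure/year): {p_any_failure:.1%}")   # ~16.6%

    # With a hot spare and prompt replacement, only overlapping failures hurt;
    # the window for a second failure during a ~24h rebuild is tiny:
    p_second_during_rebuild = 1 - (1 - afr / 365) ** (n_drives - 1)
    print(f"P(second failure during rebuild): {p_second_during_rebuild:.3%}")  # ~0.045%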

With cloud, hardware issues are just the start - yes, you MUST "plan for failure", leveraging load balancers, Auto Scaling, CloudWatch, and dozens of other proprietary dials and knobs. However, you must also consider the control plane, quotas, capacity, IAM, spend, and other non-hardware breaking points.

Your autoscaling isn't working - is the AZ out of capacity, did you hit a quota limit, run out of IPv4s, or was an AMI inadvertently removed? Your instance is unable to write to S3 - is the metadata service being flaky (for your IAM role), or is it due to an IAM role / S3 policy change? Your Lambda function is failing - did it hit a timeout, or exhaust the (512MB) temp storage? Need help diagnosing an issue - what is your paid support tier? Submit a ticket and we'll get back to you sometime in the next 24 hours.
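To illustrate the kind of digging involved, here's a rough sketch of checking two of those dials with boto3. The ASG name is a placeholder, and the quota code shown is the commonly cited one for standard On-Demand instance families, so verify it for your own account:

    import boto3

    asg = boto3.client("autoscaling", region_name="us-east-1")
    quotas = boto3.client("service-quotas", region_name="us-east-1")

    # Recent scaling activities often state the failure reason directly
    # (insufficient capacity, missing AMI, etc.)
    acts = asg.describe_scaling_activities(AutoScalingGroupName="my-asg")  # placeholder
    for a in acts["Activities"][:5]:
        print(a["StatusCode"], "-", a.get("StatusMessage", ""))

    # Check the running On-Demand vCPU quota
    q = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
    print("vCPU quota:", q["Quota"]["Value"])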


If you have a global user base, depending on your workload, a simple CDN in front of your hardware can often go a long way with minimal cost and complexity.


> If you have a global user base, depending on your workload, a simple CDN in front of your hardware can often go a long way with minimal cost and complexity.

Let's squint hard enough to pretend a CDN does not qualify as "the cloud". That alone requires a lot of goodwill.

A CDN distributes read-only content. Any use case that requires interacting with a service is automatically excluded.

So, no.


> Any use case that requires interacting with a service is automatically excluded

This isn't correct. Many applications consist of a mix of static and dynamic content. Even dynamic content is often cacheable for a time. All of this can be served by a CDN (using TTLs), which is a much simpler and more cost-effective solution than multi-region cloud infra, with the same performance benefits.
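A minimal sketch of the idea, using Flask purely as a stand-in origin (the routes, TTL values, and price lookup are made up): the CDN honors whatever Cache-Control the origin sends, so static assets get a long TTL and dynamic responses get a short one.

    import time
    from flask import Flask, jsonify

    app = Flask(__name__)

    def fetch_latest_prices():
        # Stand-in for whatever dynamic lookup the origin actually does
        return {"widget": 9.99, "as_of": int(time.time())}

    @app.route("/api/prices")
    def prices():
        # Dynamic but tolerably stale: a short TTL lets the CDN absorb most traffic
        resp = jsonify(fetch_latest_prices())
        resp.headers["Cache-Control"] = "public, max-age=30, stale-while-revalidate=60"
        return resp

    @app.route("/assets/logo.svg")
    def logo():
        # Long-lived static content: cache at the edge for a day
        resp = app.send_static_file("logo.svg")
        resp.headers["Cache-Control"] = "public, max-age=86400"
        return resp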

