> Workers KV is in the process of being transitioned to significantly more resilient infrastructure for its central store: regrettably, we had a gap in coverage which was exposed during this incident.
I heard that it was a "mandatory dependency" to mitigate "insider risk" or something. There's no way it's going anywhere. Odds are they'll just enforce even slower rollouts "to catch things early".
Where did you see this information? Was it on a social media channel? I do see the IAM services in the list of affected services in the incident report.
The comment was self-explanatory, and no, it wasn't a widespread GCP outage. Almost everything was up except for GCS and Firebase; later on, identity stuff started causing cascading issues, but not when this was posted.
Incident affecting API Gateway, Agent Assist, AlloyDB for PostgreSQL, Apigee, Apigee Edge Private Cloud, Apigee Edge Public Cloud, Apigee Hybrid, Cloud Data Fusion, Cloud Firestore, Cloud Logging, Cloud Memorystore, Cloud Monitoring, Cloud Run, Cloud Security Command Center, Cloud Shell, Cloud Spanner, Cloud Workstations, Contact Center AI Platform, Contact Center Insights, Data Catalog, Database Migration Service, Dataform, Dataplex, Dataproc Metastore, Datastream, Dialogflow CX, Dialogflow ES, Google App Engine, Google BigQuery, Google Cloud Bigtable, Google Cloud Composer, Google Cloud Console, Google Cloud DNS, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Compute Engine, Identity Platform, Identity and Access Management, Looker Studio, Managed Service for Apache Kafka, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Persistent Disk, Personalized Service Health, Pub/Sub Lite, Speech-to-Text, Text-to-Speech, Vertex AI Search
Our entire infra in GCP stayed up just fine, we just couldn't manage anything. IDK what to tell you. Many of the things you list here were not down at all.
That it wasn’t down for you does not mean it wasn’t down for others or even almost everyone. Certainly, Google wouldn’t have listed the services as having an outage if nobody was impacted. You can’t extrapolate from “works for me” to “it must have been working for everyone”.
I don't understand your argument? Wasn't GCP's own status page calling them outages? Some of our upstream providers (who use GCP) were definitely affected and down.
As a former SRE there, is "widespread outage" a specific, special kind of classification that's not obvious to the public just by looking at the status page...? Or what do you mean?
I gotta say, it's kinda nice when that happens... work just kinda pauses for everyone, from providers to customers. It kinda feels like a national holiday, and everyone downstream from the affected cloud can just kinda sit back and relax cuz there's nothing they can do anyway except wait.
When it's your own outage, it's all-hands-on-deck panic mode. When it's half the internet down, it's no longer your problem, lol
I guess it depends on what your company's acceptable level of downtime is. If you're like Cloudflare (who handled this well), you take this as a sign to build fault tolerance around your 3rd party providers.
If your application is mission-critical, downtime is anything but a holiday.
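To make "fault tolerance around your 3rd party providers" a bit more concrete, here's a rough sketch of one common pattern: read through the provider, and serve the last known good value when it errors. This is not how Cloudflare does it; `ProviderDown`, `FallbackReader`, and the `fetch` callable are all made-up names for illustration.

```python
from typing import Callable, Dict, Optional


class ProviderDown(Exception):
    """Hypothetical error a third-party client raises when the provider is unreachable."""


class FallbackReader:
    """Read through a primary provider; serve the last known good value on failure."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                     # assumed: a call into the third-party store
        self._last_good: Dict[str, str] = {}    # local copy of values that were read successfully

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._fetch(key)
        except ProviderDown:
            # Provider is down: return a possibly stale value, or None if never seen.
            return self._last_good.get(key)
        self._last_good[key] = value
        return value
```

The obvious trade-off is staleness: that's probably fine for config or assets, but risky for anything like auth decisions, where serving an old answer can be worse than failing.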
Yep, KV is broken too. Any worker that depends on KV is throwing exceptions. I was able to get into the dash, but it's very slow. Error rates started to go up significantly around 18:00 UTC.
It's not much of a reach to go from "discussion about impact on human-verification dialogs" to "discussion about human-verification dialog policy". This isn't an incident-management channel, it's a discussion forum - tangents are fine!
I complained in the apnews.com thread, because the apnews.com verification, which is annoying by itself, did not work at all this time. That is hardly unrelated.
Unrelated; they have a few services that rely on GCP, which is down. Still, I imagine the people working on the maintenance for Tokyo turned white during that job, worried it was caused by them...
I really do appreciate the transparency and ownership that comes with these. We all fuck up, but a lot of companies would rather hide their mistakes than own up to them. Cloudflare's approach makes me trust them more.
So both Cloudflare authentication and Google's identity systems suffered major downtime yesterday. Are there technical dependencies between these?
Cloudflare doesn't say this directly, but in their blog they've written:
> The cause of this outage was due to a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and relied upon for configuration, authentication and asset delivery across the affected services. Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted availability of our KV service.
Proxy seems available in general; it must just be local to Workers, because only one of my sites, which goes thru a ZT tunnel with identity access rules, is affected.
So they depend on GCP for (some of) their services