It's been 20+ years since I had reason to know the internals of BGP, but I still carry a slight nervousness whenever I think about it.
In many ways, it's beautiful in that it allows Autonomous Systems to come together and independently make the Internet.
However, I always felt it was brittle in that it relied on every AS to have edge routers aware of the entire AS routing map, and so when those routers went down hard, they came back up blank and needed to learn the Internet again. I noted back then that on some routers a single spoofed UDP packet to a router receiving a table from its peers could cause it to stop doing that and start over. In a time when those updates could take hours... well... that kept me awake at night sometimes.
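To make that "blank slate" failure mode concrete, here's a toy sketch in Python (nothing like real router code; the class and the documentation prefix are invented purely for illustration): the table is entirely learned state, so a hard reset means the router knows nothing until its peers re-advertise every prefix.

  # Toy model: a BGP speaker's table is learned state, not persisted config.
  class ToyBgpSpeaker:
      def __init__(self):
          self.rib = {}                       # prefix -> next hop

      def receive_update(self, prefix, next_hop):
          self.rib[prefix] = next_hop         # learn (or replace) a route

      def session_reset(self):
          self.rib.clear()                    # hard reset: all knowledge gone

      def lookup(self, prefix):
          return self.rib.get(prefix)         # None means "no route, drop it"

  r = ToyBgpSpeaker()
  r.receive_update("192.0.2.0/24", "peer-A")
  r.session_reset()
  print(r.lookup("192.0.2.0/24"))             # None until peer-A re-sends it

With a full Internet table, that re-learning is on the order of hundreds of thousands of prefixes per peer today, which is where the "hours" came from.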
Propagation of routes was global and error-prone. Even today, you can see on the looking glass websites rogue AS routers advertising crazy routes, either by accident or on purpose as a hostile act (to try and get traffic for a target network through itself).
There are tens of thousands of engineers globally nursing and managing this stuff to keep the whole thing going.
Like I said, it's been a while since I last looked closely, and I imagine multiple improvements have been made in this space since that time, but if there was ever a protocol that needs a long, cold hard look for a replacement, it's quite possibly BGP.
It's also the one - and only - non-crypto situation where if somebody asked me if a blockchain could be really very useful, I'd say "probably, yeah".
These are abstraction barriers that isolate locations, layers and protocols from each other. I don't understand this as translation but rather as a separation of concerns.
Thanks for not pushing back too hard on my (elderly?) ignorance, and giving me a chance to catch up with the current state of the art. This looks very interesting, and I am looking forward to spending time diving in out of nothing more than curiosity. Thank you.
Problem is the blockchain stuff usually ends up being computationally expensive for what it's trying to do.
I could maybe see a use case for DNS but then either each computer has to have the full blockchain locally or rely on providers to do that for them in which case we're basically at the same model we have now (with root servers and DNS servers).
Obviously you don't want to publish private IPs, etc. to other computers... so inside a local network you're not buying yourself much.
For routing tables I could see more of a use case for blockchain technology since you can actually verify the routing tables and prevent BGP spoofing, etc.
This is a great write-up, but one thing I don't understand is why the effect of withdrawing the BGP prefixes was instantaneous (if I understand that correctly), but it's taking hours (so far) to re-announce the prefixes. Why would it take so long to flip the switch back the other way?
At a guess: reconnecting traffic at the billions-of-people scale has the potential for finding all sorts of weird behaviours. For all we know, it has been connected and disconnected 10x already during the outage, with each reconnect overwhelming some new, deeper level of the system each time.
Reconnect and the GLBs fall apart under load as the entire world’s cadre of recursive resolvers hit you.
Fix that. Reconnect again. This time your LBs have marked half the servers as offline because their heartbeats have been failing.
Fix that. Reconnect again. Now all the memcache data is hours old and so the site business logic fetches straight from databases, knocking them over.
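A rough sketch of that last step, purely illustrative (the cache layer, TTL value, and fetch function are all made up): with a simple read-through cache, hours of downtime means every entry is stale, so the first wave of traffic after reconnect is effectively 100% misses landing on the database at once.

  import time

  CACHE_TTL = 300                    # seconds; invented value for the sketch
  cache = {}                         # key -> (value, fetched_at)

  def get_profile(user_id, fetch_from_db):
      entry = cache.get(user_id)
      if entry and time.time() - entry[1] < CACHE_TTL:
          return entry[0]                       # warm cache: cheap
      value = fetch_from_db(user_id)            # cold/stale cache: hits the DB
      cache[user_id] = (value, time.time())
      return value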
I find this kind of uninformed conjecture amusing on a thread full of people complaining about cloudflare doing the same (they didn't). There's no evidence of this kind of flapping behavior in any of the telemetry I've seen posted by network engineer friends, and the blog post explicitly calls out when they saw the BGP updates that brought the site back online.
For me, the hardest part to believe is there was literally no one on-site at their datacenters. Really? No one? At this scale, literally, there has to be a security guard there who can kick the door open.
Reminds me of when I was working at a startup and we DoSed ourselves. We had devices that monitored power minute by minute. A load balancer was misconfigured and we went down for some hours. We came back up, and the devices all saw that and flooded us with all the data they'd been storing while we were down... down again... bring it up, and down again... we needed a better fix. I was up till 2 am with the other developer coding a fix. The next morning we talked to the CEO (the CTO was on vacation), who told us upgrading the database was going to be too expensive... good times. We did get a firmware fix (to be honest, we were running out of cash..)
I can’t imagine trying to restart something as big as facebook…
Unpopular opinion, but I think a talent exodus and turnover are likely at fault here as well. The people who stood up this juggernaut are no longer there, and Facebook's ability to consistently attract talent competent enough to maintain such a formidable beast is hampered by repeated revelations that it operates at a net loss to humanity as a whole.
Facebook's no longer an innovator, just a mining operation with a dwindling population of hateful elderly and bots.
Isn’t Facebook huge in some Asian countries? Supposedly 3.5 billion people use one or more of its services. WhatsApp, Instagram and Oculus certainly aren’t just used by the elderly.
Depends on which specific country you mean by Asia, as nations there are kind of culturally segregated by sea and by heritage - I mean, Instagram is big but Facebook itself is smol in Japan[1], and Twitter is bigger there than anywhere else[2] (taken from page 110 and page 170 of "The Digital 2021 Global Overview Report" from We Are Social/Hootsuite[3]).
Anecdotally, around 90% of the people I know in the UK (40s age group) have Facebook and use it for something (even if it's just for checking restaurants/gigs etc), and almost everyone uses Messenger or WhatsApp. Also, for those saying they use FaceTime, stats suggest that half of phones in the UK are Android. FB is relevant, even if you don't want it to be, maybe in the same way Windows/Microsoft is relevant for a whole load of people.
Taking a bet that complex systems will fail is usually free money :).
Having been a part of Google, the only thing more awe-inspiring than the sheer complexity of production is the fact that it all worked so well.
This is not a dig at current Googlers but entropy is cruel and uncaring. Perhaps parts of the stack which have been kept fit & fresh in people's minds due to constant rewrites will last longer but there are tons of places in the depot that are unowned despite serving production query traffic and the number of engineers that have any context to support it grows smaller over time.
Google (at least the good parts) has a pretty good culture of documentation, which helps a bit. While it drove me batty to need to spend a month writing up 10 page specs and getting sign off from directors for a minor feature that nobody would see and would take 3 days to code up, it was nice to be able to trace through the historical evolution of abandoned features when trying to figure out how they worked.
I haven't been following closely, but I think once they moved the prefixes they could no longer access the routers. Coupled with barebones staff at the data center due to the pandemic, and all internal communication being disrupted. Though I really expected it to be up within an hour or two.
We have had out-of-band management ports & network designs for decades! I know the feeling of driving 8 hours because I lost the connection to the device I was configuring. https://en.wikipedia.org/wiki/Out-of-band_management
This. And if you're really worried about it you can go crazy with security with individually issued hardware security tokens and one time use access tokens.
It's pretty inexcusable that FB wasn't able to use OOB management.
Yeah, I think that is true. If you look at the Update near the end of the Cloudflare article there is a huge spike in the BGP activity (I assume re-announcing all of the routes). So that part of it was relatively instantaneous once they got all of their ducks in a row: actually getting to the routers and locating a BGP config from before it went offline this morning that they could use.
Presumably they're working remotely, but the incident required physical access, so only the reduced number of people present were (immediately (or at all, I don't know their policies)) available to deal with it.
Restoring is just as simple as flipping the switch again, but access to that switch is another matter when your internal network is also down and you cannot even get access to your office or datacenters.
What I don't get is why there is no dead-man switch mechanism in place to roll back the configuration automatically unless someone confirms it positively. Kind of how screen resolution rolls back if you don't ack it. I used to always run a "(sleep 600; iptables -F) &" when messing with remote personal stuff just in case I lock myself out.
I suppose with something like BGP it would be very difficult to get such a fallback working given how distributed the system is, and even more difficult to keep it exercised and tested.
This is a key feature of Junos on Juniper devices. It's called 'commit confirmed' and it will apply the config and then auto-rollback if you don't confirm it within a certain amount of time.
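Outside Junos, the same dead-man pattern is easy to sketch in Python (apply_change and rollback here are hypothetical stand-ins, not any real router API): apply the change, arm a timer, and revert unless someone who can still reach the box confirms in time.

  import threading

  def commit_with_deadman(apply_change, rollback, confirm_window_s=600):
      # Apply a change, then auto-roll it back unless confirm() is called in time.
      apply_change()
      timer = threading.Timer(confirm_window_s, rollback)   # the dead-man switch
      timer.start()

      def confirm():
          timer.cancel()   # you can still reach the box, so keep the new config
      return confirm

  # Usage sketch (push_config / restore_previous_config are placeholders):
  #   confirm = commit_with_deadman(push_config, restore_previous_config)
  #   ...verify you still have access, then call confirm()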
I'm pretty new to BGP, but I'd imagine that cutting off access to an AS is fast because all it takes is for the neighbouring routers to update their routes. At that point any traffic that makes it that far is simply dropped.
Whereas to make an announcement, the entire internet (or at least all routers between the AS and the user) needs to pick up the new announcement.
I think it's not so simple because authoritative DNS systems are involved.
So it's not just a BGP error. It's a BGP error which disconnected authoritative DNS for all of Facebook. I'm not quite sure why that makes it so slow to fix. Is it just because of internal difficulties due to having no DNS at all?
I'd assume it is a cache invalidation problem at that point: from my lay understanding, BGP probably needs to bust caches on a withdrawal to keep traffic from going to black holes and prevent DDoS attempts, but probably has to wait for TTL timeouts to cache new routes (and those TTLs are going to vary by whatever cache systems the other ASes are running, not the timing of Facebook's AS sending the new [old] routes).
BGP doesn't really cache routes. You can configure routers to hold a route for some time if a peer times out, but this is only used in special cases and usually not something you would want your router to do with routes to other networks. If a router gets a withdrawal for a route, it will remove the route from the table without waiting. This is a feature and an important part of how BGP acts in case of problems and how it can self-heal quickly; there's no point in caching a route when that path is not working anymore, and usually there's a backup path.
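A toy illustration of that (nothing like a real BGP implementation; the prefix, peers, and "shortest AS path wins" rule are simplifications for the sketch): a withdrawal removes the path immediately, a surviving path from another peer takes over at once, and when the last path is withdrawn the prefix is simply unreachable.

  # prefix -> {peer: AS-path length}; shorter path wins in this toy best-path rule
  paths = {"192.0.2.0/24": {"peer-A": 2, "peer-B": 4}}

  def best_path(prefix):
      candidates = paths.get(prefix, {})
      return min(candidates, key=candidates.get) if candidates else None

  def withdraw(prefix, peer):
      paths.get(prefix, {}).pop(peer, None)   # removed at once, no cache, no TTL

  print(best_path("192.0.2.0/24"))   # peer-A
  withdraw("192.0.2.0/24", "peer-A")
  print(best_path("192.0.2.0/24"))   # peer-B takes over immediately
  withdraw("192.0.2.0/24", "peer-B")
  print(best_path("192.0.2.0/24"))   # None: the prefix has vanished from the table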
Given my experience with DNS issues, I am guessing that they are running into dependencies along the way that assume/require DNS be available to function.
With routing it's even worse than that. If they had no out-of-band method to connect to these routers and they botched the routing config then they had no way to route any traffic to them at all. At least with DNS you can still connect to the IPs.
I would find it a bit surprising if Facebook didn't have OOB access to their data centers, however.
I'm sure they got stuck in a security loop. To get the OOB access passwords or IP addresses, they needed to get into a password vault that sits behind Facebook's own network.
Next time, FB, save your passwords in OneDrive and Google Drive as a backup LOL. facebook-oob-password@gmail.com
Yeah as part of my job I often have to work with our DNS team to provision say a subdomain or get some domain verified. They’ve got like…three people…trying to service thousands of teams across the enterprise. I do not envy their job at all.
It's an IP address management (IPAM) solution that also just happens to be a fantastic, federated (if you want) DNS management system too. Indeed a previous org I worked at bought it strictly to tame the DNS beast - local sys admins could control DNS for their subnets but not affect anything else. If we wanted, we could have had approval processes on top of the change requests - the system supported that too.
I think the security teams finally woke up to the IP address management functionality and were slowly starting to integrate that into the rest of the infrastructure - but I was leaving around then. It was a fantastic system. One of the best hierarchical role-based access control systems in an application I have ever seen; the granularity was amazing yet it was easy to understand/administer. Not an easy trick!
I'm not sure if that is true (and I hope it is not, because that would be fatal), but I read somewhere that Facebook being down also means all of Facebook's internal infrastructure is unavailable at the moment (chats, communication), including the remote control tools for the BGP routers. Therefore they require people to get physical access to the routers, while many people are working from home because of the pandemic.
It's just a guess, but from experience with BGP and the associated redundancy systems that most likely were in place: if everything doesn't return immediately, then you have a big fight on your hands to not only stop the non-functional redundancy but also reestablish the peer connections, with the associated heartbeat/processes for establishing and maintaining the peer connection. My understanding from what people write about configuring BGP and the systems around it seems to imply that the best practice in this circumstance is to kill everything, fix the original error, and then turn things back on slowly. Then fix the broken redundancy configuration. Then test the redundancy system regularly in the future.
I feel like it just confuses the issue with a bunch of unnecessary babble about DNS; there are better ways to read about how BGP works without conflating a bunch of different things. The only part of the article that was relevant was 'Routes were withdrawn' - the rest being a consequence of that.
You have to be careful turning something as large as Facebook back on. If you turn on announcements one place first, the entire internet will try to reach you through a single transit and overwhelm it.
The kind of tail you’re talking about is baked into DNS at least.
I don’t know enough about BGP to make an informed decision; but at the point the outage is noticed it’s entirely possible that the system has been unavailable for quite some time already.
Fast offboarding, slow onboarding is a pattern you find all over. It has roots in fraud prevention, but there are many reasons for it. Getting more access is a privilege escalation and requires some trust to achieve.
If an authoritative DNS entry was removed, it can take up to 72 hours for that change to be propagated around the world, though usually just a few hours for some other authoritative DNS systems to get you mostly back:
Resolvers typically cache successful "does not exist" responses for no more than 1-3 hours. (And authoritative servers often have a lower negative TTL.)
(There's a corner case related to DNSSEC that can make it go higher, but that's being worked on, and isn't relevant here.)
In this situation, the nameservers were just down. I haven't done exhaustive research, but the resolvers I'm aware of cache that kind of thing for no more than 15 minutes.
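The distinction matters for recovery time, so here's a tiny sketch of the two cases (the TTL numbers are just the ballpark figures from the comments above, not any particular resolver's defaults):

  import time

  NXDOMAIN_TTL_S = 3600    # "name does not exist": cached on the order of hours
  SERVFAIL_TTL_S = 900     # "servers unreachable": cached on the order of minutes
  cache = {}               # qname -> (result, expires_at)

  def cache_result(qname, result):
      ttl = NXDOMAIN_TTL_S if result == "NXDOMAIN" else SERVFAIL_TTL_S
      cache[qname] = (result, time.time() + ttl)

  cache_result("facebook.com", "SERVFAIL")
  # Once the nameservers are reachable again, resolvers retry within minutes,
  # which is why DNS recovered quickly after the routes were re-announced.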
How can you explain yesterday's outage (Facebook, Instagram, WhatsApp) to your parents?
You are feeling hungry, so you go to the food court. The food court (an open area) has a lot of options. You sit down in front of Domino's (Facebook) since you want to eat garlic bread. Now, you can't order from the counter directly; the waiter comes to your seat and takes your order. You order garlic bread from the waiter, but the guy at the Domino's counter has gone missing. Your order never reaches the chef in the kitchen because the Domino's counter guy isn't there.
This explains why the Domino's (Facebook) ecosystem was down, but what about the other vendors? They had nothing to do with Facebook.
To understand this, we need to go back to our food court. Now, there are a lot of hungry people sitting outside Domino's. Since they were not getting an answer from one waiter as to why their food was not on their table, they started bothering all the waiters. Because of this, the majority of the waiters were busy trying to figure out where the Domino's counter guy went, and the other food joints (read: websites) could not fulfil their own orders.
So although only Domino's was down, it appeared as if the whole food court (the Internet) was facing issues.
Counter Guy at Domino's - Facebook Nameservers
Waiters - DNS Servers (Cloudflare, Google, Akamai)
I'd just say that you had an address for Facebook in your address book. The page somehow vanished and you don't know their address any more. So you start phoning other people and knocking on their doors to try and find out what their address is. Everyone else is doing this and no one knows what their address is. So you've got millions of people phoning each other and knocking on doors.
Facebook being down was already an issue, but everyone phoning and knocking on doors was causing disruption to everyone else.
There is no need to make it any more complex than "facebook, the company, messed up, now their properties are broken". An overly elaborated analogy just makes you sound condescending.
The Internet is just a bunch of computers interconnected via tons of cables (hence the name: "inter-networked computers").
To be reachable, every piece of equipment and every computer constantly needs to tell the others about its existence (to publicly announce which network cables it can be reached on).
Facebook engineers wanted to optimise that system but accidentally broke it during the update.
As a consequence, after a few minutes, other computers didn't know on which cables they can reach Facebook.
Facebook had to call the technicians sitting in the datacenter to cancel the last change that was done (because the Facebook engineers couldn't themselves connect from the office) and everything was fine again.
That would have been accurate for a DNS outage; but with my layman understanding of BGP, I would say the analogy would be something between "...but their phone line is broken" and "...but they disappeared from the phone book because they don't have a phone line any more".
Actually, it probably is, especially if you dial the analogy back a couple of decades, before the "We're sorry, that number has been disconnected" automated responses: Facebook's phone line went down, and when you call the operator, even if you have the phone number, they can't connect you. But this is weird, and you aren't the only one trying to call Facebook, so now they are calling in other operators to diagnose the problem, because surely someone has heard from Facebook recently.
That analogy includes the snowball impact on the other websites and services as the Switchboard Operators get more over-utilized into puzzling out Facebook's problem than servicing calls for still working phone numbers.
Even Google isn't quite sure about summer time.
Not sure if that is just a Google German thing...
A few weeks ago I tried to find out what the current time in CET is.
Asking google for "CET" gave me: "23:27 CET".
Asking google for "CET time" (I know that "time" is twice in this case) gave me "00:27 CET".
The last one is wrong: it should say CEST, or, even more correctly, give the same result as the plain "CET" query, since CET is what I asked for.
I am always forgetting the polarity. Seeing the cut-overs listed helps me:
$ zdump -v Europe/Berlin | grep 2021
Europe/Berlin Sun Mar 28 00:59:59 2021 UT = Sun Mar 28 01:59:59 2021 CET isdst=0 gmtoff=3600
Europe/Berlin Sun Mar 28 01:00:00 2021 UT = Sun Mar 28 03:00:00 2021 CEST isdst=1 gmtoff=7200
Europe/Berlin Sun Oct 31 00:59:59 2021 UT = Sun Oct 31 02:59:59 2021 CEST isdst=1 gmtoff=7200
Europe/Berlin Sun Oct 31 01:00:00 2021 UT = Sun Oct 31 02:00:00 2021 CET isdst=0 gmtoff=3600
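If Python 3.9+ is handy, zoneinfo gives the same answer without memorizing the polarity (just a quick check, nothing more):

  from datetime import datetime
  from zoneinfo import ZoneInfo    # stdlib in Python 3.9+

  berlin = ZoneInfo("Europe/Berlin")
  for d in (datetime(2021, 10, 4, tzinfo=berlin), datetime(2021, 12, 4, tzinfo=berlin)):
      print(d.date(), d.tzname(), d.utcoffset())
  # 2021-10-04 CEST 2:00:00  -> still summer time on the day of the outage
  # 2021-12-04 CET 1:00:00   -> standard time after the Oct 31 cut-over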
In this case, Google is deliberately providing a wrong answer so as not to confuse people. People don't understand that a time zone can exist but not be observed at a given moment.
I make a point to give local times in just "PT". Not least among my reasons is that I can't ever remember which half of the year is Daylight time and which is Standard.
I personally find timezones more annoying. At least with encoding once you figure things out it will work indefinitely. Timezones can simply change from under you with or without notice.
No joke. Today I ended up writing a whole essay explaining the issue I was having and almost sending it off to the core developers because I thought I had discovered an issue with the actual language. The bug was that I had forgotten to convert to and from UTF-8 in these two procedures:
Actually, the article seems to confuse the times quite a lot. It's talking about ~16:50 UTC at points, but the outage started at 15:40, which they not only mention in the article but you can also see on the graphs.
Question about the WARP map - I assume the grey countries are places where Cloudflare doesn't have any presence, but what about Egypt/Oman? Why are they green? And why is Australia orange and not red?
This provides a good set of (mostly educational) details about what happened, up to but not including how and why the BGP routes were withdrawn (who sent the UPDATE packets to the neighboring ASes?).
The most "natural" occurrence that I can think of is best-path change. If a "better" route between AS is added, the now-second-best routes are withdrawn.
Correct me if I am wrong, but there is no way of determining the source of the withdraw message (UPDATE)...
> "Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. "
and
> "... its root cause was a faulty configuration change on our end."
All very amusing. But most amusing was that facebook employees' badges wouldn't work. Hilarious. Can't get in the office if facebook.com is not reachable. (or whatever the unreachable service was)
Far fetched, obviously, but.... Jurassic Park, Dennis Nedry-type situation? Shut down the master systems to hide the theft (or in this case, destruction) of vital info.
No key-card access means even non-need-to-know internal employees wouldn't see the deed, and plausible deniability is spawned for everyone else.
The second explanation seems plausible to me. The first one is a bit farfetched since they could've just waited two weeks in order to make it look less suspicious given that the leak would not be front of mind for everyone by then.
I read on Facebook that Nicki Minaj's Cousin's Friend was using an Oculus during the Facebook outage, and he got trapped in CyberSpace, and when his balls became swollen in CyberSpace, they also became swollen in real life!
Oculus customers were only able to play offline games during this time, which I find concerning. I don't mind that FB employee badges didn't work, because what would they do at their desks anyway? All exit doors have bypasses, so they could be, and were, opened; it's just that rank-and-file employees couldn't access some spaces. Some of those spaces are secure and "fail closed" by design. Yes, it's inconvenient, but unlocking all the doors to potential terrorists or workplace shooters whenever the internet connection gets snipped isn't what you want in an office.
What's scary is that my gaming machine can't be used because of some weird obsession with centralizing Oculus through Facebook. As a customer, that's inexcusable, but now imagine a developer who hosts her own servers for her game, and her customers can't get to it. Facebook is only providing middle-man authentication; it's not even hosting these games. If they were hosted by FB it would be fine, but they aren't, and what we're suffering under is FB being the gatekeeper to the rest of the internet. That is a scary prospect.
I also question why we all think it's acceptable to have these incredible BGP outages every 3-6 months. We built our digital world on the equivalent of ever-changing spinning plates and manage it in the most cost-efficient way possible. Maybe the alternative would have been worse, but it's crazy to me what we consider normal and acceptable in capitalist culture.
I'm already seeing /r/oculus say it's no different from Valve's scheduled maintenance windows, which is obviously false. It's incredible what rationalizations we'll accept instead of questioning the status quo of capitalist culture and the giant corporations that rule so much of our lives.
> I also question why we all think its acceptable to have these incredible BGP outages every 3-6 months.
This isn’t BGP’s fault, someone or something presumably made a network configuration change that took down their route advertisements to the world. There’s currently ~72K autonomous systems advertising ~900K IPv4 prefixes today on the Internet. There’s bound to be some sort of screw up once in a while.
Curious about this as well; I could not find anything in my initial searches, but I honestly hope so. The one reason I have not grabbed an Oculus is the Facebook requirement.
When I hear Facebook leadership talking about the VR vision, I'm reminded of IBM in the early 2000s encouraging employees to claim their Second Life avatars and set up cyber customer briefing centers.
Exactly... like I am going to strap on a headset and work with my co-workers' memojis for 8 hours, or even less. Laughable. VR has been around since the 90s with the same form factor; it hasn't evolved, because it will always be "strap this bulky thing to your head and isolate yourself from the world". Maybe it will catch on for videogames, but to co-work in a virtual reality world? Ha.
AR glasses, at least, have a form factor that has changed since Google Glass and will continue to evolve: from Google Glass to Facebook's Stories sunglasses, which look like the sunglasses billions are used to wearing daily. Billions will never strap on a headset for hours to co-work. Billions will adopt AR glasses, as they take a familiar form factor, a daily-life product, and enhance it, like the iPhone enhanced our daily lives.
Eh, I could see myself doing development in that type of environment if the tools get better. I've worked remote for the past 6 years, so it wouldn't quite be the same as in an office environment though.
You would think those in charge of security would know better, with their mid-six-figure salaries. But then again, these days they've stopped hiring based on merit but rather on certain other "metrics"; no wonder they were so incompetent...
The person you're replying to is insinuating that Facebook hires half-a-million dollar engineers based on diversity quotas instead of technical chops. No sense asking them to elaborate since they're obviously completely outside the company and have no worthwhile insight into their hiring practices.
"Diversity quota" is one interpretation, mine was that they meant FB hires were determined according to who you know and who you blow. (probably less of the latter post-#metoo)
Is there a petition to make this permanent? Asking for a friend
The article itself was a good exploration of the impact of BGP and what happens when network advertisements stop and the associated network disappears in a puff of global forgetfulness.
Might be time to poke around BGP as well. A lab setup might be a good toy.
Let's assume somebody messed up their BGP config. Does anybody know if they use Juniper? Why would they not do the changes via the "commit confirmed" option?
The idea is you do something, you mess up and are locked out, and the system reverts to the previous config after a few minutes by itself. The story of people physically accessing systems only makes sense if this was a hack...
===============
Commit Confirmed
"Suppose that despite all your efforts to insure your new configuration is correct before you commit it, something is overlooked and when you commit, you are locked out of the router."
"Rather than just a simple commit, you can make the candidate configuration active with a commit confirmed command. With this command, the router waits 10 minutes for a second commit. If it does not receive that confirming command within those 10 minutes, the router automatically does a rollback and commit so that the previous configuration becomes active again."
This and other stories seem to assume that a BGP issue is the root cause. That may be, but there are other possibilities.
Some other internal issues could have broken many systems, including BGP.
They could have had some sort of internal systems failure, and intentionally withdrew BGP to cut off the flood of connection attempts (making recovery easier).
Is anyone else kinda put off by how Cloudflare keeps interjecting themselves into this situation? First with the Twitter posts and now this blogpost. They're a passive observer, not involved in this event, and yet they keep putting themselves in the middle of it.
Cloudflare staff are not Facebook staff, they do not know how things actually went down to trigger the events that transpired. They are essentially doing glorified bikeshedding, and providing an explainer of how BGP works that can be gleaned by reading virtually any introductory text on it, and using it as a marketing opportunity.
> They are essentially doing glorified bikeshedding, and providing an explainer of how BGP works that can be gleaned by reading virtually any introductory text on it, and using it as a marketing opportunity.
I mean, I'm not really sure what the purpose of a corporate blog is except that? You make posts about whatever will garner attention in order to get some views, maybe spread some information, and - of course - turn it into a marketing opportunity. That's the job, no?
> Is anyone else kinda put off by how Cloudflare keeps interjecting themselves into this situation?
Personally? No. Although amusingly, a non-technical friend of mine took the twitter posts they made as a sign that Cloudflare had caused the outage somehow, so it's certainly possible there's a risk there.
And from one of your other comments:
> Next time Cloudflare's CDN eats itself and starts vomiting up private customer data, Facebook can do a blogpost titled 'Understanding how Cloudflare exposed the private information of untold numbers of its customers'.
Yes, they should totally do that (at least if they have anything informative to contribute, as I think Cloudflare does here). Why would this be a bad thing, or a reason not to talk about Facebook's issues? And I mean, at one point today something like 6+ of the top 10 links on HN were about Facebook, so I mean, everyone else is talking about them. Why not Cloudflare? And if and when Cloudflare has their next big issue, everyone will be talking about them either way.
The CEO recently pointed out on Twitter that the primary purpose of the company blog was hiring, and that the company would write with that in mind. The post taught me something I didn’t already know, and left me more impressed than I was before with Cloudflare. Mission accomplished.
I mean... they're in the business of keeping websites online. It's natural that they document these events both for their own research and to market their product.
It's merely an informative blog post on a topic many people are interested in. I see nothing wrong with Cloudflare just explaining what happened, even though they had nothing to do with it.
It is little stuff like this that makes it come off a bit self-aggrandizing:
> We keep track of all the BGP updates and announcements we see in our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where the traffic is meant to flow from and to everywhere on the planet.
> Fortunately, 1.1.1.1 was built to be Free, Private, Fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep servicing our users with minimal impact.
It isn’t a big deal, and the posts are still interesting. It just makes me roll my eyes a bit.
> I see nothing wrong with Cloudflare just explaining what happened, even though they had nothing to do with it.
You kinda inadvertently highlighted the issue: because they had nothing to do with it, they do not know what actually happened. They can pontificate about likely causes, just like others in the industry can, but they have no idea what actually caused the issue.
At no point in the blog post did they offer any conjecture about what was happening at Facebook. All of their information was general descriptions of DNS and BGP, or descriptions of how the Facebook outage was experienced on their end from running a DNS resolver. That in and of itself makes for an interesting and informative perspective.
I assume you did not read the blog post? It’s just a technical post describing the outage from Cloudflare’s perspective and mostly focuses on the increased traffic to 1.1.1.1 and the latency it caused
You can pontificate about the likely contents of Cloudflare's blog post, just like others who did not read it, but clearly you have no idea what it actually contains.
If you read the blog post, you'll see that it's speculation-free facts about what happened. BGP announcements happened at time t, DNS started failing at t+n, DNS requests spiked, BGP updates happened at t', DNS returned to normal at t'+n.
They've done this in previous outages and people didn't like it then, either.
Generally big companies don't talk smack about other big companies having internet-wide issues, unless those issues are directly caused by the other company.
For instance, when Google talked about Cloudbleed, which was when Cloudflare vomited millions of secrets all over Google's caching hierarchy and Google had to manually clean it up.
I think perhaps the Cloudflare people have gotten confused and think that means it's okay to talk about other people's stuff. Instead of interpreting it as it really is, which is that Cloudflare is the last company that should be criticizing everyone else, lest someone bring up their previous missteps.
Given that warp is a pretty major player in the VPN space and lots of their customers are likely to blame them for not being able to get to Facebook, I think having a detailed "wasn't us" on their blog that their sales engineers can point to is reasonable.
I kinda see your perspective; but I also mostly view them similarly to, e.g., BackBlaze blogs - sure, it can be viewed as talking smack about Samsung and Seagate drives failing, but I see them as "things fail; what can we learn from it / here are our observations". And yes, a little bit of self-advertising, but again, most tech blogs are, one way or another.
It just comes off as crass. Next time Cloudflare's CDN eats itself and starts vomiting up private customer data, Facebook can do a blogpost titled 'Understanding how Cloudflare exposed the private information of untold numbers of its customers'.
As someone who is always interested in learning new things, their blog post is informative and helpful. Not everything is about optics, their intentions are of no interest to me, because I got some value out of the content.
how are they 'interjecting' themselves? They run a massively popular dns service as well as many other sites, and they are simply reporting on what happened and why people who use their dns etc might have had outages. In addition to being a very detailed and well written article on the whole incident. Not really sure what you are driving at.
At the fact that they are presenting themselves as authorities on the incident, when they have no internal knowledge of what triggered the events, because they are not Facebook engineers. They provide an explanation of BGP that you can glean by reading virtually any other introductory explainer, and turn it into a chance to promote their own service.
>At the fact that they are presenting themselves as authorities on the incident, when they have no internal knowledge of what triggered the events, because they are not Facebook engineers. They provide an explanation of BGP that you can glean by reading virtually any other introductory explainer, and turn it into a chance to promote their own service.
This is not an honest representation of the article.
They talk about what their services observed related to bgp traffic from facebook. They are an authority on that.
They talk about their dns traffic changes from facebook's outage. They are an authority on that.
They talk about suspected causes, based on the observable data, and guess what, given they are who they are, this is something they are a subject matter expert on.
"we saw a spike in bgp traffic from facebook followed by a bunch of route withdrawals. we think this could be a bgp configuration issue [given we took large chunks of the internet down 2 years via the same fuckup]"
Is something in the subject matter wheelhouse of cloudflare, yes.
Ironically, I read this and thought "I like this and these people are doing the sort of things I like doing, maybe I should think about working with them if and when I need another job."
Sample size of 1, but I imagine that's more or less the reaction they're hoping for.
The whole blog post seemed like clickbait to me.
Nothing is "explained" other than what we knew already, that some unlucky SOB shot themselves in the head with a BGP shotgun.
I understand, they have no way of knowing what happened inside Facebook. But they could give some detail about exactly what the BGP updates were, the structure of the IP space served etc.
It’s massively viral news about the exact stuff they specialise in. I’m not too surprised that they’d be putting out content. After all, Facebook is hardly in a position to do so ;)
They seem to be treating it as a marketing opportunity, as well as explaining these topics for journalists. If their post didn't add to the public conversation, it wouldn't have been so viral.
Krebs did not promote a DNS service he works for a total of nine times in his blogpost, so I did not feel like I was reading an infomercial for how great Cloudflare's continued MITMing of the internet is.
No issue with his openly admitted speculating though? That seemed to be in-part your issue with Cloudflare in many comments, even though they were talking about something specific to its impact on their service.
It's kinda their whole objective: to keep sites up and running. So of course they'll do commentaries; it's for marketing and helps stimulate discussion in their industry.
No surprises here. They do it all the time for websites they cdn. Cloudflare might as well be a marketing company for how much they puff about themselves.
They state that [1] this only happens if you are using more than 3 Page Rules that the free plan allows, which is fair. The alternative would be to randomly delete some of your page rules, which is worse:
> If you do not want to be charged for the additional page rules, you should ensure you only have 3 active rules before you downgrade from Pro to Free.
I don't trust any analysis from Cloudflare. These are the same people preventing people with VPNs or Tor browsers from reaching sites while screaming and yelling OMG DDoS!!, then blaming the Cloudflare customer for not knowing how to configure it.
Maybe a timed rollback with the previous state stored on the device that needs to be rolled back, although if you're doing this at Facebook scale I'm sure it's a little more difficult than it sounds.
How I explained this to friends and family: Imagine that the only way to get to Los Angeles is to use a GPS enabled device. All of the maps know how to get there from anywhere. But imagine that every GPS took Los Angeles off the map. Los Angeles is still there, but nobody knows how to get there. That's what happened to Facebook. We don't yet know why, however.
Why? Do you think there are billions of widely-used recursive resolvers in the world? Each resolver only needs to contact the Facebook DNS servers once per hostname; the end-user requests are all served from cache.
From article:
[This chart shows] the availability of the DNS name 'facebook.com' on Cloudflare's DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.
How would they push an update if they couldn’t reach their own network?
My understanding is that if a bad BGP route was pushed, your whole network (typically accessed via VPN while the engineers WFH) is probably unreachable from any of your employees' homes, so it's hard to get the right people on site to make a fix, or to talk the remote-hands folks at the data center through it over the phone. Troubleshooting is hard; troubleshooting while you have no access to your own network has got to be that much harder.
This happens the day after the whistleblower's 60 Minutes interview? Maybe another rogue Facebook employee took it offline... any chance of that? (The whistleblower seems surprised that Facebook is a for-profit business focused on profit. Not saying that's good or bad, given how big Facebook is, but really, how do you police the world, and what side do you take in all the madness and negativity that makes up humanity, and thus Facebook?)
Those short TTLs that FB likely has on its DNS records probably bit it in the ass today. If it had longer TTLs, caching would have helped it. Curious though that recursive resolvers won't serve an expired cached record when they can't reach the authoritative server. (I know that unbound can be configured to do so, but not sure about others.)
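If you're curious what the records look like, dnspython will show the TTL your resolver hands back (assuming you have the dnspython package; the value you see is whatever remains on your resolver's cache, and today's numbers may differ from whatever FB was using during the outage):

  import dns.resolver    # pip install dnspython

  answer = dns.resolver.resolve("facebook.com", "A")
  print(answer.rrset.ttl, [r.address for r in answer])
  # A short TTL (seconds to minutes) means resolvers had almost nothing cached
  # once the authoritative servers became unreachable.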
I don't think so. The root cause (as I understand it) was that FB stopped advertising its IP address space to the world. Even if you had the IP addresses of the FB servers, you would not find any route to access them.
The fact that DNS was also not resolving was a symptom of the DNS servers also being unavailable since they were part of the same IP address space that was unadvertised.
Stuff like this happens... but the real question is why FB had to send employees to physically go to datacenters to fix this. Sounds like their OOB management had a dependency on FB being online, which seems like a bad design. If you're that paranoid about security, issue revocable one-time pads to a select set of employees.
Let's say that when Facebook cannot announce their DNS prefixes, the DNS resolvers from other ASes cannot reach their DNS servers. But why? Does Facebook change the IP addresses of their DNS servers and other web endpoints constantly? Shouldn't Facebook's IP addresses be cached by DNS resolvers?
I think that it is possible to restore the previous state, but the question is whether it makes sense. When do you decide that it was a failure? Facebook explicitly (even if automatically) told everyone else that they shouldn't use those routes anymore.
When it comes to Facebook's side, I guess they do have backups of their BGP config. Applying them (probably remotely), however, seems to be harder than expected when the whole infrastructure is down.
Once upon a time this would have been the norm. If we were performing any DNS related maintenance we'd drop the TTL to 5 mins a day or so before the maintenance window was due to start. Once happy we'd not broken anything then we'd bump back to 21600 or whatever. But I guess the move fast/break all the things crowd no longer have any patience for this kind of thing.
Do you really think that Facebook is going to write one right now?
It's good that we have some coverage from companies that have some stake in the game because they are also affected by the outage, even if only partially.
Facebook has to make some public statement; the shareholders will demand it.
I expect the detail level to be roughly "an automated system pushed a broken configuration"; that is to say, there probably won't be any interesting information at all for the Hacker News crowd.
I doubt that this was caused by "hackers" or "hostile governments" or "dissident employees upset about Facebook privacy issues", and also doubt that Facebook would admit such if it were true unless they were legally required to do so.
History has shown us they can give us zero response, or an incorrect response, and we (via our representatives) will accept it and continue living life as before.
The fact is, it's altogether likely that they could be legally required NOT to make such a statement outlining the cause if it was a hostile actor. I've felt a distinct change recently; the US government is not messing around about cyber security anymore.
The guys with the blue windbreakers show up, I'd pretty much say "yes, sir." Of course, I don't have FB's power, but I don't think it matters.
Also short comments without discussion and evidence do poorly.
I believe it should be halted for public safety which is well within the rights and capacity of our government. I believe a fair trial should happen before anyone receives punishment, but I don't believe we could find an impartial jury as Facebook is ubiquitous.