May 6, 2024

Oh, fastly.

In the chaos that broke out this morning I got dragged in, because one of the two projects I work on also manages LSRs (if the backbones are the internet's highways, the LSRs are the toll booths, and the BNG is the junction).

Before we get to the point, there are two things to say.

Fastly will hopefully publish a so-called "postmortem", i.e. a complete analysis of what happened to the network and of why a huge number of ASes suddenly disappeared from the routing tables.

Only they can do it, because only they have all the traces and logs needed to do it. What we have seen from the outside allows only hypotheses.

The newspapers say that "a CDN went down". But a CDN does not "shut down": it is too distributed. It is a kind of logistics system for content, which brings the most requested content in a given area to a data center in that area, just as Amazon stocks pasta in its Italian warehouses because Italians consume more of it, and beer in its German ones. The point is to minimize costs.

As I was saying, a CDN does not "turn off". But since its job is to move data close to you, a CDN has to be very good at calculating distances and geometries. "Distances and geometries", in internet jargon, means "routing", which is managed through a "thing" called BGP.

So how does "routing" work? Let's start with the simple case. Imagine having to go from Bologna to Milan. You could drive up to the Casalecchio di Reno toll booth and ask. They tell you "go towards Modena Sud". You arrive at Modena Sud and ask again. They tell you "go to Modena Nord and ask there". And so on, all the way to Milan. The toll booths are the border routers, and inside each one sits a little man who takes your ticket, asks where you are going, and works out only the next leg (no, IP packets do not carry a navigator).
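To make the toll-booth picture concrete, here is a minimal sketch in Python, under the post's own metaphor (toy topology, hypothetical names; real routers do a longest-prefix match on a forwarding table, but the hop-by-hop logic is the same):

```python
# Each "toll booth" knows only the NEXT hop towards a destination;
# nobody hands the packet a full route.
ROUTING_TABLES = {
    "Casalecchio": {"Milano": "ModenaSud"},
    "ModenaSud":   {"Milano": "ModenaNord"},
    "ModenaNord":  {"Milano": "Milano"},
}

def forward(destination: str, start: str) -> list[str]:
    hop, path = start, [start]
    while hop != destination:
        hop = ROUTING_TABLES[hop][destination]  # "ask the toll booth"
        path.append(hop)
    return path

print(forward("Milano", "Casalecchio"))
# ['Casalecchio', 'ModenaSud', 'ModenaNord', 'Milano']
```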

At some point, a reflector may appear. Let's say you work at a toll booth and have a hard time calculating routes for everyone, so you, and all the toll booths from Parma to Milan, decide to call your friend Gianpiernaik in Milan and ask him. And all the toll booths align with what he says.

In this case, the friend in Milan is called a "BGP route reflector". Why is it convenient? Because if the road changes, for any reason, we only need to inform Gianpiernaik, that is, the reflector, instead of informing EVERY fucking toll booth after Parma.

Well.

Now imagine that Gianpiernaik alone can't keep up, and KarenDeborah (I pick the Milanese names at random) gives him a hand informing everyone. The condition for this to work is that Gianpiernaik and KarenDeborah are always on duty (at least one of the two) and that they give the same directions. In that case we have a reflector cluster.
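A minimal sketch of what the reflector buys you, continuing the toy model (hypothetical names; real iBGP route reflection carries attributes like CLUSTER_LIST that this deliberately ignores):

```python
# Toy route reflection (not real BGP): toll booths don't inform each
# other; they all learn routes from the reflector.
class Reflector:
    def __init__(self, name: str):
        self.name = name
        self.clients = []

    def advertise(self, prefix: str, next_hop: str) -> None:
        # One update to the reflector fans out to ALL clients.
        for client in self.clients:
            client.routes[prefix] = next_hop

class Client:
    def __init__(self, name: str):
        self.name = name
        self.routes = {}

gianpiernaik = Reflector("Gianpiernaik")
tollbooths = [Client(f"tollbooth-{i}") for i in range(20)]
gianpiernaik.clients = tollbooths

# The road changes: inform ONE node instead of twenty.
gianpiernaik.advertise("Milano", "A1-nord")
print(tollbooths[0].routes)  # {'Milano': 'A1-nord'}
```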

Well. There are two things you will have guessed.

If all the toll booths rely on the same two friends, things go wrong in two cases:

  1. the case where the two friends diverge and start giving different directions;
  2. the case where the two friends tell everyone "and now go FUCK YOURSELVES! I hate you! _next: 0.0.0.0! DIE!".

Number two would be the equivalent of what is called a "blackhole", which is the case where packets are told to die on the spot, and badly at that.
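In routing-table terms a blackhole is simply a route whose next hop is a discard interface (Null0 is the usual name on real routers; the rest of this sketch is hypothetical):

```python
BLACKHOLE = "Null0"  # discard interface: whatever is sent here is dropped

routing_table = {"Milano": "A1-nord"}

# The "DIE!" moment: the prefix now points at the void.
routing_table["Milano"] = BLACKHOLE

def forward(prefix: str) -> str:
    next_hop = routing_table.get(prefix)
    if next_hop == BLACKHOLE:
        raise RuntimeError(f"packets for {prefix} die on the spot, and badly")
    return next_hop

forward("Milano")  # RuntimeError: packets for Milano die on the spot, and badly
```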

Now, since a CDN must be VERY good at deciding where to move what, and therefore at calculating routes, it makes heavy use of these techniques. The trouble is that, seen from the outside, it looked for a while as if the whole Fastly network had ended up in a blackhole: no one knew how the hell to get there, and whoever asked got the answer "_next: DIE_ON_THE_SPOT_AND_BADLY".

This, mind you, does not mean that you can tell from the outside what happened. Maybe this was an effect of another failure. So you have to wait for Fastly's postmortem.

What I'm saying is that the effect we have seen is what you see when a cluster of reflectors goes to hell. But I DON'T KNOW if this is what happened.

The question you will ask is: supposing route reflection is the problem, how easy is it to produce such an effect? Is human error enough? The answer is… bad news.

A human error is enough.

Since human error can be enough in any field, let's be clear: BGP has the small defect of PROPAGATING problems. It has the nice feature of PROPAGATING useful information. But if you feed it garbage, the motto "garbage in, garbage out" does not apply. The more accurate motto is "garbage in, and it snows garbage from here to Betelgeuse".
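The difference between "garbage out" and "snowing garbage" is flooding: every router re-advertises what it learns to its peers, so one bad announcement ends up everywhere. A toy sketch (hypothetical five-router peering graph):

```python
from collections import deque

# Who talks BGP to whom (hypothetical peering graph).
PEERS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def propagate(origin: str) -> set[str]:
    """Flood an update: each router re-advertises it to every peer."""
    reached, queue = {origin}, deque([origin])
    while queue:
        router = queue.popleft()
        for peer in PEERS[router]:
            if peer not in reached:
                reached.add(peer)
                queue.append(peer)
    return reached

# One garbage announcement at A lands in EVERY routing table.
print(propagate("A"))  # {'A', 'B', 'C', 'D', 'E'}
```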

Well. So, before you get your hands on BGP, you make sure there are TWO people behind the console (if you manage very large networks), you hold the meetings ITIL would call a CAB, and all the rest. It is true that resilience has increased over time, but if an actor like Fastly fumbles a reflector cluster, the rest of the world KNOWS IT.

So the CDN didn't shut down. Apparently, no one knew how to get past the border: it was as if you could get as far as Parma, and from there the toll booths no longer knew what to do.

And since the CDN also runs DNS, the DNS servers got blackholed too. And with them the servers on the edge. And so on, in a chain, and it snows garbage all the way to Betelgeuse.

And here we are at the problem.

The Internet was born as a decentralized system. As long as we had border routers (our motorway toll booths) capable of holding the entire map and calculating the route from any A to any B, things like BGP were enough to keep everything under control.

But now, that is, in 2021, we have a problem: there are entities that (as in Google's case) generate 30% of backbone traffic by themselves. And more or less every one of the GAFAMs owns a double-digit slice of traffic.

And so yes, if for example Facebook or Google botched something with BGP, they could really bring down significant portions of the Internet.

Also because, as you might imagine, if all the routes to Milan disappear, the problem is not only for those who want to GO to Milan: everyone who merely wanted to pass through it also has to recalculate their routes. So if a good chunk of the network goes down, the rest follows, and a domino effect cannot be ruled out: even if everything doesn't come down, it would still take time for all the oscillations to die out (in our example, motorists who decide to take the Turin toll booth, then realize that the Turin reflector fails on them too, decide to go back via Milan, and so on). Or, if the Milan reflector decides to send EVERYONE to Quarto Oggiaro, traffic jams follow: Quarto Oggiaro cannot handle all of Milan's traffic.
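A minimal sketch of that ripple, using plain BFS as a stand-in for best-path recomputation (toy graph, hypothetical names): withdraw one transit node and even traffic that merely passed through it must be rerouted, all of it onto the backup.

```python
from collections import deque

def shortest_path(graph: dict, src: str, dst: str):
    """Plain BFS, standing in for best-path recomputation."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return list(reversed(path))
        for nxt in graph.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # unreachable: the prefix is gone

graph = {
    "Bologna": ["Milano", "Torino"],
    "Milano": ["Genova"],
    "Torino": ["Genova"],
}
print(shortest_path(graph, "Bologna", "Genova"))  # via Milano

# Milano's routes vanish: transit traffic recomputes, and the backup
# (here Torino, i.e. our Quarto Oggiaro) suddenly takes ALL the load.
del graph["Milano"]
graph["Bologna"].remove("Milano")
print(shortest_path(graph, "Bologna", "Genova"))  # via Torino
```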

The centralization of the Internet in the hands of a few enormously heavy actors IS A PROBLEM. And it won't get better in the future.

It's not going to get better because all these actors keep getting more traffic. What's the problem?

The problem is that over the internet we want to run fintech, IoT, connected cars, remote medicine, remote WORK.

We're putting the glassware and the bull in the same room. What could possibly go wrong?

When you say this in "my" environment, the answers are incredibly boorish.

  • This stuff is getting dangerous. Do you really want to put smart cities on top of it?
  • But come on, edge computing exists: we put the things you need close to you. So if Google goes to mevda, you still get your bank account.
  • And then what the fuck do I need the internet for, if edge computing brings the bank into the Central Office? If the Internet is only guaranteed for things close to me, I might as well walk into the fucking bank branch!
  • But edge computing is cool. It is cool. And if something is cool, it MUST be the solution. Let's go through the Centval Office, and the fuck gets longer.
  • But look, a CDN is precisely an edge computing system, and CDNs are the ones having the most frequent problems! How does the problem become the solution?
  • But we are cool.

I could list a whole series of "remedies" that are supposed to get a world that depends on the internet through the next collapse of this kind, and every time there would be some guy who can't even configure his home router, sitting in the big chair, explaining that no, putting the bull in the same room where you keep the crystal is a wise thing, as long as the silverware ends up stuck in your ass.

Personally, I am skeptical that filling a Central Office with servers is useful (also because they would still be out of band), but there is always the nineteen-year-old "cool guru" who explains to you that P4 will solve every problem and that smartNICs are the solution, anytime, anywhere. (Coincidentally, the "cool guru" sells P4 and SmartNIC chips, but that's a coincidence.)

But the point is this: as you noticed today, putting the bull in the same room where you keep the glassware is not a wise move.

And if the bull is the one in charge of polishing your Bohemian crystal, as is the case with the CDN, in my opinion we are asking for trouble, and sooner or later we will find ourselves discussing what to do, URGENTLY, after an accident big enough to REALLY hurt.

There are TOO MANY actors capable of shitting away substantial chunks of the Internet. And the only certainty we have is "eh, but those are system engineers with balls". True, but they are human (although at times, I admit, they look like Vogons). And if you do a lot of things, you make a lot of mistakes: only those who do nothing make no mistakes, and these people do a lot of things.

I repeat: I can only say that the CDN didn't "shut down"; it just looked as if no one knew how the hell to reach it past the edge servers. It was there, the machines were on, but no one knew how the hell to get to them. It's as if everyone knew how to get to Lombardy, as far as Parma, and then… no idea.

It is not certain, I repeat ad nauseam, that this is the "root cause" of what happened. We have to wait for the postmortem analysis.

And here is the second problem: these companies have NO interest in telling the truth. They often put out ambiguous, defensive statements to explain what happened: after almost 10 years we are still waiting to learn what the hell a "storage storm" is, the phenomenon with which Amazon "explained" an AWS outage that lasted almost three days.

So it's not even certain that we'll ever know exactly what happened.

And this is another consequence of the fact that not only are there too many “systemic” actors, but that these actors are not transparent and are not obliged to be.

Under these conditions, we are building our own future problems. We are literally constructing the wall we will, sooner or later, smash our faces into.

But as far as I can see, no one is raising the problem. Today's motto is "cool before important": first we think about Tinder for Cats, and as for routing… we'll think about it later.
