Bit more technical than usual, and not directly about my job, so consider this an informational tale first and foremost. If you wonder why your internet has been acting up in North America in the last 24 hours, here's a graphical explanation.
The Border Gateway Protocol entries in the Forwarding Information Base finally hit 512K. Yes, 512K... That's tiny, and the fact that this caused major network issues in North America shows the fragility of a network built on legacy hardware and software that too many people never bothered to upgrade or maintain, even when the issues it would cause could be predicted a decade ahead of time. In short, it was an artificial problem: a bunch of people waited until things broke instead of preempting a limit they knew was coming.
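If you want to picture what that limit actually does, here's a toy sketch (Python, made-up names and numbers, nothing like real router firmware) of a hardware forwarding table with a fixed 512K allocation: once the global table outgrows the allocation, the extra routes either get punted to a much slower software path or just don't get forwarded properly, and that's more or less the mess people saw.

    # Toy model of a fixed-size hardware forwarding table (think a TCAM carved
    # out for 512K IPv4 routes). Hypothetical illustration only -- real routers
    # differ in how they allocate the space and what they do on overflow.

    TCAM_IPV4_SLOTS = 512 * 1024  # the default carve-up on a lot of legacy gear

    def program_fib(routes, capacity=TCAM_IPV4_SLOTS):
        """Install routes into the hardware table until it is full.

        Anything past the capacity can't be programmed in hardware and falls
        back to a much slower software path (or is effectively unreachable),
        which is roughly what hit the affected networks today.
        """
        installed = routes[:capacity]
        overflow = routes[capacity:]
        return installed, overflow

    # Pretend the global IPv4 table just grew past the magic number.
    global_table = [f"prefix-{i}" for i in range(530_000)]  # a bit over 512K routes

    installed, overflow = program_fib(global_table)
    print(f"programmed in hardware: {len(installed):,}")
    print(f"spilled to slow path / dropped: {len(overflow):,}")

From what I understand, the usual fix on the affected gear was simply to re-carve how that table space is split between IPv4 and other route types and then reload, which is part of why things recovered fairly quickly once people actually got moving.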
For how long was this predictable?
This graph shows a predictable growth pattern going back to 2002. Of course there are slight variations on a day-to-day basis, but there are plenty of public tools demonstrating this was going to become a problem any day now. We raised the issue at the last three senior staff meetings as we were beginning to flirt with the red line, but for once, it really wasn't our company's fault. You can call major US partners and lead them to the fountain of wisdom, but you can't make them drink.
On randomly googled forums, blogs and such, average people had been predicting it was imminent for days, if not weeks.
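To give an idea of how little it took to see this coming, here's the back-of-the-envelope version of what those public graphs show, with made-up but roughly realistic numbers: fit a straight line to a few yearly route counts and ask when it crosses 512K. No fancy tooling required.

    # (year, approximate IPv4 BGP route count) -- illustrative numbers,
    # not real measurements; the real data is published by several public
    # BGP table reports.
    samples = [(2010, 340_000), (2011, 380_000), (2012, 420_000), (2013, 465_000)]

    # Simple least-squares slope/intercept by hand, to keep it dependency-free.
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sum(
        (x - mean_x) ** 2 for x, _ in samples
    )
    intercept = mean_y - slope * mean_x

    crossing_year = (512 * 1024 - intercept) / slope
    print(f"~{slope:,.0f} new routes per year, 512K crossed around {crossing_year:.1f}")

Even with those rough numbers, the straight line lands you in 2014. Anyone running this kind of extrapolation years ago would have seen the same thing.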
Our Telco wasn't (for once) a guilty party in this long-predicted SNAFU, but we sure felt the ripple effect as many of our links to the US suffered for it. Internet calls waiting are still in the triple digits as I type this, despite the explanation messages. Networks is working overtime to minimize the impact, but as always, tech support bears the complaints until it gets fixed. Obviously, only about one frontline agent in ten could put all of this in layman's terms, assuming generously that their customers could follow a clear explanation. The senior staff lines have been lit up all day despite multiple emails and ticker updates all essentially saying 'Not our fault; our major partners forgot to invest in critical parts of their infrastructure for over a decade. Explain the problems nicely.'
While my job of repeatedly explaining the basics to contractors was slightly less dramatic than that of my colleagues at Networks today (and less frustrating than it was for frontline), I wanted to share. Once s*** hit the fan, our partners reacted pretty quickly, but it was definitely a few years late. It's much better now, but we may still see some lingering effects for a while. The fact that 512K limits on live legacy hardware and software are still a problem in 2014 is amazing all on its own. Bill Gates' infamous 640K quote isn't as irrelevant as we'd hoped yet.
I assume others who were manning phones today have their own stories to share about the effects this has had on their customers. Sysadmins surely have more perspective to offer too. For me, once clear instructions were out, my day mostly consisted of explaining to employees, repeatedly, what they were reading and what it meant. At least for once, I can't even blame anyone in our company - it's hard to fix some big majors' unwillingness to ensure their Methuselian s*** can minimally keep up with demand until all hell breaks loose.