Bit more technical than usual, and not directly about my job, so consider this an informational tale first and foremost. If you wonder why your internet has been acting up in North America in the last 24 hours, here's a graphical explanation.
The Border Gateway Protocol entries in the Forwarding Information Base finally hit 512K. Yes, 512K... That's tiny, and the fact that this caused major network issues in North America shows the fragility of a network built on legacy hardware and software that too many people never bothered to upgrade or maintain, even when the issues it would cause could be predicted a decade ahead of time. In short, it was an artificial problem: a bunch of people waited until things broke instead of preempting a limit they knew was coming.
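If you want to picture what that limit actually does, here's a toy sketch (Python, made-up names and numbers, nothing like real router firmware) of a hardware forwarding table with a fixed 512K allocation: once the global table outgrows the allocation, the extra routes either get punted to a much slower software path or just don't get forwarded properly, and that's more or less the mess people saw.

    # Toy model of a fixed-size hardware forwarding table (think a TCAM carved
    # out for 512K IPv4 routes). Hypothetical illustration only -- real routers
    # differ in how they allocate the space and what they do on overflow.

    TCAM_IPV4_SLOTS = 512 * 1024  # the default carve-up on a lot of legacy gear

    def program_fib(routes, capacity=TCAM_IPV4_SLOTS):
        """Install routes into the hardware table until it is full.

        Anything past the capacity can't be programmed in hardware and falls
        back to a much slower software path (or is effectively unreachable),
        which is roughly what hit the affected networks today.
        """
        installed = routes[:capacity]
        overflow = routes[capacity:]
        return installed, overflow

    # Pretend the global IPv4 table just grew past the magic number.
    global_table = [f"prefix-{i}" for i in range(530_000)]  # a bit over 512K routes

    installed, overflow = program_fib(global_table)
    print(f"programmed in hardware: {len(installed):,}")
    print(f"spilled to slow path / dropped: {len(overflow):,}")

From what I understand, the usual fix on the affected gear was simply to re-carve how that table space is split between IPv4 and other route types and then reload, which is part of why things recovered fairly quickly once people actually got moving.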
For how long was this predictable?
This graph shows a predictable growth pattern going back to 2002. Of course there are slight variations on a day-to-day basis, but there are plenty of public tools demonstrating this was going to become a problem any day now. We raised the issue at the last three senior staff meetings as we were beginning to flirt with the red line, but for once, it really wasn't our company's fault. You can call major US partners and lead them to the fountain of wisdom, but you can't make them drink.
On randomly googled forums, blogs and such, average people had been predicting it was imminent for days, if not weeks.
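To give an idea of how little it took to see this coming, here's the back-of-the-envelope version of what those public graphs show, with made-up but roughly realistic numbers: fit a straight line to a few yearly route counts and ask when it crosses 512K. No fancy tooling required.

    # (year, approximate IPv4 BGP route count) -- illustrative numbers,
    # not real measurements; the real data is published by several public
    # BGP table reports.
    samples = [(2010, 340_000), (2011, 380_000), (2012, 420_000), (2013, 465_000)]

    # Simple least-squares slope/intercept by hand, to keep it dependency-free.
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sum(
        (x - mean_x) ** 2 for x, _ in samples
    )
    intercept = mean_y - slope * mean_x

    crossing_year = (512 * 1024 - intercept) / slope
    print(f"~{slope:,.0f} new routes per year, 512K crossed around {crossing_year:.1f}")

Even with those rough numbers, the straight line lands you in 2014. Anyone running this kind of extrapolation years ago would have seen the same thing.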
Our Telco wasn't (for once) a guilty party in this long-predicted SNAFU, but we sure felt the ripple effect as many of our links to the US suffered for it. Internet calls waiting are still in the triple digits as I type this, despite the explanation messages. Networks is working overtime to minimize the impact, but as always, tech support bears the complaints until it gets fixed. Obviously, only about one frontline agent in ten could put all of this in layman's terms, assuming generously that their customers could follow a clear explanation. The senior staff lines have been lit up all day despite multiple emails and ticker updates all essentially saying 'Not our fault; our major partners forgot to invest in critical parts of their infrastructure for over a decade. Explain the problems nicely.'
While my job of repeatedly explaining the basics to contractors was slightly less dramatic than that of my colleagues at Networks today (and less frustrating than it was for frontline), I wanted to share. Once s*** hit the fan, our partners reacted pretty quickly, but it was definitely a few years late. It's much better now, but we may still see some lingering effects for a while. The fact that 512K limits on live legacy hardware and software are still a problem in 2014 is amazing all on its own. Bill Gates' infamous 640K quote isn't as irrelevant as we'd hoped yet.
I assume others who were manning phones today have their own stories to share about the effects this has had on their customers. Sysadmins surely have more perspective to offer too. For me, once clear instructions were out, my day mostly consisted of explaining to employees, repeatedly, what they were reading and what it meant. At least for once, I can't even blame anyone in our company - it's hard to fix some big majors' unwillingness to ensure their Methuselian s*** can minimally keep up with demand until all hell breaks loose.