
Los Angeles Network Issue


Steven
04-13-2007, 12:35 PM
We appear to be having a strange network issue in the Los Angeles Datacenter. We are working on it and will update as soon as we know something.

-Steven

Steven
04-13-2007, 02:51 PM
Everything has been resolved...

-Steven

Steven
04-13-2007, 03:57 PM
OK guys, here is what happened.

At approximately 12:55PM PST (GMT -8:00) we noticed a routing abnormality. One of our Datacenter floors was fully operational while the other was partially inaccessible. On the floor that was having issues (the 11th floor space), many clients could connect just fine, while others, including myself, could not. We immediately had people working to identify the issue. Fifteen minutes later we posted on the forum, since more and more tickets were coming in and the issue was clearly widespread.

We then began a two-pronged approach to determine what the issue was. We were looking into network changes (i.e., config changes on the switches) as well as any possible hardware problems. At 1:30PM we concluded this was a hardware issue: we felt that a distribution switch (one that feeds the switches customers are connected to) was dying. Rich was there and I asked him to run a battery of tests. After he ran the tests, which included consoling into the distribution switches, we determined that the switch was operating correctly, and began checking for any code changes. At 2:15PM we grabbed a standby distribution switch (which we have for these cases). We were then checking the code and routing tables of the distribution switches and the core network switches.

At 3:00PM, Ryan (our main network guru) logged into our core switches and determined that the hardware routing table was full, so the 11th floor routes, including the ARP routes, could not be installed into memory. He then filtered out all routes, and 5 minutes later everything came back online. Once that was done, we waited 5 more minutes, then rebooted the core network switch and implemented a route table limit of 239k routes to prevent the same issue from ever happening again.
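
For anyone curious how a full table causes this kind of partial outage, here is a rough toy model in Python. It is only a sketch and not our actual switch configuration: the class name `HardwareRouteTable`, the `simulate` helper, the prefixes, and the tiny capacity numbers are all made up for illustration. The idea is that once external routes have filled every slot, local routes have nowhere to go, and capping the number of accepted external routes leaves room for them.

```python
class HardwareRouteTable:
    """Toy model of a fixed-capacity forwarding table (hypothetical, not switch code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.routes = {}

    def install(self, prefix, next_hop):
        # Updating a prefix that is already present does not need a new slot.
        if prefix not in self.routes and len(self.routes) >= self.capacity:
            return False  # table full: the route is not installed
        self.routes[prefix] = next_hop
        return True


def simulate(capacity, external, local, limit=None):
    """Load external routes first, then report whether each local route installs.

    `limit` models a cap on how many external routes are accepted.
    """
    table = HardwareRouteTable(capacity)
    feed = external if limit is None else external[:limit]
    for prefix in feed:
        table.install(prefix, "upstream")
    return [table.install(prefix, "11th-floor") for prefix in local]


if __name__ == "__main__":
    external = [f"10.{i}.0.0/16" for i in range(6)]  # illustrative prefixes
    local = ["192.0.2.0/24"]
    print("no limit:  ", simulate(5, external, local))
    print("limit of 4:", simulate(5, external, local, limit=4))
```

Without a limit the external feed uses up every slot and the local route is rejected; with the cap set below the table capacity, the local route fits.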

-Steven