Network downtime – June 3rd 11:24 to 12:31

Amplex experienced a network-wide problem today at 11:25am. While we are still analyzing the logs, we have a good idea of what caused the issue. The network experienced a broadcast storm and loop due to the failure of the mechanisms designed to prevent network loops.

We have seen this same issue twice in the past, approximately one month ago. In those cases the problem occurred late at night and was not noticed by most customers. Following the earlier occurrences we made several changes to the network to remove the lower-bandwidth backup paths that were causing a significant amount of instability. I could go into much more detail, but it's probably not worth discussing since the important part is…

What are we going to do about keeping it from happening again?

There are several steps we are taking to prevent the issue from occurring in the future:

  • Installation of routers at tower sites. We are outgrowing the existing network layout (which has worked well for many years) and will be installing routers at the individual tower sites. This will significantly reduce the broadcast load on the network. We have avoided placing routers at tower sites in the past for reliability reasons, but the advantages of individual tower routers now outweigh the risks. Installing routers is low risk and can be done with minimal impact on the network and customers. The first one will be installed at Luckey today.
  • Splitting the network into 2 logical parts.  The network consists of 2 rings that share a common path between Perrysburg and Lemoyne.  The north ring primarily serves sites in Ottawa county, the south ring serves Wood county.  We are adding an additional link between Perrysburg and Lemoyne and will use that to isolate the north and south rings.  This will reduce the effective size of the network while also helping to isolate issues.
  • Evaluating Performant Networks' Software Defined Networking gear. Performant has designed a network appliance that promises to improve the stability and recovery time of Ethernet networks by incorporating the ITU's G.8032 "Ethernet Ring Protection Switching" standard. This standard and equipment allow for sub-50 ms failover in the event of breaks in an Ethernet ring. The Performant equipment adds an additional feature by continuously measuring the actual performance of the links so that it can make intelligent decisions based on the capacity of the individual links. Evaluating and installing this equipment is a long-term project as the equipment is new and relatively untested. While it shows great promise, we want to run it in a test environment for several weeks before attempting to deploy it. (A toy sketch of the ring-protection idea follows this list.)
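
For the curious, here is a toy Python sketch of the idea behind ring protection. This is an illustration of the concept only, not the Performant implementation (which does this in hardware in under 50 ms), and the site names are just example labels:

    # Toy illustration of the G.8032 idea (not the Performant implementation):
    # the ring keeps one link, the Ring Protection Link (RPL), blocked so no
    # loop can form, and unblocks it only when another link fails.

    class Ring:
        def __init__(self, nodes):
            # Adjacent nodes are linked; the last link closes the ring.
            self.links = [(nodes[i], nodes[(i + 1) % len(nodes)])
                          for i in range(len(nodes))]
            self.failed = set()
            self.rpl = self.links[-1]   # blocked in the normal state

        def active_links(self):
            """Links currently forwarding traffic; a loop is never formed."""
            blocked = set(self.failed)
            if not self.failed:         # normal state: block only the RPL
                blocked.add(self.rpl)
            return [link for link in self.links if link not in blocked]

        def fail(self, link):
            """Simulate a fiber or radio link failure."""
            self.failed.add(link)

    ring = Ring(["Perrysburg", "Lemoyne", "Luckey", "Gibsonburg"])
    print(ring.active_links())          # RPL blocked, no loop
    ring.fail(("Perrysburg", "Lemoyne"))
    print(ring.active_links())          # RPL now carries the traffic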

We understand that a reliable network connection is very important to you and sincerely apologize for the issues today.  If you have further questions please do not hesitate to contact us.

Mark Radabaugh, VP Amplex

Update on new tower sites

Seems like projects always take longer than they should.   In any case…

The Gibsonburg site is up and running. I am not completely happy with the coverage area we are getting from the 2.4 GHz sector at the site, but the 5.7 GHz transmitter is working very well. As soon as we have the funds we will swap the 2.4 for a couple of sectors, which should improve coverage in the area.

The Dirlam Road site just east of Bowling Green is up and running – we will be converting many of the 900 MHz customers south of SugarRidge and/or northeast of the Bays Rd tower to the new site over the next couple of weeks. This will result in a significant performance increase for those customers.

Rising Sun is on the back burner for the winter – I do not expect to have equipment at Rising Sun until spring 2010.

The North Baltimore / Hoytville site is a project for late December or early January – funding and weather may delay this though.

Partial Internet outage 11/12/08 4:24pm to 4:45pm

We noticed a brief loss of connectivity to some destinations on the Internet this afternoon. The problem occurred in a portion of the Verizon network and affected traffic to some popular destinations such as CNN, MySpace, and Facebook. The problem cleared while we were analyzing the situation and deciding on a course of action.

Numerous network operators are reporting the problem on outage mailing lists. Verizon has not issued a statement at this time. The rumor mill is pointing the finger at Level3 (another very large network), claiming that bad route announcements from Level3 triggered the problem.

So how does all this work, you ask? (Or, the really short introduction to BGP.)

The Internet is not a single entity but rather a collection of independent networks connected together. The networks connect to each other at gateway routers. The gateway routers speak a language (actually a protocol) called BGP, in which they announce to each other which networks (and destinations) are reachable by sending traffic through that gateway.

Amplex maintains connections to two large networks (Verizon and Cogent) and we receive information from both telling our router the fastest way to deliver traffic to its destination. Should a network cease to be able to carry traffic to a particular destination (say MySpace), the neighbor router is supposed to 'withdraw' its offer to carry traffic to that destination. When that happens, if we still have a route to the requested destination via our other connection, we will send data out the working connection. Sometimes the route is withdrawn by both providers at the same time – this likely indicates that the destination network itself is no longer online.
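
To make that a little more concrete, here is a greatly simplified Python sketch of the decision our border router is making. Real BGP compares AS paths, local preference, and many other attributes, and the destination names below are placeholders, but the basic "use the best offer that hasn't been withdrawn" logic is the same:

    # Greatly simplified model of route selection with two upstream providers.
    # Real BGP weighs AS-path length, local preference, and more; this only
    # shows the "use the best offer that hasn't been withdrawn" idea.

    routes = {
        # destination prefix -> providers currently offering to carry it
        "myspace.example": {"Verizon", "Cogent"},
        "cnn.example":     {"Verizon", "Cogent"},
    }

    def best_path(prefix, preference=("Verizon", "Cogent")):
        """Pick the preferred provider that still announces the prefix."""
        for provider in preference:
            if provider in routes.get(prefix, set()):
                return provider
        return None                         # withdrawn by both: unreachable

    def withdraw(provider, prefix):
        """The provider tells us it can no longer reach the prefix."""
        routes[prefix].discard(provider)

    print(best_path("myspace.example"))     # Verizon
    withdraw("Verizon", "myspace.example")
    print(best_path("myspace.example"))     # traffic shifts to Cogent
    withdraw("Cogent", "myspace.example")
    print(best_path("myspace.example"))     # None: the destination is offline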

In today's outage Verizon continued to tell our router that the best path to MySpace, CNN, and other sites was to deliver the traffic to Verizon. Unfortunately Verizon was not keeping that promise but was instead dropping the traffic inside its own network. While that situation is not supposed to happen, it does on fairly rare occasions.

Verizon will likely issue a ‘root cause analysis’ regarding the outage at a later date to explain to the routing engineers at other companies how and why this happened and how to prevent it in the future.

How could Amplex work around this problem?

We could shut down the connection to Verizon, which would then route all traffic through Cogent. Unfortunately this is not a decision to be made lightly, since shutting down an upstream carrier causes our own announcements to the rest of the Internet to change. There can be fairly long waits (and disconnections of existing VPN, video, and other sessions) while the Internet determines the new best path to reach us.

We had established that the problem was at Verizon and were preparing to shut down the connection when the problem in Verizon's network was resolved.

Why is it so hard to make a small router that works properly?

How Netgear routers manage to blow up the network:

We have a customer who was reporting frequent temporary lockups on his wireless connection. To diagnose a situation like this we have a variety of standard things that we do:

  • Check the signal strength at the customer premise radio and at the transmitting tower.
  • Check for a high number of re-registrations of the customer radio.
  • Check for errors on the Ethernet interface at the customer site.
  • Verify that the software load on the Canopy radio is current.

Assuming none of the above reveal any problems, we use a program called Multiping to ping the customer radio and the customer router. Multiping sends an ICMP Echo Request to the target computer or router and waits for the response. If there is a reply, the round-trip time is plotted on a graph. If there is no reply, that is marked on the graph as well.
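
For the curious, a bare-bones version of the same idea fits in a few lines of Python. This is a sketch only, not the Multiping product: it shells out to the system ping command (the flags shown are the Linux ones), and the address at the bottom is just a placeholder:

    # Minimal ping monitor in the spirit of Multiping (sketch only).
    # Sends one ICMP Echo Request per second via the system "ping" command
    # and records the round-trip time, or a drop if there was no reply.

    import re
    import subprocess
    import time

    def ping_once(host):
        """Return the round-trip time in ms, or None if there was no reply."""
        result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                                capture_output=True, text=True)
        match = re.search(r"time=([\d.]+) ms", result.stdout)
        return float(match.group(1)) if match else None

    def monitor(host, samples=10):
        history = []
        for _ in range(samples):
            rtt = ping_once(host)
            history.append(rtt)
            print("no reply" if rtt is None else f"{rtt:.1f} ms")
            time.sleep(1)
        drops = history.count(None)
        print(f"loss: {drops}/{samples} ({100.0 * drops / samples:.0f}%)")

    # monitor("192.0.2.10")   # placeholder address for the customer radio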

In this case Multiping was showing only an occasional dropped packet (no reply). This is relatively normal behavior, and when kept below 1% it is not an issue unless the drops are sequential. It is important to note when looking at ICMP reply times that routers (and computers) consider responding to ICMP requests a very low priority – if they respond at all. The lack of a response, or a high ping time to a router in the network path, does NOT necessarily imply a problem – it's just another piece of information and must be evaluated along with the other troubleshooting steps.

If we can't find any problem at this point, well… it's hard to say. The problem could be the customer's computer, perhaps the customer's router, maybe the site they are trying to reach, or some other issue outside of our control. In this case we noticed that the packet loss occurred at the same time for the devices between the Oak Harbor router and the Carroll Water customers. This pointed to a possible issue at Oak Harbor or with the VLAN we use for the Carroll Water tower. Last week we tried removing the VLAN from the router at Oak Harbor and moving its gateway back to the core router at Lemoyne. While this initially appeared to have no effect, the amount of packet loss on the network radically increased as the network load picked up during the day. Monitoring the network at the network tap locations did not show any obvious reason for the increased loss. Due to multiple customer complaints we reverted the changes made to Carroll Water midday (something we normally try to avoid during weekdays).

It was very odd that moving the VLAN made things worse – it shouldn't have, but it did. The only possibility left was that the problem was something at Carroll Water or Oak Harbor. On Wednesday we replaced the router at Oak Harbor – which helped nothing.

On Thursday night around 11:45pm the network monitor indicated problems with much of the network. Normally when this happens (not that it happens often) it indicates a loop on the network or a broadcast storm. While troubleshooting, something very odd appeared – large quantities of ICMP traffic destined for the customer we had been having problems with. The traffic was coming from the public IP addresses of other customers on the network but carried the payload of the packets from the machine running Multiping. Even worse – the packets had the 'broadcast' flag turned on.

Tracking down the routers the packets are coming from reveals that they are all Netgear routers with static IP addresses assigned. ARG! Now it's obvious what is happening… A packet destined for the customer gets slightly mangled on the way, turning on the broadcast bit. The Netgear routers fail to detect that the packet checksum doesn't match (since it's mangled) and, far far worse, proceed to create a copy of the packet and send it to the original destination. All the other Netgear routers on the network hear this broadcast packet and do the same thing. This is like throwing a ball into a room full of mousetraps – the whole thing blows up.
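
For reference, this is roughly what the missing check looks like. The sketch below verifies the IPv4 header checksum (the Ethernet frame check works on the same one's-complement principle); a single flipped bit makes the sum come out wrong, and the correct response is to silently drop the packet. This is an illustration of the general idea, not Netgear's firmware:

    # Verifying an IPv4 header checksum (sketch of the general idea).
    # A single flipped bit makes the check fail, and a router that notices
    # this is supposed to drop the packet rather than act on it.

    import struct

    def ipv4_checksum_ok(header: bytes) -> bool:
        """One's-complement sum of the header's 16-bit words must be 0xFFFF."""
        if len(header) % 2:
            header += b"\x00"
        total = sum(struct.unpack(f"!{len(header) // 2}H", header))
        while total >> 16:                   # fold the carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return total == 0xFFFF

    def handle(packet: bytes):
        header_len = (packet[0] & 0x0F) * 4  # IHL field, in 32-bit words
        if not ipv4_checksum_ok(packet[:header_len]):
            return None                      # mangled in transit: drop it
        return packet                        # only now is it safe to act on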

So now it’s obvious… The reason the customer is having problems isn’t that he is losing connectivity – it’s that he is being buried under bogus traffic from a bunch of buggy Netgear routers.   When we moved the VLAN back to Lemoyne earlier in the week this traffic overload hit the entire network rather than being directed at Carroll Water.

The Solution:

Since we were able to identify all of the customer routers involved, we contacted the customers on Friday and had them change the type of connection they use (from Static to NAT). This prevents the routers from rebroadcasting the mangled packets.

What a mess…..

Mark

Mail servers were slow today

Mail processing was slowed today due to a high load on the machine that checks mail for viruses and spam. The problem occurred while performing upgrades to the operating system.

How is mail processed?   It’s far more complex than it appears…

There are 3 machines responsible for processing mail – 2 machines (named sylvio and paulie) serve as the front end and are responsible for initially receiving incoming and outgoing mail, making a few preliminary checks to see if the recipient is valid, and storing the mail to disk (a process called queuing). Once the mail is queued, a separate process sends the mail to a third server (tony) to be checked for spam and viruses and then (presuming no viruses were found) returns it to sylvio or paulie, where it is again queued to disk. A third process then collects the queued mail and performs final delivery to the local mailbox (for local users) or the recipient's mail server (for non-local users). A stripped-down sketch of this flow appears after the list below.

Why so complex?   A bunch of good reasons actually…

  • 2 front end machines allow us to work on one machine without disrupting mail processing.
  • Spam filtering and virus checking is a slow and difficult process and requires considerable resources (CPU, memory). Separating the storage and processing helps prevent client timeouts. Many mail clients (e.g. Outlook Express, Outlook, Thunderbird) will generate error messages if the mail server does not accept mail quickly.
  • Delivering mail from disk (rather than from memory) is safer.   By queuing mail to disk before acknowledging acceptance we do not lose mail in the event of a software or server crash.
  • Mail is often bursty in nature – a few messages a minute to hundreds a minute.   Since it’s possible for the incoming rate to exceed the rate that messages can be checked for spam and viruses the front end servers hold the mail until the scanner can check it.
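
Putting the pieces together, the flow looks roughly like the Python sketch below. The server names follow the description above, but this is only an illustration of the queuing idea, not the actual mail software, and the address used is a placeholder:

    # Stripped-down model of the mail path described above (illustration
    # only, not the actual mail software): the front end queues the message,
    # the scanner checks it, and only then does final delivery happen.

    from queue import Queue

    incoming = Queue()      # queued to disk on sylvio/paulie in real life
    scanned  = Queue()      # queued again after tony has checked it

    def front_end_accept(message):
        """sylvio/paulie: cheap validity checks, then queue and acknowledge."""
        if not message.get("recipient"):
            return "reject"
        incoming.put(message)               # safe on disk before we say "OK"
        return "accept"

    def looks_clean(message):
        """tony: the slow part - spam scoring and virus checking."""
        return "virus" not in message["body"].lower()

    def scanner():
        while not incoming.empty():
            message = incoming.get()
            if looks_clean(message):
                scanned.put(message)        # handed back for final delivery

    def deliver():
        while not scanned.empty():
            message = scanned.get()
            print(f"delivered to {message['recipient']}")

    front_end_accept({"recipient": "user@example.net", "body": "hello"})
    scanner()
    deliver()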

The servers have had an issue for some time where they will lock up when requested to make a 'snapshot' (backup) of the disk. The lockup issue is a known problem with the operating system version we have been using. We are in the process of upgrading the operating system, which is what caused the high load on the server today.