Why is it so hard to make a small router that works properly?

How Netgear routers manage to blow up the network:

We have a customer that was reporting frequent temporary lockups on his wireless connection.   To diagnose a situation like this we have a variety of standard things that we do:

  • Check the signal strength at the customer premise radio and at the transmitting tower.
  • Check for a high number of re-registrations of the customer radio.
  • Check for errors on the Ethernet interface at the customer site.
  • Verify that the software load on the Canopy radio is current.

Assuming none of the above reveal any problems we use a program called Multiping to ping the customer radio and the customer router.   Multiping sends a ICMP Echo Request to the target computer or router and waitw for the response.  If there is a reply the round trip time is plotted on a graph.  If there is no reply that is marked on the graph as well.

In this case Multiping was showing only an occasional dropped packet (no reply).   This is relatively normal behavior and when kept below 1% it is not an issue unless the drops are sequential.   It is important to note when looking at ICMP reply times that routers (and computers) consider responding to ICMP requests a very low priority – if they respond at all.  The lack of a response, or a high ping time to a router in the network path, does NOT necessarily imply a problem – it’s just another piece of information and must be evaluated along with other troubleshooting steps).

If we can’t find any problem at this point well… hard to say.   The problem could be the customers computer, perhaps the customers routers, maybe the site they are trying to reach, or some other issue outside of our control.   In this case we noticed that the packet loss occurred at the same time for the devices between the Oak Harbor router and the Carroll Water customers.   This pointed to a possible issue at Oak Harbor or with the VLAN we use for the Carroll Water tower.   Last week we tried removing the VLAN from the router at Oak Harbor and moving it’s gateway back to the core router at Lemoyne.      While this initially appeared to have no effect the amount of packet loss on the network radically increased as the network load picked up during the day.  Monitoring the network at the network tap locations did not show any obvious reason for the increased loss.  Due to multiple customer complaints we removed the changes made to Carroll Water midday (something we normally try to avoid during weekdays).

It was very odd that moving the VLAN made things worse – it shouldn’t but it did.   The only possibility left is that the problem is something at Carroll Water or Oak Harbor.    On Wednesday we replaced the router at Oak Harbor – which helped nothing.

On Thursday night around 11:45pm the network monitor indicated problems with much of the network.  Normally when this happens (not that it happens often) it indicates a loop on the network or a broadcast storm.   While troubleshooting something very odd appeared – large quantities of ICMP traffic destined to the customer we have been having a problem with.  The traffic was coming from the public IP address of other customers on the network but carried the payload of the packets from the machine running Multiping.  Even worse – the packets have the ‘broadcast’ flag turned on.

Tracking down the routers the packets are coming from reveals that they are all Netgear routers with static IP addresses assigned.  ARG!   Now it’s obvious what is happening…   A packet destined to the customer gets slightly mangled on the way turning on the broadcast bit.   The Netgear routers fail to detect that the packet checksum doesn’t match (since it’s mangled) and far far worse proceed to create a copy of the packet and send it at the original destination.   All the other Netgear routers on the network hear this broadcast packet and do the same thing.  This is like throwing a ball in a room full of mousetraps – the whole thing blows up.

So now it’s obvious… The reason the customer is having problems isn’t that he is losing connectivity – it’s that he is being buried under bogus traffic from a bunch of buggy Netgear routers.   When we moved the VLAN back to Lemoyne earlier in the week this traffic overload hit the entire network rather than being directed at Carroll Water.

The Solution:

Since we were able to identify all of the customer routers involved we contacted the customers on Friday and had them change the type of connection they use (from Static to NAT).  This prevents the routers from doing what they have been doing.

What a mess…..

Mark