Amplex experienced a network-wide problem today at 11:25am. While we are still analyzing the logs, we have a good idea of what caused the issue. The network experienced a broadcast storm and loop because the mechanisms designed to prevent network loops failed.
We have seen this same issue twice in the past, approximately one month ago. In those cases the problem occurred late at night and was not noticed by most customers. Following the earlier occurrences, we made several changes to the network to remove the lower-bandwidth backup paths, which had caused a significant amount of instability. I can go into much more detail, but it’s probably not worth discussing since the important part is…
What are we going to do about keeping it from happening again?
There are several steps we are taking to prevent the issue from occurring in the future:
- Installation of routers at tower sites. We are outgrowing the existing network layout (which has worked well for many years) and will be installing routers at the individual tower sites. This will significantly reduce the broadcast load on the network. We have avoided placing routers at tower sites in the past for reliability reasons. The advantages of individual tower routers now outweigh the risks. Installing routers is low risk and can be done with minimal impact on the network and customers. The first one will be installed at Luckey today.
- Splitting the network into 2 logical parts. The network consists of 2 rings that share a common path between Perrysburg and Lemoyne. The north ring primarily serves sites in Ottawa County; the south ring serves Wood County. We are installing an additional link between Perrysburg and Lemoyne and will use it to isolate the north and south rings. This will reduce the effective size of the network while also helping to isolate issues.
- Evaluating Performant Networks' software-defined networking gear. Performant has designed a network appliance that promises to improve the stability and recovery time of Ethernet networks by incorporating ITU’s G.8032 “Ethernet Ring Protection Switching”. This standard and equipment allow for sub-50 ms failover in the event of breaks in an Ethernet ring. The Performant equipment adds a further feature: it continuously measures the actual performance of the links so that it can make intelligent decisions based on the capacity of the individual links. Evaluating and installing this equipment is a long-term project, as the equipment is new and relatively untested. While it shows great promise, we want to run it in a test environment for several weeks before attempting to deploy it.
We understand that a reliable network connection is very important to you, and we sincerely apologize for the issues today. If you have further questions, please do not hesitate to contact us.
Mark Radabaugh, VP Amplex