Your networks underpin the operation of all your IT, yet they are often assumed to just work.
In reality, they may contain a considerable amount of sub-optimal configuration, faulty operation due to bugs and, sometimes, areas of poor design. A network may have had a large number of changes or upgrades that improve functionality in the short term, but in the long term make operation more prone to failure.
In effect, all changes, regardless of their source, will shift how the network operates. This is similar to how an aircraft's centre of gravity changes depending on how it is loaded, affecting its flight characteristics.
Unless carefully planned, changes raise doubt about how the network will behave as use of the system evolves.
Good Management is Key
Modern networks are powerful, complex systems: they can establish the fastest routes between endpoints, manage traffic priorities and provide levels of resilience that should make a network entirely reliable and incredibly efficient.
However, when things go wrong it can be a challenge to untangle what is actually happening, and of course the more complex or sophisticated the network setup, the greater the difficulty of management, monitoring, fault-finding and repair.
- The network design is crucial. If the design does not encompass your needs then it simply will not support what you are doing.
- Someone needs to understand your network design, otherwise you cannot determine if it is behaving as designed.
- Something needs to monitor your network so you can see if it is performing properly.
- Someone needs to watch the monitoring and understand what is happening.
Even if you do all of the above, things can and often will go wrong. What we see is that monitoring is valuable, but it doesn't always reflect reality in full: some layers of complexity may not be visible to monitoring, so you could be closer to failure than you think.
Sophistication and Complexity
Modern network and systems designs, for entirely laudable reasons, often include various layers of resilience, clustering or load-balancing across different resources, whether servers, switches, links and so on.
These design features are intended to provide greater certainty that the overall system can survive a fault with any single component, or better spread the application load across multiple resources.
On numerous occasions, we have found the design doesn't necessarily match the configured reality.
On other occasions, the situation is more subtle, and the configuration or even the underlying network protocols themselves fail to correctly "failover" under certain "edge cases".
Our experience of such issues suggests that any monitoring system should be aware of all the underlying protocols that provide these resilience services.
For instance, monitoring systems often provide very simplistic views of bandwidth usage and (for example) whether or not a multi-circuit link is entirely broken; this may well not be sufficient to highlight the sort of protocol issues that mask the true behaviour of the network.
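As a minimal illustration of why an aggregate view can mislead, the sketch below compares total utilisation on a two-circuit bundle with its per-member breakdown. All link names, capacities and counter figures are hypothetical; the point is simply that the aggregate can look comfortable while one member carries nothing.

```python
# Hypothetical per-member byte counters sampled 60 s apart on a
# two-circuit bundle (names and figures are illustrative only).
INTERVAL_S = 60
LINK_CAPACITY_BPS = 1_000_000_000  # assumed 1 Gbit/s per member

samples = {
    "bundle-1/member-a": (1_200_000_000, 5_700_000_000),  # busy
    "bundle-1/member-b": (3_400_000_000, 3_400_000_000),  # counter unchanged
}

def utilisation(before, after):
    """Fraction of member capacity used over the sample interval."""
    bits = (after - before) * 8
    return bits / (LINK_CAPACITY_BPS * INTERVAL_S)

per_member = {name: utilisation(*counters) for name, counters in samples.items()}
aggregate = sum(per_member.values()) / len(per_member)

print(f"aggregate utilisation: {aggregate:.0%}")  # looks comfortable
for name, util in sorted(per_member.items()):
    flag = "  <-- carrying nothing?" if util == 0 else ""
    print(f"  {name}: {util:.0%}{flag}")
```

Here the bundle reports 30% aggregate utilisation, which looks healthy, yet one member is idle and the other is at 60%: exactly the sort of per-protocol detail a top-level bandwidth graph hides.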
Reducing Doubt and Uncertainty
There are some simple points to keep in mind when designing, building or upgrading networks to reduce doubt and uncertainty:
- Always ensure you can monitor any network device that you install, and that you can access a log of its ongoing operation.
- Understand the limitations of your monitoring.
- When making use of a network device feature - such as a load-balancing or routing protocol - check that you can monitor that feature itself.
- Unless you test it, no amount of monitoring or resilience design will guarantee operation when a failover occurs.
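One way to make a failover test measurable is to probe a service continuously while the failover is triggered and report the longest gap in reachability. The sketch below works through that arithmetic on synthetic probe data; the timestamps and outcomes are invented for illustration.

```python
# Synthetic once-per-second probe results recorded while a failover
# was deliberately triggered (timestamps and outcomes are illustrative).
probes = [
    (0, True), (1, True), (2, True),
    (3, False), (4, False), (5, False), (6, False),  # failover in progress
    (7, True), (8, True), (9, True),
]

def longest_outage(results):
    """Return the longest run of consecutive failed probes, in seconds."""
    longest = 0
    start = None
    for ts, ok in results:
        if ok:
            start = None
        else:
            if start is None:
                start = ts
            longest = max(longest, ts - start + 1)
    return longest

print(f"worst outage during failover: {longest_outage(probes)} s")
```

A number like "4 seconds of loss during failover" is something you can compare against your service requirements; "the monitoring stayed green" is not.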