Network Maintenance - Fault Finding not Fault Funding

You have arrived at an excellent network design that will give you the performance and capacity you need. How can you ensure it will be truly fit for purpose and manageable into the future?

When developing a reliable network, or making design changes, there are some important additional measures that will improve your network's manageability and hence reduce costs.

Such measures will help support the network in its day to day use, when changes are made in the future, and whenever technical problems arise.

Adding some basic tools and procedures at an early stage will make running and managing a network a smoother, easier and more efficient process on a day to day basis.

These same tools and processes will not only support normal day to day running, but will also improve the ability to understand the network and design further changes into the future.

In effect you are providing tools, ahead of time, which will assist in handling problems as they emerge. These tools will allow your network or system engineers to rapidly diagnose, isolate and repair any fault which happens to occur. This will speed repairs, reduce costs, increase efficiency and greatly improve uptime.

This blog is about some of the methods and approaches that will help to make your life, and the life of your staff, easier and more efficient, and will increase the reliability of your network and systems.

Oh, and also see the Network Design Analysis blog by Phil.

Tools and Procedures

Here are some of the tools and procedures that we suggest you put in place. Whilst this list is not exhaustive, it's a good starting point and should inspire other ideas.

  • Diagrams - It's always good to have a picture of your system. Depending on the complexity, it may well be worth having several pictures at different levels of abstraction, or perhaps different logical "partitions" of the network. Whilst it is invariably tricky to keep the documentation up to date as things change, this shouldn't be an excuse for leaving the system undocumented.

  • Physical Space Management - When planning and future-proofing the network, be sure to get accurate Architectural Drawings for each room so you are completely aware of any physical limitations that may be imposed by the layout of the room. Is there a false floor? Is the room tall enough to allow for overhead cable runs? Are there any floor loading issues?

  • Power Management - Reducing power consumption, and generally managing power usage, is going to become an ever more important component of system design as global warming becomes an ever hotter topic. Does the existing design allow for granular and remote measurement of power consumption?

  • Address Space Management - Have some sort of database, even if it's only a spreadsheet, which encompasses ALL of your address space allocations. For more complex networks, this might also encompass IPv6 Address Spaces, BGP AS Numbers and even OSPF Router IDs, although the latter are usually dealt with by matching each Router ID to a Switch Loopback Address.

  • VLAN Space Management - Again - have some sort of database, even if it's only a spreadsheet - showing VLAN ID allocations, logical names and descriptions across the Enterprise. Be consistent in the use of the same logical VLAN names for the same VLAN ID across multiple devices, and where the logical name can be configured on the network device itself, make that match the database.

  • Equipment Labelling - It's always good to label things. Dymo is your friend. Label the devices themselves with names, and where appropriate, IP Address. More than this, add meaningful "description" fields to interfaces in Switch configurations so that their purpose is as clear as possible. You may also want to include the device/interface name of the "peer" device in interface descriptions, even if somewhat abbreviated, especially when the "peer" device doesn't support a Discovery Protocol like CDP.
    Another useful label which can be applied is the "location" field configured on a Switch and which is generally visible from either CLI access or an SNMP Management Station. This can potentially detail the Room, Rack, and even the location of the device within the Rack. This allows an on-site Engineer to rapidly find the device, even if he/she is unfamiliar with the Room or Rack. Even simple items like door labels for the equipment rooms shouldn't be overlooked; we had one experience where both the doors and the door labels were in near identical shades of grey which made identifying the rooms themselves initially somewhat challenging.

  • Cable Labelling - Cables themselves should also be labelled; each cable should be numbered with a unique ID, using either push-on cable numbers or Dymo-like labels that can be wrapped around the cable, and the IDs should be tied back to a database which identifies the cable's endpoints. This will help to avoid the situations discussed here: From Rack to Wreck

  • Discovery - Switch manufacturers - and indeed Server or Hypervisor suppliers - provide various "discovery" protocols whereby each node can regularly tell its neighbour who it is and which ports are interconnected. The classic examples are CDP (Cisco Discovery Protocol), which is supported by Cisco and VMware amongst many others, and LLDP (Link Layer Discovery Protocol), which is an IEEE-standardised protocol supported on more recent Cisco hardware as well as by many other manufacturers. These protocols, suitably deployed, allow one to build a topological map of the live network. On rare occasions, this map may appear to conflict with either the current Diagrams, or even the Labelling on the Switch interfaces themselves; spotting such discrepancies allows one to correct the Diagrams and the Switch Labelling to match reality.

  • Logical Names and Comments - Firewalls often allow one to build firewall rules based purely on IP Addresses. In the long term, this can be confusing for someone coming afresh to the configuration. If you can base the rules on named Groups which themselves contain sets of IP Addresses, and which themselves have a meaningful name, it becomes much easier to decode the overall purpose of the rule in the future. Similarly, if the software allows for adding a Comment against a rule, use the feature enthusiastically.

  • Logging - This warrants a whole tome to itself, but getting each device to log to a central location where the logs can be easily searched will help in clarifying the nature of a fault when it does occur. See also What to Log

  • DNS Entries - Whilst Windows servers tend to be automatically added to DNS, the same is rarely true for network infrastructure equipment such as Switches, Routers and Firewalls. More complex Switches may also have many different IP Addresses bound to them, but there will usually be one that is used by any Network Management Monitoring software. At the very least, create a DNS entry for each device against that management address so that simple Ping tests or just SSH Access can be made against it by name rather than having to remember some relatively obscure IP Address.

  • Deallocation - The history of your network may involve several different migrations or amalgamations. At each stage, be sure to properly remove any references to legacy systems and devices. Examples of this might be removing references to an old ISP's Address Allocations should you move ISP, or removing DNS and IP Address Allocations for a device once it is decommissioned. All these acts of cleanliness help to reduce confusion for anyone trouble-shooting the network as time goes on.
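
As a rough illustration of how the Address Space and VLAN Space databases above can be put to work, here is a minimal Python sketch that checks an allocation list for overlapping prefixes and compares the VLAN names configured on each device against the central database. All of the data, device names and function names here are hypothetical; in practice the records would come from your own spreadsheet export and from whatever your switches report.

```python
import ipaddress
from itertools import combinations

# Hypothetical allocation records, e.g. rows exported from a spreadsheet.
ALLOCATIONS = [
    {"prefix": "10.10.0.0/24", "purpose": "Server VLAN"},
    {"prefix": "10.10.1.0/24", "purpose": "Management"},
    {"prefix": "10.10.1.128/25", "purpose": "Printers"},
    {"prefix": "2001:db8:10::/64", "purpose": "Server VLAN (IPv6)"},
]

# Hypothetical central VLAN database: VLAN ID -> logical name.
VLAN_DATABASE = {10: "SERVERS", 20: "MGMT"}

# Hypothetical VLAN names as configured on each device.
DEVICE_VLANS = {
    "sw-core-01": {10: "SERVERS", 20: "MGMT"},
    "sw-access-07": {10: "SERVERS", 20: "MANAGEMENT", 30: "LAB"},
}


def find_overlaps(allocations):
    """Return pairs of allocation records whose prefixes overlap."""
    parsed = [(ipaddress.ip_network(a["prefix"]), a) for a in allocations]
    overlaps = []
    for (net_a, rec_a), (net_b, rec_b) in combinations(parsed, 2):
        # Only compare like with like: an IPv4 prefix cannot overlap an IPv6 one.
        if net_a.version == net_b.version and net_a.overlaps(net_b):
            overlaps.append((rec_a, rec_b))
    return overlaps


def find_vlan_name_mismatches(database, device_vlans):
    """Return (device, vlan_id, configured_name, expected_name) per mismatch.

    expected_name is None when the VLAN ID is missing from the database.
    """
    mismatches = []
    for device, vlans in sorted(device_vlans.items()):
        for vlan_id, name in sorted(vlans.items()):
            expected = database.get(vlan_id)
            if name != expected:
                mismatches.append((device, vlan_id, name, expected))
    return mismatches


if __name__ == "__main__":
    for rec_a, rec_b in find_overlaps(ALLOCATIONS):
        print(f"Overlap: {rec_a['prefix']} ({rec_a['purpose']}) "
              f"and {rec_b['prefix']} ({rec_b['purpose']})")
    for device, vlan_id, name, expected in find_vlan_name_mismatches(
            VLAN_DATABASE, DEVICE_VLANS):
        print(f"{device}: VLAN {vlan_id} is named {name!r}, "
              f"database says {expected!r}")
```

Run periodically, a check along these lines flags a prefix that has been allocated twice, or a VLAN that has been named differently on one switch, before either becomes a puzzle during fault finding. A spreadsheet with a careful owner can achieve the same result; the point is simply that the database is only useful if it is compared against reality.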

In summary

Designing and building a network that operates smoothly under normal circumstances shouldn't blind you to the need for accurate documentation and diagnostic tools. Make as much of both as possible available to the Engineering team, to assist in the event of any problems and to help in planning for the future.


It's Good To Talk

Layer3 Systems has spent 25 years developing solutions to a wide range of challenges, often providing assistance to those companies that are working hard to make their systems work.

Layer3 can be relied upon to help drive the way forward. When you get to the point where you need some spare capacity to support and enhance your business operation you can turn to Layer3 Systems.

If you want to remove problems, reduce uncertainties and improve performance of your systems or networks, or if you're simply looking for additional resources to assist, call us on 020 3805 7795 or email us. We can help!

